Commit Graph

18 Commits

Author SHA1 Message Date
Michael Fitz-Payne 1867202a4d DEV(cache_critical_dns): add option to run once and exit
There are situations where a container running Discourse may want to
cache the critical DNS services without running the cache_critical_dns
service, for example running migrations prior to running a full bore
application container.

Add a `--once` argument for the cache_critical_dns script that will
only execute the main loop once, and return the status code for the
script to use when exiting. 0 indicates no errors occured during SRV
resolution, and 1 indicates a failure during the SRV lookup.

Nothing is reported to prometheus in run_once mode. Generally this
mode of operation would be a part of a unix pipeline, in which the exit
status is a more meaningful and immediate signal than a prometheus metric.

The reporting has been moved into it's own method that can be called
only when the script is running as a service.

See /t/69597.
2022-07-06 14:53:02 +10:00
Michael Fitz-Payne aabbc9e63e DOC(cache_critical_dns): add program description
Describes the behaviour and configuration of the cache_critical_dns
script, mainly cribbed from commit messages. Tries to make this program
a bit less of an enigma.
2022-05-26 14:26:57 +10:00
Michael Fitz-Payne 0553788d3b DEV(cache_critical_dns): improve postgres_healthcheck
The `PG::Connection#ping` method is only reliable for checking if the
given host is accepting connections, and not if the authentication
details are valid.

This extends the healthcheck to confirm that the auth details are
able to both create a connection and execute queries against the
database.

We expect the empty query to return an empty result set, so we can
assert on that. If a failure occurs for any reason, the healthcheck will
return false.
2022-05-24 08:20:10 +10:00
Gabe Pacuilla 4284ba9c27
FIX(cache_critical_dns): use correct DISCOURSE_DB_USERNAME envvar (#16862) 2022-05-18 13:01:18 -04:00
Gabe Pacuilla 9f246e6969
FIX(cache_critical_dns): use discourse database name and user by default (#16856) 2022-05-17 16:09:32 -04:00
Michael Fitz-Payne 35d5c29e10 DEV(cache_critical_dns): add SRV priority tunables
An SRV RR contains a priority value for each of the SRV targets that
are present, ranging from 0 - 65535. When caching SRV records we may want to
filter out any targets above or below a particular threshold.

This change adds support for specifying a lower and/or upper bound on
target priorities for any SRV RRs. Any targets returned when resolving
the SRV RR whose priority does not fall between the lower and upper
thresholds are ignored.

For example: Let's say we are running two Redis servers, a primary and
cold server as a backup (but not a replica). Both servers would pass health
checks, but clearly the primary should be preferred over the backup
server. In this case, we could configure our SRV RR with the primary
target as priority 1 and backup target as priority 10. The
`DISCOURSE_REDIS_HOST_SRV_LE` could then be set to 1 and the target with
priority 10 would be ignored.

See /t/66045.
2022-05-12 08:08:56 +10:00
Michael Fitz-Payne 1acc4751ff
FIX: remove refresh seconds override on cache_critical_dns (#16572)
This removes the option to override the sleep time between caching of
DNS records. The override was invalid because `''.to_i` is 0 in Ruby,
causing a tight loop calling the `run` method.
2022-04-27 12:42:35 +08:00
Michael Fitz-Payne 0784c28702 FIX: cache_critical_dns - add TLS support for Redis healthcheck
For Redis connections that operate over TLS, we need to ensure that we
are setting the correct arguments for the Redis client. We can utilise
the existing environment variable `DISCOURSE_REDIS_USE_SSL` to toggle
this behaviour.

No SSL verification is performed for two reasons:
- the Discourse application will perform a verification against any FQDN
  as specified for the Redis host
- the healthcheck is run against the _resolved_ IP address for the Redis
  hostname, and any SSL verification will always fail against a direct
  IP address

If no SSL arguments are provided, the IP address is never cached against
the hostname as no healthy address is ever found in the HealthyCache.
2022-04-27 12:27:58 +10:00
Michael Fitz-Payne c4ea439cc3 DEV: refactor cache_critical_dns for SRV RR awareness
Modify the cache_critical_dns script for SRV RR awareness. The new
behaviour is only enabled when one or more of the following environment
variables are present (and only for a host where the `DISCOURSE_*_HOST_SRV`
variable is present):
- `DISCOURSE_DB_HOST_SRV`
- `DISCOURSE_DB_REPLICA_HOST_SRV`
- `DISCOURSE_REDIS_HOST_SRV`
- `DISCOURSE_REDIS_REPLICA_HOST_SRV`

Some minor changes in refactor to original script behaviour:
- add Name and SRVName classes for storing resolved addresses for a hostname
- pass DNS client into main run loop instead of creating inside the loop
- ensure all times are UTC
- add environment override for system hosts file path and time between DNS
  checks mainly for testing purposes

The environment variable for `BUNDLE_GEMFILE` is set to enables Ruby to
load gems that are installed and vendored via the project's Gemfile.
This script is usually not run from the project directory as it is
configured as a system service (see
71ba9fb7b5/templates/cache-dns.template.yml (L19))
and therefore cannot load gems like `pg` or `redis` from the default
load paths. Setting this environment variable configures bundler to look
in the correct project directory during it's setup phase.

When a `DISCOURSE_*_HOST_SRV` environment variable is present, the
decision for which target to cache is as follows:
- resolve the SRV targets for the provided hostname
- lookup the addresses for all of the resolved SRV targets via the
  A and AAAA RRs for the target's hostname
- perform a protocol-aware healthcheck (PostgreSQL or Redis pings)
- pick the newest target that passes the healthcheck

From there, the resolved address for the SRV target is cached against
the hostname as specified by the original form of the environment
variable.

For example: The hostname specified by the `DISCOURSE_DB_HOST` record
is `database.example.com`, and the `DISCOURSE_DB_HOST_SRV` record is
`database._postgresql._tcp.sd.example.com`. An SRV RR lookup will return
zero or more targets. Each of the targets will be queried for A and AAAA
RRs. For each of the addresses returned, the newest address that passes
a protocol-aware healthcheck will be cached. This address is cached so
that if any newer address for the SRV target appears we can perform a
health check and prefer the newer address if the check passes.

All resolved SRV targets are cached for a minimum of 30 minutes in memory
so that we can prefer newer hosts over older hosts when more than one target
is returned. Any host in the cache that hasn't been seen for more than 30
minutes is purged.

See /t/61485.
2022-04-27 10:14:33 +10:00
David Taylor 9e43f0303d
DEV: Include DISCOURSE_REDIS_REPLICA_HOST in cache_critical_dns (#15877)
This is the replacement for DISCOURSE_REDIS_SLAVE_HOST
2022-02-09 14:41:26 +00:00
Neil Lalonde 530058918e
DEV: Support env var for prometheus port in cache_critical_dns 2020-08-17 15:48:14 -04:00
Michael Brown 7200653e16 FIX: cache_critical_dns was erroring without IPAddr
* sometimes cache_critical_dns would error out since "IPAddr" was
  undefined
* sometimes it autoloaded, so no error
2019-12-27 12:39:08 -05:00
Michael Brown 7b1783bae8 FIX: cache_critical_dns was never caching pg replica (#7461)
* it's DISCOURSE_DB_REPLICA_HOST not DISCOURSE_DB_BACKUP_HOST
2019-04-30 08:42:51 +08:00
Sam 6acabec423 FIX: script was missing newlines when generating hosts 2018-11-28 15:18:08 +11:00
Sam 6d9d904df5 add missing newline to end of file 2018-11-23 15:43:27 +11:00
Sam d7b0f0069c no need to double strip this line 2018-11-23 14:48:02 +11:00
Sam 4c6eeaac15 Followup on 0739c3b1d1
This corrects some minor style issues
2018-11-23 14:43:52 +11:00
Sam 0739c3b1d1 DEV: this introduces a script capable of caching critical DNS locally
This is useful for cases where you want to add resiliency to DNS lookups
for redis and postgres, so they will continue to work even if there is
a DNS outage
2018-11-22 18:46:59 +11:00