This commit reimplements how we monitor Sidekiq processes that are
forked from the Unicorn master process. Prior to this change, we relied on
`Jobs::Heartbeat` to enqueue a `Jobs::RunHeartbeat` job every 3 minutes.
The `Jobs::RunHeartbeat` job then set a Redis key with a timestamp, and the
Unicorn master process fetched that timestamp from Redis every 30 minutes.
If the timestamp had not been updated for more than 30 minutes, we
restarted the Sidekiq process. The fundamental flaw with this approach is
that it fails to consider deployments with multiple hosts and multiple
Sidekiq processes. A Sidekiq process on one host may be in a bad state, but
the heartbeat check will not restart it because the `Jobs::RunHeartbeat`
job is still being executed by the working Sidekiq processes on other hosts.
In order to properly ensure that stuck Sidekiq processes are restarted,
we now rely on the [Sidekiq::ProcessSet](https://github.com/sidekiq/sidekiq/wiki/API#processes)
API that is supported by Sidekiq. The API provides us with "near real-time (updated every 5 sec)
info about the current set of Sidekiq processes running". The API
provides useful information like the hostname, pid and also when Sidekiq
last did its own heartbeat check. With that information, we can easily
determine if a Sidekiq process needs to be restarted from the Unicorn
master process.
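As a hedged sketch (names and the threshold are illustrative, not the exact
Discourse implementation), the check run from the Unicorn master could look
like this:

```
require "sidekiq/api"
require "socket"

HEARTBEAT_TIMEOUT = 30 * 60 # seconds; mirrors the old 30 minute check

# Returns pids of Sidekiq processes on this host whose last heartbeat is
# older than the timeout.
def stuck_sidekiq_pids(hostname)
  Sidekiq::ProcessSet.new.filter_map do |process|
    next unless process["hostname"] == hostname
    # "beat" is the epoch time of the process's last Sidekiq heartbeat
    process["pid"] if Time.now.to_f - process["beat"] > HEARTBEAT_TIMEOUT
  end
end

stuck_sidekiq_pids(Socket.gethostname).each do |pid|
  Process.kill("KILL", pid) # the master then forks a replacement
end
```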
In 6cafe59c76, we added a monkey patch to
`Unicorn::HttpServer#murder_lazy_workers` to log a message and send a
`USR2` signal to the Unicorn worker process when the worker is 2 seconds
away from being timed out by the Unicorn master process. However, we ended
up logging multiple messages and sending multiple `USR2` signals during
the 2 seconds before the Unicorn worker process hit the timeout.
To overcome this problem, we now set an instance variable on the
`Unicorn::Worker` instance and use it to ensure that the message is only
logged once and the `USR2` signal is only sent once as well.
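A minimal sketch of the flag (method and variable names are illustrative):

```
# Inside the patched Unicorn::HttpServer#murder_lazy_workers loop: warn a
# worker exactly once when it is close to being timed out.
def notify_timing_out_worker(wpid, worker)
  return if worker.instance_variable_get(:@timeout_notified)

  logger.error("worker=#{worker.nr} pid=#{wpid} is close to timing out, sending USR2")
  Process.kill(:USR2, wpid)
  worker.instance_variable_set(:@timeout_notified, true)
end
```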
In our production environment, we have been seeing Sidekiq processes
getting stuck randomly when a USR1 signal is sent to the Unicorn master
process. We have not been able to identify the root cause of why the
Sidekiq process gets stuck. We however noticed that when the Unicorn
master process receives a USR1 signal, it will reopen the logs for the
Unicorn master process first before sending a USR1 signal for the
Unicorn worker processes to reopen the logs. We figured that we should
do the same for the Sidekiq process when a USR1 signal is received.
In this commit, we introduce an arbitrary delay of 1 second before the
Sidekiq process reopens its log files, so as to allow enough time for the
Unicorn master to finish reopening its logs first.
We also do not reopen logs for the Sidekiq process if the
`DISCOURSE_LOG_SIDEKIQ` env variable is not present, because there are no
logs to reopen in that case.
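A hedged sketch of the Sidekiq-side handler (the exact wiring may differ):

```
if ENV["DISCOURSE_LOG_SIDEKIQ"]
  Signal.trap("USR1") do
    # Do as little as possible in the handler itself; sleep in a thread so
    # the trap context returns immediately.
    Thread.new do
      sleep 1 # give the Unicorn master time to reopen its own logs first
      Unicorn::Util.reopen_logs
    end
  end
end
```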
This commit updates `Demon::Base#start` to call `Discourse.before_fork`
before forking. According to the docs in `mini_racer`, we need to
"Dispose manually of all MiniRacer::Context objects prior to forking".
This commit is motivated by a segmentation fault which we are seeing in
production when killing a daemon process. The backtrace of the core dump
includes traces of `mini_racer`, so we think this is the cause. Note that
we are not 100% sure if this will fix the issue.
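The shape of the change is roughly this (a sketch; everything except
`Discourse.before_fork` is illustrative):

```
class Demon::Base
  def start
    Discourse.before_fork # dispose of MiniRacer::Context objects pre-fork
    @pid = fork { run }   # `run` stands in for the demon's entry point
  end
end
```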
This commit rewrites `DiscourseLogstashLogger` to not be an instance
of `LogstashLogger`. We don't want it to be an instance of
`LogstashLogger` because we want the new logger to be chained to
Logster's logger, which can then pass down useful information like the
request's env and error backtraces which Logster has already gathered.
Note that this commit does not bother to maintain backwards
compatibility and drops the `LOGSTASH_URI` and `UNICORN_LOGSTASH_URI`
ENV variables which were previously used to configure the destination to
which `logstash-logger` would send logs. Instead, we introduce
the `ENABLE_LOGSTASH_LOGGER` ENV variable to replace both and remove
the need for the log paths to be specified. Note that the previous
feature was considered experimental as stated in d888d3c54c
and the new feature should be considered experimental as well. The code
may be moved into a plugin in the future.
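The intended wiring looks roughly like this (hedged; the constructor call
is illustrative, but `Logster::Logger#chain` is the mechanism relied on):

```
if ENV["ENABLE_LOGSTASH_LOGGER"]
  # Rails.logger is a Logster::Logger; chained loggers receive every
  # message along with the request env and backtraces Logster gathered.
  Rails.logger.chain(DiscourseLogstashLogger.logger(type: :rails))
end
```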
This commit introduces the following changes:
1. Introduces the `SignalTrapLogger` singleton which starts a single
thread that polls a queue to log messages with the specified logger (see
the sketch after this list). This thread is necessary because most
loggers cannot be used inside the `Signal.trap` context as they rely on
mutexes which are not allowed within the context.
2. Moves the monkey patch in `freedom_patches/unicorn_http_server_patch.rb` to
`config/unicorn.conf.rb` which is already monkey patching
`Unicorn::HttpServer`.
3. `Unicorn::HttpServer` will now automatically send a `USR2` signal to
a unicorn worker 2 seconds before the worker is timed out by the
Unicorn master.
4. When a Unicorn worker receives a `USR2` signal, it will now log only
the main thread's backtrace to `Rails.logger`. Previously, it was
`puts`-ing the backtraces to `STDOUT`, which most people wouldn't read.
Logging via `Rails.logger` makes the backtraces easily accessible via
`/logs`.
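Here is a minimal sketch of the `SignalTrapLogger` idea referenced above
(simplified; not the exact implementation):

```
require "singleton"

class SignalTrapLogger
  include Singleton

  def initialize
    @queue = Queue.new
  end

  # Safe to call from inside Signal.trap: pushing onto a Queue does not
  # take a mutex the way most loggers do.
  def log(logger, message, level: :error)
    @queue << [logger, message, level]
  end

  def start
    @thread ||= Thread.new do
      loop do
        logger, message, level = @queue.pop # blocks until work arrives
        logger.public_send(level, message)
      end
    end
  end
end
```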
This commit updates all Sidekiq signal handling event logs to go through
Unicorn's logger instead of logging to STDOUT. Going through a proper
logger means the log messages are emitted in the format the logger has
configured, which gives us proper timestamps.
This commit updates various spots in `config/unicorn.conf.rb` which were
doing `STDERR.puts` to either use `server.logger` which is unicorn's
logger or `Rails.logger` which is Rails' logger. The reason we want to
do so is because `STDERR.puts` doesn't format the logs properly, which is
a problem especially when custom loggers with structured formatting are
enabled.
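For example (the message is illustrative):

```
# Before: STDERR.puts "master loop error"  (no timestamp, no severity)
# After: formatted and timestamped by whatever logger is configured
server.logger.error("master loop error")
```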
This commit adds a `DISCOURSE_DUMP_BACKTRACES_ON_UNICORN_WORKER_TIMEOUT`
environment variable that allows us to dump backtraces for all threads of
a Unicorn worker 2 seconds before it times out. In development,
backtraces are dumped to `STDOUT`, and in production we dump them to
`unicorn.stdout.log`.
We want to dump all the backtraces to make it easier to identify the
cause of a Unicorn worker timing out.
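A hedged sketch of the dump itself:

```
if ENV["DISCOURSE_DUMP_BACKTRACES_ON_UNICORN_WORKER_TIMEOUT"]
  Thread.list.each do |thread|
    $stdout.puts "---- #{thread.inspect} (status: #{thread.status}) ----"
    lines = thread.backtrace || ["<no backtrace>"]
    $stdout.puts lines.join("\n")
  end
end
```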
Also:
* Remove an unused method (#fill_email)
* Replace a method that was used just once (#generate_username) with `SecureRandom.alphanumeric`
* Remove obsolete dev puma `tmp/restart` file logic
* File.exists? is deprecated and removed in Ruby 3.2 in favor of
File.exist?
* Dir.exists? is deprecated and removed in Ruby 3.2 in favor of
Dir.exist?
This commit makes a few changes to improve boot time in development environments. It will have no effect on production boot times.
- Skip the SchemaCache warmup. In development mode, the SchemaCache is refreshed every time there is a code change, so warmup is of limited use.
- Skip warming up PrettyText. This adds ~2s to each web worker's boot time. The vast majority of requests do not use PrettyText, so it is more efficient to defer its warmup until it's needed.
- Skip the intentional 1 second pause during Unicorn worker forking. The comment (which also exists in Unicorn's documentation) suggests this works around a Unix signal handling bug, but I haven't been able to locate any more information. Skipping it in dev will significantly speed up boot. If we start to see issues, we can revert this change.
On my machine, this improves `/bin/unicorn` boot time from >10s to ~4s.
This allows plugins to call `register_demon_process` with a Class inheriting from Demon::Base. The unicorn master process will take care of spawning, monitoring and restarting the process. This API should be used with extreme caution, but it is significantly cleaner than spawning processes/threads in an `after_initialize` block.
This commit also cleans up the demon spawning logging so that it uses the same format as unicorn worker logging. It also switches to the block form of `fork` to ensure that Demons exit after running, rather than returning execution to where the fork took place.
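Hypothetical plugin usage (the demon class and its body are illustrative,
and `prefix`/`after_fork` are the assumed `Demon::Base` contract):

```
# In a plugin's plugin.rb:
class Demon::CacheWarmer < Demon::Base
  def self.prefix
    "cache-warmer" # used in process titles and log lines
  end

  def after_fork
    loop do
      # ... periodic work goes here ...
      sleep 60
    end
  end
end

register_demon_process(Demon::CacheWarmer)
```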
Useful if you want to, say, have your unicorn listen on a Unix domain
socket, rather than a TCP port, or you want to be able to bind to a
single address other than 127.0.0.1.
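For example, with standard unicorn config syntax (paths and addresses are
illustrative):

```
# config/unicorn.conf.rb
listen "/var/run/discourse/unicorn.sock" # unix domain socket
listen "10.0.0.5:3000"                   # a single non-loopback address
```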
Demon::EmailSync is used in conjunction with `SiteSetting.enable_imap` to
sync N IMAP mailboxes with specific groups. It is a process started in
unicorn.conf, and it spawns N threads (one for each multisite connection)
and for each database spawns another N threads (one for each configured
group); a rough sketch of this thread layout follows below.
We want this off by default so the process is not started when it does not
need to be (e.g. development, test, certain hosting tiers).
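Heavily hedged (the group query and `sync_group` are hypothetical):

```
RailsMultisite::ConnectionManagement.all_dbs.each do |db|
  Thread.new do
    # one thread per multisite connection...
    RailsMultisite::ConnectionManagement.with_connection(db) do
      Group.where.not(imap_mailbox_name: "").find_each do |group|
        # ...and one thread per configured group within that database
        Thread.new { sync_group(group) }
      end
    end
  end
end
```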
Unicorn uses the USR1 signal to indicate that log files should be reopened. This commit implements the same functionality for our forked sidekiq workers (the master side is sketched after the list):
- USR1 is intercepted in the unicorn master, and re-issued to all child processes
- USR1 is trapped in the sidekiq processes, and `Unicorn::Util.reopen_logs` is used to re-open log files
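Sketched for the master side (simplified; assumes the master tracks the
pids of its forked sidekiq children, and real code must preserve unicorn's
own USR1 handling):

```
Signal.trap("USR1") do
  sidekiq_pids.each { |pid| Process.kill("USR1", pid) }
end
```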
We preload to ensure as much memory as possible is reused from unicorn master
to various workers using copy-on-write (sidekiq, unicorn)
This migrates the preloading code into the Discourse module for easier
reuse and adds 3 notable preloading changes (a hedged sketch of the first
follows the list):
1. We attempt to localize a string on each site, ensuring we warm up
the i18n subsystem.
2. We preload all our templates (compiling .erb to classes).
3. We warm up our search tokenizer, which uses cppjieba, a large memory
consumer. This warmup only happens on CJK sites or sites with the special
site setting enabled.
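The first change, sketched (the string key is illustrative):

```
RailsMultisite::ConnectionManagement.each_connection do
  I18n.t(:posts) # touch one translation per site so the i18n backend loads
end
```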
This makes sure that all processes that fork off the master have a fully
operational schema cache.
In Rails 6, the schema cache is bolted to the connection pool. This change
ensures the cache on all pools is fully populated prior to forking.
The bolting of the cache to the connection pool does lead to some strange
cases where a connection can "steal" the cache from another connection,
which can cause things to hang or deadlock. This change minimizes the risk
of that happening because the cache is already primed.
We make a STRONG assumption that the schema is always the same on all
sites when we spin up a multisite cluster.
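Roughly (a hedged sketch using the Rails `SchemaCache#add` API, not the
exact Discourse code):

```
ActiveRecord::Base.connection_handler.connection_pool_list.each do |pool|
  connection = pool.connection
  connection.tables.each do |table|
    connection.schema_cache.add(table) # load columns/primary keys now
  end
end
```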
This reverts commit e805d44965.
We now have mechanisms in place to ensure the heartbeat will always
be scheduled even if the scheduler is overloaded, per 098f938b.
* FIX: Heartbeat check per sidekiq process
* Rename method
* Remove heartbeat queues of previous bootups
* Regis feedback
* Refactor before_start
* Update lib/demon/sidekiq.rb
Co-Authored-By: Régis Hanol <regis@hanol.fr>
* Update lib/demon/sidekiq.rb
Co-Authored-By: Régis Hanol <regis@hanol.fr>
* Expire redis keys after 3600 seconds
* Don't use redis to store the list of queues
v8 forking is not supported and can lead to memory leaks.
This commit handles the most common case, which is the unicorn master
forking. There are still some cases related to backups where we fork;
however, those forks are usually short lived, so the memory leak is not
severe. Burning the contexts in the master process could break the
sidekiq or web processes that do the actual forking.
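In sketch form (hedged; `@ctx` stands in for any long-lived context we
hold onto):

```
module Discourse
  def self.before_fork
    # Dispose of all MiniRacer::Context objects prior to forking, per the
    # mini_racer docs.
    @ctx&.dispose
    @ctx = nil
  end
end
```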
This reduces the chances of errors where consumers of strings mutate
inputs, and reduces the memory usage of the app.
The test suite passes now, but there may be some stuff left, so we will
run a few sites on a branch prior to merging.
I get this error if I stop a dev server, `rm -rf tmp`, and start it again:
```
`mkdir': No such file or directory @ dir_s_mkdir - /Users/angusmcleod/discourse/discourse/tmp/pids (Errno::ENOENT)
```
This fixes it.
See: f3549291a3 (diff-26ac62db6c6a4582de3bbf2615790c23R22)
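The usual fix, sketched (assuming the pid directory is created with
`FileUtils.mkdir_p`):

```
require "fileutils"

# mkdir fails if tmp/ itself is missing; mkdir_p creates the full path.
FileUtils.mkdir_p(Rails.root.join("tmp", "pids"))
```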
This commit also cleans up a bunch of pointless noise each time we boot
the app:
- narrative was loading i18n, causing redefinition of consts
- discourse.rb was loaded twice, as was auth
- bin/unicorn now does all the smart things and boots unicorn in dev
- bin/rails s will boot unicorn with no params
- remove bin/puma, which only causes confusion