PERF: Avoid full `posts` table scans during anonymisation (#21081)

2e78045a fixed the anonymisation job so that it correctly updates self-mentions, which are not logged in the post_actions table. That fix scanned the entire `posts` table with a `raw ILIKE` query. On sites with many posts, this can take a very long time.

This commit updates the job to take a two-pass approach:

First, we update posts based on the post_actions table. This is much more efficient than a full table scan, and takes care of all 'non-self' mentions.

Then, we make a second pass using the `raw ILIKE` approach. Since the first pass already took care of most posts, we can scope this one down to self-mentions only. Filtering the query to a specific posts.user_id makes it significantly faster than a full table scan.
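
The two passes described above can be sketched in plain Ruby, without ActiveRecord. This is only an illustration under stated assumptions: `FakePost`, `mention_index`, and `posts_to_update` are hypothetical names that do not exist in the codebase. The hash stands in for the indexed mention lookup, and the array stands in for the `posts` table.

```ruby
require "set"

# Hypothetical in-memory stand-in for a row of the `posts` table.
FakePost = Struct.new(:id, :user_id, :raw)

# Returns the ids of posts that need rewriting when `user_id` is renamed.
# `mention_index` (hypothetical) maps a mentioned user's id to the ids of
# posts that mention them; it plays the role of the indexed join in pass 1.
def posts_to_update(posts, mention_index, user_id, old_username)
  by_id = posts.to_h { |post| [post.id, post] }
  updated_post_ids = Set.new

  # Pass 1: indexed lookup of recorded mentions, no text matching needed.
  mention_index.fetch(user_id, []).each do |post_id|
    updated_post_ids << post_id if by_id.key?(post_id)
  end

  # Pass 2: case-insensitive substring match (like ILIKE '%@name%'),
  # restricted to posts written by the user being renamed.
  needle = "@#{old_username}".downcase
  posts.each do |post|
    next unless post.user_id == user_id
    updated_post_ids << post.id if post.raw.downcase.include?(needle)
  end

  updated_post_ids
end

posts = [
  FakePost.new(1, 10, "Hello @alice, welcome!"),
  FakePost.new(2, 20, "Note to self: @Alice should fix this"),
  FakePost.new(3, 20, "No mentions here"),
  FakePost.new(4, 30, "Unrelated post"),
]
mention_index = { 20 => [1] } # user 20 ("alice") is mentioned in post 1

result = posts_to_update(posts, mention_index, 20, "alice")
```

In this sketch the text match only ever inspects the renamed user's own posts (2 and 3), while post 1 is found through the index alone. That is the point of the split: the expensive comparison runs over one author's posts rather than the whole table.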
David Taylor 2023-04-12 18:39:10 +01:00 committed by GitHub
parent fa5a423681
commit 93c33e02f0
1 changed file with 12 additions and 0 deletions

@@ -46,9 +46,21 @@ module Jobs
     def update_posts
       updated_post_ids = Set.new
 
+      # Other people mentioning this user
+      Post
+        .with_deleted
+        .joins(mentioned("posts.id"))
+        .where("a.user_id = :user_id", user_id: @user_id)
+        .find_each do |post|
+          update_post(post)
+          updated_post_ids << post.id
+        end
+
+      # User mentioning self (not included in post_actions table)
       Post
         .with_deleted
         .where("raw ILIKE ?", "%@#{@old_username}%")
+        .where("posts.user_id = :user_id", user_id: @user_id)
         .find_each do |post|
           update_post(post)
           updated_post_ids << post.id