discourse/app/jobs/scheduled/clean_up_uploads.rb

# frozen_string_literal: true

module Jobs
  class CleanUpUploads < ::Jobs::Scheduled
    every 1.hour

    def execute(args)
      grace_period = [SiteSetting.clean_orphan_uploads_grace_period_hours, 1].max

      # Always remove invalid upload records regardless of clean_up_uploads setting.
      Upload
        .by_users
        .where(
          "retain_hours IS NULL OR created_at < current_timestamp - interval '1 hour' * retain_hours",
        )
        .where("created_at < ?", grace_period.hour.ago)
        .where(url: "")
        .find_each(&:destroy!)

      return unless SiteSetting.clean_up_uploads?

      # Do nothing if the last cleanup was run too recently.
      last_cleanup_timestamp = last_cleanup
      if last_cleanup_timestamp.present? &&
           (Time.zone.now.to_i - last_cleanup_timestamp) < (grace_period / 2).hours
        return
      end

      result = Upload.by_users
      Upload.unused_callbacks&.each { |handler| result = handler.call(result) }
      result =
        result
          .where(
            "uploads.retain_hours IS NULL OR uploads.created_at < current_timestamp - interval '1 hour' * uploads.retain_hours",
          )
          .where("uploads.created_at < ?", grace_period.hour.ago)
          # Don't remove any secure uploads.
          .where("uploads.access_control_post_id IS NULL")
          .joins("LEFT JOIN upload_references ON upload_references.upload_id = uploads.id")
          # Don't remove any uploads linked to an UploadReference.
          .where("upload_references.upload_id IS NULL")
          .with_no_non_post_relations

      result.find_each do |upload|
        next if Upload.in_use_callbacks&.any? { |callback| callback.call(upload) }
        upload.sha1.present? ? upload.destroy : upload.delete
      end

      ExternalUploadStub.cleanup!

      self.last_cleanup = Time.zone.now.to_i
    end

    def last_cleanup=(timestamp)
      Discourse.redis.setex(last_cleanup_key, 7.days.to_i, timestamp.to_s)
    end

    def last_cleanup
      timestamp = Discourse.redis.get(last_cleanup_key)
      timestamp ? timestamp.to_i : timestamp
    end

    def reset_last_cleanup!
      Discourse.redis.del(last_cleanup_key)
    end

    protected

    def last_cleanup_key
      "LAST_UPLOAD_CLEANUP"
    end
  end
end
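
The two plugin hooks used in `execute` can be read as a two-stage filter: each `unused_callbacks` handler receives the whole candidate set and returns a narrowed one, and each `in_use_callbacks` entry can veto a single upload. A minimal sketch of that flow in plain Ruby follows. `FakeUpload`, `removable_uploads`, and the example lambdas are hypothetical stand-ins, not part of Discourse, and arrays take the place of the ActiveRecord relation the job actually works with.

```ruby
# Hypothetical stand-in for an upload row; this is not the real Upload model.
FakeUpload = Struct.new(:id, :sha1, :tag, keyword_init: true)

# Bulk handlers take the whole candidate collection and return a narrowed one,
# mirroring `result = handler.call(result)` in the job.
unused_callbacks = [
  ->(uploads) { uploads.reject { |u| u.tag == :plugin_asset } },
]

# Per-record callbacks return true when an upload is still in use.
in_use_callbacks = [
  ->(upload) { upload.id == 3 },
]

# Apply the bulk handlers first, then drop every record that any
# per-record callback claims. Nil callback lists are treated as empty,
# matching the safe navigation (`&.`) used in the job.
def removable_uploads(candidates, unused_callbacks, in_use_callbacks)
  narrowed = (unused_callbacks || []).reduce(candidates) { |acc, handler| handler.call(acc) }
  narrowed.reject { |upload| (in_use_callbacks || []).any? { |cb| cb.call(upload) } }
end

candidates = [
  FakeUpload.new(id: 1, sha1: "abc", tag: nil),
  FakeUpload.new(id: 2, sha1: "def", tag: :plugin_asset),
  FakeUpload.new(id: 3, sha1: "ghi", tag: nil),
  FakeUpload.new(id: 4, sha1: nil, tag: nil),
]

removable_ids = removable_uploads(candidates, unused_callbacks, in_use_callbacks).map(&:id)
puts removable_ids.inspect
```

In this sketch upload 2 is excluded in bulk and upload 3 is vetoed individually, so only uploads 1 and 4 remain candidates for removal.
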
DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging 2019-05-02 18:17:27 -04:00			`# frozen_string_literal: true`

added a job to clean up orphan uploads 2013-10-14 08:27:41 -04:00			`module Jobs`
DEV: Upgrading Discourse to Zeitwerk (#8098) Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains. We no longer need to use Rails "require_dependency" anywhere and instead can just use standard Ruby patterns to require files. This is a far reaching change and we expect some followups here. 2019-10-02 00:01:53 -04:00			`class CleanUpUploads < ::Jobs::Scheduled`
FEATURE: new scheduler Removed sidetiq, introduced new scheduler - add basic UI - add schedule discover - add scheduling in initializer 2014-02-05 18:14:41 -05:00			`every 1.hour`
added a job to clean up orphan uploads 2013-10-14 08:27:41 -04:00
			`def execute(args)`
FIX: always delete invalid upload records 2018-06-04 12:40:57 -04:00			`grace_period = [SiteSetting.clean_orphan_uploads_grace_period_hours, 1].max`
make rubocop happy 2018-06-04 13:06:52 -04:00
DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`# Always remove invalid upload records regardless of clean_up_uploads setting.`
FIX: always delete invalid upload records 2018-06-04 12:40:57 -04:00			`Upload`
FIX: Properly support defaults for upload site settings. 2019-01-02 02:29:17 -05:00			`.by_users`
FIX: always delete invalid upload records 2018-06-04 12:40:57 -04:00			`.where(`
			`"retain_hours IS NULL OR created_at < current_timestamp - interval '1 hour' * retain_hours",`
			`)`
			`.where("created_at < ?", grace_period.hour.ago)`
Upload.url can't be NULL 2018-06-04 12:43:00 -04:00			`.where(url: "")`
FIX: Avoid `destroy_all` in `Jobs::CleanUpUploads`. `destroy_all` loads the entire relation into memory at once. See https://github.com/rails/rails/issues/22510 2018-07-02 00:41:53 -04:00			`.find_each(&:destroy!)`
make rubocop happy 2018-06-04 13:06:52 -04:00
add a sitesetting to enable the CleanUpUploads job 2013-10-16 04:55:42 -04:00			`return unless SiteSetting.clean_up_uploads?`
added a job to clean up orphan uploads 2013-10-14 08:27:41 -04:00
DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`# Do nothing if the last cleanup was run too recently.`
			`last_cleanup_timestamp = last_cleanup`
			`if last_cleanup_timestamp.present? &&`
			`(Time.zone.now.to_i - last_cleanup_timestamp) < (grace_period / 2).hours`
			`return`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 of the 24 daily runs of this ultra expensive query. 2019-10-27 20:14:52 -04:00			`end`

FIX: Properly support defaults for upload site settings. 2019-01-02 02:29:17 -05:00			`result = Upload.by_users`
FEATURE: New plugin API to check if upload is used (#15545) This commit introduces two new APIs for handling unused uploads: one can be used to exclude uploads in bulk when the data model allows, and the other excludes uploads one by one. 2022-02-16 02:00:30 -05:00			`Upload.unused_callbacks&.each { |handler| result = handler.call(result) }`
			`result =`
			`result`
FIX: Properly support defaults for upload site settings. 2019-01-02 02:29:17 -05:00			`.where(`
			`"uploads.retain_hours IS NULL OR uploads.created_at < current_timestamp - interval '1 hour' * uploads.retain_hours",`
			`)`
PERF: `NOT IN` query is really inefficient for large tables. 2016-11-01 23:14:02 -04:00			`.where("uploads.created_at < ?", grace_period.hour.ago)`
DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`# Don't remove any secure uploads.`
FEATURE: Secure media allowing duplicated uploads with category-level privacy and post-based access rules (#8664) ### General Changes and Duplication * We now consider a post `with_secure_media?` if it is in a read-restricted category. * When uploading we now set an upload's secure status straight away. * When uploading if `SiteSetting.secure_media` is enabled, we do not check to see if the upload already exists using the `sha1` digest of the upload. The `sha1` column of the upload is filled with a `SecureRandom.hex(20)` value which is the same length as `Upload::SHA1_LENGTH`. The `original_sha1` column is filled with the _real_ sha1 digest of the file. * Whether an upload `should_be_secure?` is now determined by whether the `access_control_post` is `with_secure_media?` (if there is no access control post then we leave the secure status as is). * When serializing the upload, we now cook the URL if the upload is secure. This is so it shows up correctly in the composer preview, because we set secure status on upload. ### Viewing Secure Media * The secure-media-upload URL will take the post that the upload is attached to into account via `Guardian.can_see?` for access permissions * If there is no `access_control_post` then we just deliver the media. This should be a rare occurrence and shouldn't cause issues as the `access_control_post` is set when `link_post_uploads` is called via `CookedPostProcessor` ### Removed We no longer do any of these because we do not reuse uploads by sha1 if secure media is enabled. * We no longer have a way to prevent cross-posting of a secure upload from a private context to a public context. * We no longer have to set `secure: false` for uploads when uploading for a theme component. 2020-01-15 22:50:27 -05:00			`.where("uploads.access_control_post_id IS NULL")`
FEATURE: Create upload_references table (#16146) This table holds associations between uploads and other models. This can be used to prevent removing uploads that are still in use. * DEV: Create upload_references * DEV: Use UploadReference instead of PostUpload * DEV: Use UploadReference for SiteSetting * DEV: Use UploadReference for Badge * DEV: Use UploadReference for Category * DEV: Use UploadReference for CustomEmoji * DEV: Use UploadReference for Group * DEV: Use UploadReference for ThemeField * DEV: Use UploadReference for ThemeSetting * DEV: Use UploadReference for User * DEV: Use UploadReference for UserAvatar * DEV: Use UploadReference for UserExport * DEV: Use UploadReference for UserProfile * DEV: Add method to extract uploads from raw text * DEV: Use UploadReference for Draft * DEV: Use UploadReference for ReviewableQueuedPost * DEV: Use UploadReference for UserProfile's bio_raw * DEV: Do not copy user uploads to upload references * DEV: Copy post uploads again after deploy * DEV: Use created_at and updated_at from uploads table * FIX: Check if upload site setting is empty * DEV: Copy user uploads to upload references * DEV: Make upload extraction less strict 2022-06-08 19:24:30 -04:00			`.joins("LEFT JOIN upload_references ON upload_references.upload_id = uploads.id")`
DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`# Don't remove any uploads linked to an UploadReference.`
FEATURE: Create upload_references table (#16146) This table holds associations between uploads and other models. This can be used to prevent removing uploads that are still in use. * DEV: Create upload_references * DEV: Use UploadReference instead of PostUpload * DEV: Use UploadReference for SiteSetting * DEV: Use UploadReference for Badge * DEV: Use UploadReference for Category * DEV: Use UploadReference for CustomEmoji * DEV: Use UploadReference for Group * DEV: Use UploadReference for ThemeField * DEV: Use UploadReference for ThemeSetting * DEV: Use UploadReference for User * DEV: Use UploadReference for UserAvatar * DEV: Use UploadReference for UserExport * DEV: Use UploadReference for UserProfile * DEV: Add method to extract uploads from raw text * DEV: Use UploadReference for Draft * DEV: Use UploadReference for ReviewableQueuedPost * DEV: Use UploadReference for UserProfile's bio_raw * DEV: Do not copy user uploads to upload references * DEV: Copy post uploads again after deploy * DEV: Use created_at and updated_at from uploads table * FIX: Check if upload site setting is empty * DEV: Copy user uploads to upload references * DEV: Make upload extraction less strict 2022-06-08 19:24:30 -04:00			`.where("upload_references.upload_id IS NULL")`
DEV: Improve `script/downsize_uploads.rb` (#13508) * Only shrink images that are used in Posts and no other models * Don't save the upload if the size is the same 2021-06-23 18:09:40 -04:00			`.with_no_non_post_relations`
added a job to clean up orphan uploads 2013-10-14 08:27:41 -04:00
FIX: don't destroy uploads in queued posts and drafts 2016-08-01 12:35:57 -04:00			`result.find_each do |upload|`
FEATURE: Create upload_references table (#16146) This table holds associations between uploads and other models. This can be used to prevent removing uploads that are still in use. * DEV: Create upload_references * DEV: Use UploadReference instead of PostUpload * DEV: Use UploadReference for SiteSetting * DEV: Use UploadReference for Badge * DEV: Use UploadReference for Category * DEV: Use UploadReference for CustomEmoji * DEV: Use UploadReference for Group * DEV: Use UploadReference for ThemeField * DEV: Use UploadReference for ThemeSetting * DEV: Use UploadReference for User * DEV: Use UploadReference for UserAvatar * DEV: Use UploadReference for UserExport * DEV: Use UploadReference for UserProfile * DEV: Add method to extract uploads from raw text * DEV: Use UploadReference for Draft * DEV: Use UploadReference for ReviewableQueuedPost * DEV: Use UploadReference for UserProfile's bio_raw * DEV: Do not copy user uploads to upload references * DEV: Copy post uploads again after deploy * DEV: Use created_at and updated_at from uploads table * FIX: Check if upload site setting is empty * DEV: Copy user uploads to upload references * DEV: Make upload extraction less strict 2022-06-08 19:24:30 -04:00			`next if Upload.in_use_callbacks&.any? { |callback| callback.call(upload) }`
DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`upload.sha1.present? ? upload.destroy : upload.delete`
PERF: Split queries when cleaning uploads. This reduces the number of scans that the db has to do in the query to fetch orphan uploads. Furthermore, we were not batching our records, which bloats memory. 2016-07-01 03:22:30 -04:00			`end`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 of the 24 daily runs of this ultra expensive query. 2019-10-27 20:14:52 -04:00
FEATURE: Initial implementation of direct S3 uploads with uppy and stubs (#13787) This adds a few different things to allow for direct S3 uploads using uppy. These changes are still not the default. There are hidden `enable_experimental_image_uploader` and `enable_direct_s3_uploads` settings that must be turned on for any of this code to be used, and even if they are turned on only the User Card Background for the user profile actually uses uppy-image-uploader. A new `ExternalUploadStub` model and database table is introduced in this pull request. This is used to keep track of uploads that are uploaded to a temporary location in S3 with the direct to S3 code, and they are eventually deleted a) when the direct upload is completed and b) after a certain time period of not being used. ### Starting a direct S3 upload When an S3 direct upload is initiated with uppy, we first request a presigned PUT URL from the new `generate-presigned-put` endpoint in `UploadsController`. This generates an S3 key in the `temp` folder inside the correct bucket path, along with any metadata from the clientside (e.g. the SHA1 checksum described below). This will also create an `ExternalUploadStub` and store the details of the temp object key and the file being uploaded. Once the clientside has this URL, uppy will upload the file direct to S3 using the presigned URL. Once the upload is complete we go to the next stage. ### Completing a direct S3 upload Once the upload to S3 is done we call the new `complete-external-upload` route with the unique identifier of the `ExternalUploadStub` created earlier. Only the user who made the stub can complete the external upload. One of two paths is followed via the `ExternalUploadManager`. 1. If the object in S3 is too large (currently 100mb defined by `ExternalUploadManager::DOWNLOAD_LIMIT`) we do not download and generate the SHA1 for that file. Instead we create the `Upload` record via `UploadCreator` and simply copy it to its final destination on S3 then delete the initial temp file. Several modifications to `UploadCreator` have been made to accommodate this. 2. If the object in S3 is small enough, we download it. When the temporary S3 file is downloaded, we compare the SHA1 checksum generated by the browser with the actual SHA1 checksum of the file generated by ruby. The browser SHA1 checksum is stored on the object in S3 with metadata, and is generated via the `UppyChecksum` plugin. Keep in mind that some browsers will not generate this due to compatibility or other issues. We then follow the normal `UploadCreator` path with one exception. To cut down on having to re-upload the file again, if there are no changes (such as resizing etc) to the file in `UploadCreator` we follow the same copy + delete temp path that we do for files that are too large. 3. Finally we return the serialized upload record back to the client There are several errors that could happen that are handled by `UploadsController` as well. Also in this PR is some refactoring of `displayErrorForUpload` to handle both uppy and jquery file uploader errors. 2021-07-27 18:42:25 -04:00			`ExternalUploadStub.cleanup!`

PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 of the 24 daily runs of this ultra expensive query. 2019-10-27 20:14:52 -04:00			`self.last_cleanup = Time.zone.now.to_i`
			`end`

DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`def last_cleanup=(timestamp)`
			`Discourse.redis.setex(last_cleanup_key, 7.days.to_i, timestamp.to_s)`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 of the 24 daily runs of this ultra expensive query. 2019-10-27 20:14:52 -04:00			`end`

			`def last_cleanup`
DEV: Housekeeping for CleanUpUploads job (#24361) Followup to 9db8f00b3dd6f2881adf1b786e29426889225e7a, we don't need this dead code any more. Also made some minor improvements and comments. 2023-11-19 18:50:09 -05:00			`timestamp = Discourse.redis.get(last_cleanup_key)`
			`timestamp ? timestamp.to_i : timestamp`
PERF: Split queries when cleaning uploads. This reduces the number of scans that the db has to do in the query to fetch orphan uploads. Furthermore, we were not batching our records, which bloats memory. 2016-07-01 03:22:30 -04:00			`end`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 of the 24 daily runs of this ultra expensive query. 2019-10-27 20:14:52 -04:00
			`def reset_last_cleanup!`
DEV: s/\$redis/Discourse\.redis (#8431) This commit also adds a rubocop rule to prevent global variables. 2019-12-03 04:05:53 -05:00			`Discourse.redis.del(last_cleanup_key)`
PERF: run expensive clean up uploads less frequently Previously every hour we would run a full scan of the entire DB searching for expired uploads that need to be moved to the tombstone folder. This commit amends it so we only run the job 2 times per clean_orphan_uploads_grace_period_hours. There is an upper bound of 7 days, so even if the grace period is set really high it will still run at least once a week. By default we have a 48-hour grace period, so this amends it to run this cleanup daily instead of hourly. This eliminates 23 of the 24 daily runs of this ultra expensive query. 2019-10-27 20:14:52 -04:00			`end`

			`protected`

			`def last_cleanup_key`
			`"LAST_UPLOAD_CLEANUP"`
			`end`
PERF: Split queries when cleaning uploads. This reduces the number of scans that the db has to do in the query to fetch orphan uploads. Furthermore, we were not batching our records, which bloats memory. 2016-07-01 03:22:30 -04:00			`end`
added a job to clean up orphan uploads 2013-10-14 08:27:41 -04:00			`end`
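
The throttling described in the "run expensive clean up uploads less frequently" commit comes down to a small piece of arithmetic: skip the expensive query if the last run was less than half the grace period ago, and let the 7-day expiry of the Redis key force a run at least weekly. The sketch below restates that in plain Ruby under some assumptions: `skip_cleanup?` and its keyword arguments are hypothetical names, timestamps are integer seconds, and the Redis expiry is emulated by treating a timestamp older than 7 days as missing.

```ruby
SECONDS_PER_HOUR = 3600
# Mirrors the 7-day expiry passed to `setex` in `last_cleanup=`.
KEY_TTL_SECONDS = 7 * 24 * SECONDS_PER_HOUR

# Returns true when the expensive orphan scan should be skipped.
# `now` and `last_cleanup` are Unix timestamps in seconds; `last_cleanup`
# is nil when no previous run has been recorded.
def skip_cleanup?(now:, last_cleanup:, grace_period_hours:)
  grace_period = [grace_period_hours, 1].max

  # No recorded run, or the key would have expired in Redis: run the scan.
  return false if last_cleanup.nil?
  return false if now - last_cleanup >= KEY_TTL_SECONDS

  # Integer division, as in the job: a 1-hour grace period gives a
  # 0-hour threshold, so the scan is never throttled in that case.
  (now - last_cleanup) < (grace_period / 2) * SECONDS_PER_HOUR
end

now = 1_000_000_000

# 48-hour grace period: the threshold is 24 hours between scans.
puts skip_cleanup?(now: now, last_cleanup: now - 1 * SECONDS_PER_HOUR, grace_period_hours: 48)
puts skip_cleanup?(now: now, last_cleanup: now - 25 * SECONDS_PER_HOUR, grace_period_hours: 48)
```

With a 48-hour grace period the first call is skipped (1 hour since the last run) and the second is not (25 hours), which matches the commit's note that the scan runs daily instead of hourly.
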