We observed that some sites seemingly put us in a tarpit when we attempt to pull hotlinked images. Increasing the timeout will help in these situations.
This commit introduces a new site setting: `block_hotlinked_media`. When enabled, all attempts to hotlink media (images, videos, and audio) will fail, and be replaced with a linked placeholder. Exceptions to the rule can be added via `block_hotlinked_media_exceptions`.
`download_remote_image_to_local` can be used alongside this feature. In that case, hotlinked images will be blocked immediately when the post is created, but will then be replaced with the downloaded version a few seconds later.
This implementation is purely server-side, and does not impact the composer preview.
Technically, there are two stages to this feature:
1. `PrettyText.sanitize_hotlinked_media` is called during `PrettyText.cook`, and whenever new images are introduced by Onebox. It will iterate over all src/srcset attributes in the post HTML and check if they're allowed. If not, the attributes will be removed and replaced with a `data-blocked-hotlinked-src(set)` attribute
2. In the `CookedPostProcessor`, we iterate over all `data-blocked-hotlinked-src(set)` attributes and check whether we have a downloaded version of the media. If yes, we update the src to use the downloaded version. If not, the entire media element is replaced with a placeholder. The placeholder is labelled 'external media', and is a link to the offsite media.
Previously, with the default `editing_grace_period`, hotlinked images were pulled 5 minutes after a post is created. This delay was added to reduce the chance of automated edits clashing with user edits.
This commit refactors things so that we can pull hotlinked images immediately. URLs are immediately updated in the post's `cooked` HTML. The post's raw markdown is updated later, after the `editing_grace_period`.
This involves a number of behind-the-scenes changes including:
- Schedule Jobs::PullHotlinkedImages immediately after Jobs::ProcessPost. Move scheduling to after the `update_column` call to avoid race conditions
- Move raw changes into a separate job, which is delayed until after the ninja-edit window
- Move disable_if_low_on_disk_space logic into the `pull_hotlinked_images` job
- Move raw-parsing/replacing logic into `InlineUpload` so it can be easily be shared between `UpdateHotlinkedRaw` and `PullUserProfileHotlinkedImages`
Previously this mapping of **cooked** images was only being run for oneboxes. Now it runs for all images, so we can transform hotlinked images without needing to immediately update `raw`
This will make future changes to the 'pull hotlinked images' system easier. This commit should not introduce any functional change.
For now, the old post_custom_field data is kept in the database. This will be dropped in a future commit.
raw_html posts (i.e. those which are pulled as part of our comments integration) don't go through our markdown pipeline, so `upload://` URLs are not supported. Running pull_hotlinked_images will break any images in the post.
In future we may add support for pulling hotlinked images in these posts. But for now, disabling it will stop it breaking images.
This commit should cause no functional change
- Split into functions to avoid deep nesting
- Register custom field type, and remove manual json parse/serialize
- Recover from deleted upload records
Also adds a test to ensure pull_hotlinked_images redownloads secure images only once
Previously the pull hotlinked images job was skipped after system edits. This ensured that we never had an infinite loop of system-edit/pull-hotlinked/system-edit/pull-hotlinked etc.
A side effect was that edits made by system for any other reason (e.g. API, removing full quotes) would prevent pulling hotlinked images. This commit removes the system edit check, and replaces it with another method to avoid an infinite job scheduling loop.
Basically, say you had already downloaded a certain image from a certain URL
using pull_hotlinked_images and the onebox. The upload would be stored
by its sha as an upload record. Whenever you linked to the same URL again
in a post (e.g. in our case an og:image on review.discourse) we would
would reuse the original upload record because of the sha1.
However when you turned on secure media this could cause problems as
the first post that uses that upload after secure media is enabled
will set the access control post for the upload to the new post.
Then if the post is deleted every single onebox/link to that same image
URL will fail forever with 403 as the secure-media-uploads URL fails
if the access control post has been deleted.
To fix this when cooking posts and pulling hotlinked images, we only
allow using an original upload by URL if its access control post
matches the current post, and if the original_sha1 is filled in,
meaning it was uploaded AFTER secure media was enabled. otherwise
we just redownload the media again to be safe, as the URL will always
be new then.
When pull_hotlinked_images tried to run on posts with secure media (which had already been downloaded from external sources) we were getting a 404 when trying to download the image because the secure endpoint doesn't allow anon downloads.
Also, we were getting into an infinite loop of pull_hotlinked_images because the job didn't consider the secure media URLs as "downloaded" already so it kept trying to download them over and over.
In this PR I have also refactored secure-media-upload URL checks and mutations into single source of truth in Upload, adding a SECURE_MEDIA_ROUTE constant to check URLs against too.
Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains.
We no longer need to use Rails "require_dependency" anywhere and instead can just use standard
Ruby patterns to require files.
This is a far reaching change and we expect some followups here.
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.
Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
* `rescue nil` is a really bad pattern to use in our code base.
We should rescue errors that we expect the code to throw and
not rescue everything because we're unsure of what errors the
code would throw. This would reduce the amount of pain we face
when debugging why something isn't working as expexted. I've
been bitten countless of times by errors being swallowed as a
result during debugging sessions.
This refactors handling of s3 so it can be specified via GlobalSetting
This means that in a multisite environment you can configure s3 uploads
without actual sites knowing credentials in s3
It is a critical setting for situations where assets are mirrored to s3.