Commit Graph

73 Commits

Author SHA1 Message Date
Osama Sayegh d15867463f
FEATURE: Site setting for blocking onebox of URLs that redirect (#16881)
Meta topic: https://meta.discourse.org/t/prevent-to-linkify-when-there-is-a-redirect/226964/2?u=osama.

This commit adds a new site setting `block_onebox_on_redirect` (default off) for blocking oneboxes (full and inline) of URLs that redirect. Note that an initial http → https redirect is still allowed if the redirect location is identical to the source (minus the scheme of course). For example, if a user includes a link to `http://example.com/page` and the link resolves to `https://example.com/page`, then the link will onebox (assuming it can be oneboxed) even if the setting is enabled. The reason for this is a user may type out a URL (i.e. the URL is short and memorizable) with http and since a lot of sites support TLS with http traffic automatically redirected to https, so we should still allow the URL to onebox.
2022-05-23 13:52:06 +03:00
Dan Ungureanu 8e9cbe9db4
FIX: Do not raise if title cannot be crawled (#16247)
If the crawled page returned an error, `FinalDestination#safe_get`
yielded `nil` for `uri` and `chunk` arguments. Another problem is that
`get` did not handle the case when `safe_get` failed and did not return
the `location` and `set_cookie` headers.
2022-03-22 20:13:27 +02:00
Bianca Nenciu b0f414f7f5
DEV: Remove unused uri parameter (#16179)
The parameter is not used and it did not work properly anyway
because sometimes `@uri` is used instead of `uri`, which can
be different.
2022-03-16 16:42:25 +02:00
Osama Sayegh b0656f3ed0
FIX: Apply onebox blocked domain checks on every redirect (#16150)
The `blocked onebox domains` setting lets site owners change what sites
are allowed to be oneboxed. When a link is entered into a post,
Discourse checks the domain of the link against that setting and blocks
the onebox if the domain is blocked. But if there's a chain of
redirects, then only the final destination website is checked against
the site setting.

This commit amends that behavior so that every website in the redirect
chain is checked against the site setting, and if anything is blocked
the original link doesn't onebox at all in the post. The
`Discourse-No-Onebox` header is also checked in every response and the
onebox is blocked if the header is set to "1".

Additionally, Discourse will now include the `Discourse-No-Onebox`
header with every response if the site requires login to access content.
This is done to signal to a Discourse instance that it shouldn't attempt
to onebox other Discourse instances if they're login-only. Non-Discourse
websites can also use include that header if they don't wish to have
Discourse onebox their content.

Internal ticket: t59305.
2022-03-11 09:18:12 +03:00
Osama Sayegh 9b5cc1424f
DEV: Don't mutate `Excon.defaults[:middlewares]` (#16151)
`Excon.defaults` and its middlewares array are constants that we
shouldn't mutate everytime `FinalDestination#resolve` is called.
2022-03-10 14:21:45 +03:00
jbrw cf545be338
FIX: Increase FinalDestination MAX_REQUEST_SIZE_BYTES (#15998)
The default of 1Mb was preventing some valid Onebox requests from successfully completing.

Increasing this to 5Mb should reduce the number of unexpected failures.
2022-02-18 13:37:31 -05:00
Krzysztof Kotlarek a34075d205
SECURITY: Onebox response timeout and size limit (#15927)
Validation to ensure that Onebox request is no longer than 10 seconds and response size is not bigger than 1 MB
2022-02-14 12:11:09 +11:00
Natalie Tay aac9f43038
Only block domains at the final destination (#15689)
In an earlier PR, we decided that we only want to block a domain if 
the blocked domain in the SiteSetting is the final destination (/t/59305). That 
PR used `FinalDestination#get`. `resolve` however is used several places
 but blocks domains along the redirect chain when certain options are provided.

This commit changes the default options for `resolve` to not do that. Existing
users of `FinalDestination#resolve` are
- `Oneboxer#external_onebox`
- our onebox helper `fetch_html_doc`, which is used in amazon, standard embed 
and youtube
  - these folks already go through `Oneboxer#external_onebox` which already
  blocks correctly
2022-01-31 15:35:12 +08:00
Roman Rizzi 53abcd825d
FIX: Canonical URLs may be relative (#14825)
FinalDestination's follow_canonical mode used for embedded topics should work when canonical URLs are relative, as specified in [RFC 6596](https://datatracker.ietf.org/doc/html/rfc6596)
2021-11-05 14:20:14 -03:00
jbrw 978a005a42
FIX: resolve responses of 103 should be retried using small_get (#14773)
If the initial `get`/`head` response within `resolve` returns a status code of `103`, attempt to fetch the same URL with the alternative `small_get` method.
2021-10-29 14:51:56 -04:00
Roman Rizzi 4c2d5158c5
FIX: Follow the canonical URL when importing a remote topic. (#14489)
FinalDestination now supports the `follow_canonical` option, which will perform an initial GET request, parse the canonical link if present, and perform a HEAD request to it.

We use this mode during embeds to avoid treating URLs with different query parameters as different topics.
2021-10-01 12:48:21 -03:00
jbrw 2f28ba318c
FEATURE: Onebox can match engines based on the content_type (#13876)
* FEATURE: Onebox can match engines based on the content_type

`FinalDestination` now returns the `content_type` of a resolved URL.

`Oneboxer` passes this value to `Onebox` itself. Onebox engines can now specify a `matches_content_type` regex of content_types that the engine can handle, regardless of the URL.

`ImageOnebox` will match URLs with a content type of `image/png`, `jpg`, `gif`, `bmp`, `tif`, etc.

This will allow images that exist at a URL without a file type extension to be correctly rendered, assuming a valid `content_type` is returned.
2021-07-30 13:36:30 -04:00
jbrw 09d23a37a5
DEV: Add default `Accept-Language` to FinalDestination requests (#13817)
Not specifying an `Accept-Language` should be equivalent to specifying an `Accept-Language` of `*`, however some webservers seem to prefer it if we are explicit about being able to handle a response of content in any language.
2021-07-22 15:49:59 +10:00
Joffrey JAFFEUX e50b7e9111
SECURITY: ensures timeouts are correctly used on connect (#13455) 2021-06-21 17:34:01 +02:00
jbrw 19182b1386
DEV: Oneboxer wildcard subdomains (#13015)
* DEV: Allow wildcards in Oneboxer optional domain Site Settings

Allows a wildcard to be used as a subdomain on Oneboxer-related SiteSettings, e.g.:

- `force_get_hosts`
- `cache_onebox_response_body_domains`
- `force_custom_user_agent_hosts`

* DEV: fix typos

* FIX: Try doing a GET after receiving a 500 error from a HEAD

By default we try to do a `HEAD` requests. If this results in a 500 error response, we should try to do a `GET`

* DEV: `force_get_hosts` should be a hidden setting

* DEV: Oneboxer Strategies

Have an alternative oneboxing ‘strategy’ (i.e., set of options) to use when an attempt to generate a Onebox fails. Keep track of any non-default strategies that were used on a particular host, and use that strategy for that host in the future.

Initially, the alternate strategy (`force_get_and_ua`) forces the FinalDestination step of Oneboxing to do a `GET` rather than `HEAD`, and forces a custom user agent.

* DEV: change stubbed return code

The stubbed status code needs to be a value not recognized by FinalDestination
2021-05-13 15:48:35 -04:00
Joffrey JAFFEUX 64dda7112d
FIX: correctly use timeouts in `FileHelper` and `FinalDestination` (#12921)
Previous refactors have lost usage of read_timeout in `FileHelper.download` and `FinalDestination` was incorrectly using `Net::HTTP.start` by setting `open_timeout` in the block instead of directly during the invocation.

Couldn't figure how to write a good test for this without slowing the spec.
2021-05-03 09:21:11 +02:00
jbrw 68d0916eb5
FEATURE: Oneboxer cache response body (#12562)
* FEATURE: Cache successful HTTP GET requests during Oneboxing

Some oneboxes may fail if when making excessive and/or odd requests against the target domains. This change provides a simple mechanism to cache the results of succesful GET requests as part of the oneboxing process, with the goal of reducing repeated requests and ultimately improving the rate of successful oneboxing.

To enable:

Set `SiteSetting.cache_onebox_response_body` to `true`

Add the domains you’re interesting in caching to `SiteSetting. cache_onebox_response_body_domains` e.g. `example.com|example.org|example.net`

Optionally set `SiteSetting.cache_onebox_user_agent` to a user agent string of your choice to use when making requests against domains in the above list.

* FIX: Swap order of duration and value in redis call

The correct order for `setex` arguments is `key`, `duration`, and `value`.

Duration and value had been flipped, however the code would not have thrown an error because we were caching the value of `1.day.to_i` for a period of 1 seconds… The intention appears to be to set a value of 1 (purely as a flag) for a period of 1 day.
2021-03-31 13:19:34 -04:00
jbrw 331236d6d7
Onebox improved error handling and support for Instagram Access Tokens (#11253)
* FEATURE: display error if Oneboxing fails due to HTTP error

- display warning if onebox URL is unresolvable
- display warning if attributes are missing

* FEATURE: Use new Instagram oEmbed endpoint if access token is configured

Instagram requires an Access Token to access their oEmbed endpoint. The requirements (from https://developers.facebook.com/docs/instagram/oembed/) are as follows:

- a Facebook Developer account, which you can create at developers.facebook.com
- a registered Facebook app
- the oEmbed Product added to the app
- an Access Token
- The Facebook app must be in Live Mode

The generated Access Token, once added to SiteSetting.facebook_app_access_token, will be passed to onebox. Onebox can then use this token to access the oEmbed endpoint to generate a onebox for Instagram.

* DEV: update user agent string

* DEV: don’t do HEAD requests against news.yahoo.com

* DEV: Bump onebox version from 2.1.5 to 2.1.6

* DEV: Avoid re-reading templates

* DEV: Tweaks to onebox mustache templates

* DEV: simplified error message for missing onebox data

* Apply suggestions from code review
Co-authored-by: Gerhard Schlager <mail@gerhard-schlager.at>
2020-11-18 12:55:16 -05:00
Krzysztof Kotlarek e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Martin Brennan edbc356593
FIX: Replace deprecated URI.encode, URI.escape, URI.unescape and URI.unencode (#8528)
The following methods have long been deprecated in ruby due to flaws in their implementation per http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-core/29293?29179-31097:

URI.escape
URI.unescape
URI.encode
URI.unencode
escape/encode are just aliases for one another. This PR uses the Addressable gem to replace these methods with its own encode, unencode, and encode_component methods where appropriate.

I have put all references to Addressable::URI here into the UrlHelper to keep them corralled in one place to make changes to this implementation easier.

Addressable is now also an explicit gem dependency.
2019-12-12 12:49:21 +10:00
Joffrey JAFFEUX 0d3d2c43a0
DEV: s/\$redis/Discourse\.redis (#8431)
This commit also adds a rubocop rule to prevent global variables.
2019-12-03 10:05:53 +01:00
Arpit Jalan 00c406520e FEATURE: allow FinalDestination to use custom user agent for specific hosts 2019-11-07 14:47:51 +05:30
Arpit Jalan e90aac11cb fix the build 2019-08-07 16:39:58 +05:30
Arpit Jalan b0e781e2d4 FIX: do not follow redirect on same host with path /login or /session 2019-08-07 16:26:55 +05:30
Sam Saffron 7429700389 FIX: ensure we can download maxmind without redis or db config
This also corrects FileHelper.download so it supports "follow_redirect"
correctly (it used to always follow 1 redirect) and adds a `validate_url`
param that will bypass all uri validation if set to false (default is true)
2019-05-28 10:28:57 +10:00
Sam Saffron 30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Gerhard Schlager 92df6890df FIX: GET request didn't use headers 2019-03-08 21:36:49 +01:00
Sam cfddfa6de2 SECURITY: bypass long GET requests
In some rare cases we would check URLs with very large payloads
this ensures we always bypass and do not read entire payloads
2019-02-27 14:51:28 +11:00
Arpit Jalan 1ab91f0474 FIX: preserve github fragment URL 2018-12-19 12:34:47 +05:30
Guo Xiang Tan 8dc1463ab3 Enable `Lint/ShadowingOuterLocalVariable` for Rubocop. 2018-09-04 10:16:42 +08:00
Bianca Nenciu b6963b8ffb FIX: Ignore OneBox blacklisted domains. 2018-08-27 20:40:55 +02:00
Régis Hanol de92913bf4 FIX: store the topic links using the cooked upload url 2018-08-14 12:23:32 +02:00
Robin Ward 7058205f70 FIX: Broken specs 2018-07-24 12:00:34 -04:00
Robin Ward 236243f38a SECURITY: Consider `0.0.0.0` a private IP 2018-07-24 11:16:27 -04:00
Guo Xiang Tan d43895e2a0 Don't log 404s for `FinalDestination`.
* We can't do anything about 404s
2018-05-25 10:11:16 +08:00
Guo Xiang Tan 142571bba0 Remove use of `rescue nil`.
* `rescue nil` is a really bad pattern to use in our code base.
  We should rescue errors that we expect the code to throw and
  not rescue everything because we're unsure of what errors the
  code would throw. This would reduce the amount of pain we face
  when debugging why something isn't working as expexted. I've
  been bitten countless of times by errors being swallowed as a
  result during debugging sessions.
2018-04-02 13:52:51 +08:00
Guo Xiang Tan ee69d58a59 FIX: Tests could get stucked in infinite loop if it fails to resolve IP of a hostname. 2018-03-28 14:49:05 +08:00
Gerhard Schlager 4a54c09e46 FIX: Retry with GET request when HEAD fails with error 400 2018-02-27 12:07:16 +01:00
Régis Hanol 0559a4736a FIX: don't double request when downloading a file 2018-02-24 12:35:57 +01:00
Gerhard Schlager b6277e208b FIX: Cookies header didn't have the right format 2018-02-19 12:46:57 +01:00
Sam fa5880e04f PERF: ability to crawl for titles without extra HEAD req
Also, introduces a much more aggressive timeout for title crawling
and introduces gzip to body that is crawled
2018-01-29 15:40:12 +11:00
Gerhard Schlager e30851e45a Move escape_uri method to a more suitable place 2017-12-12 20:17:46 +01:00
Régis Hanol de037da731 FIX: FinalDestination's small_get method wasn't using proper request headers 2017-11-17 17:24:35 +01:00
Régis Hanol aebcd56300 FIX: try a GET for error code 406 2017-11-17 16:59:51 +01:00
Régis Hanol 221ff24418 SQL != Ruby 2017-11-17 16:12:20 +01:00
Régis Hanol a0fc8bd924 don't log 404s to gravatar.com 2017-11-17 15:38:26 +01:00
Sam 3ac7d041ae UX: generic onebox treats all square images as avatars and renders them smaller 2017-11-13 11:21:19 +11:00
Gerhard Schlager d1f257d275 FinalDestination should only log when verbose is enabled 2017-10-31 17:16:59 +01:00
Gerhard Schlager 8c27f28dcb add more logging to FinalDestination 2017-10-31 12:26:35 +01:00
Sam Saffron 8185b8cb06 FEATURE: cache https redirects per hostname
If a hostname does an https redirect we cache that so next
lookup does not incur it.

Also, only rate limit per ip once per final destination

Raise final destination protection to 1000 ip lookups an hour
2017-10-17 16:22:54 +11:00