discourse

Commit Graph

Author	SHA1	Message	Date
Osama Sayegh	d15867463f	FEATURE: Site setting for blocking onebox of URLs that redirect (#16881 ) Meta topic: https://meta.discourse.org/t/prevent-to-linkify-when-there-is-a-redirect/226964/2?u=osama. This commit adds a new site setting `block_onebox_on_redirect` (default off) for blocking oneboxes (full and inline) of URLs that redirect. Note that an initial http → https redirect is still allowed if the redirect location is identical to the source (minus the scheme of course). For example, if a user includes a link to `http://example.com/page` and the link resolves to `https://example.com/page`, then the link will onebox (assuming it can be oneboxed) even if the setting is enabled. The reason for this is that a user may type out a URL (i.e. the URL is short and memorizable) with http, and since a lot of sites support TLS with http traffic automatically redirected to https, we should still allow the URL to onebox.	2022-05-23 13:52:06 +03:00
Dan Ungureanu	8e9cbe9db4	FIX: Do not raise if title cannot be crawled (#16247 ) If the crawled page returned an error, `FinalDestination#safe_get` yielded `nil` for `uri` and `chunk` arguments. Another problem is that `get` did not handle the case when `safe_get` failed and did not return the `location` and `set_cookie` headers.	2022-03-22 20:13:27 +02:00
Bianca Nenciu	b0f414f7f5	DEV: Remove unused uri parameter (#16179 ) The parameter is not used and it did not work properly anyway because sometimes `@uri` is used instead of `uri`, which can be different.	2022-03-16 16:42:25 +02:00
Osama Sayegh	b0656f3ed0	FIX: Apply onebox blocked domain checks on every redirect (#16150 ) The `blocked onebox domains` setting lets site owners change what sites are allowed to be oneboxed. When a link is entered into a post, Discourse checks the domain of the link against that setting and blocks the onebox if the domain is blocked. But if there's a chain of redirects, then only the final destination website is checked against the site setting. This commit amends that behavior so that every website in the redirect chain is checked against the site setting, and if anything is blocked the original link doesn't onebox at all in the post. The `Discourse-No-Onebox` header is also checked in every response and the onebox is blocked if the header is set to "1". Additionally, Discourse will now include the `Discourse-No-Onebox` header with every response if the site requires login to access content. This is done to signal to a Discourse instance that it shouldn't attempt to onebox other Discourse instances if they're login-only. Non-Discourse websites can also include that header if they don't wish to have Discourse onebox their content. Internal ticket: t59305.	2022-03-11 09:18:12 +03:00
Osama Sayegh	9b5cc1424f	DEV: Don't mutate `Excon.defaults[:middlewares]` (#16151 ) `Excon.defaults` and its middlewares array are constants that we shouldn't mutate every time `FinalDestination#resolve` is called.	2022-03-10 14:21:45 +03:00
jbrw	cf545be338	FIX: Increase FinalDestination MAX_REQUEST_SIZE_BYTES (#15998 ) The default of 1Mb was preventing some valid Onebox requests from successfully completing. Increasing this to 5Mb should reduce the number of unexpected failures.	2022-02-18 13:37:31 -05:00
Krzysztof Kotlarek	a34075d205	SECURITY: Onebox response timeout and size limit (#15927 ) Validation to ensure that Onebox request is no longer than 10 seconds and response size is not bigger than 1 MB	2022-02-14 12:11:09 +11:00
Natalie Tay	aac9f43038	Only block domains at the final destination (#15689 ) In an earlier PR, we decided that we only want to block a domain if the blocked domain in the SiteSetting is the final destination (/t/59305). That PR used `FinalDestination#get`. `resolve` however is used in several places but blocks domains along the redirect chain when certain options are provided. This commit changes the default options for `resolve` to not do that. Existing users of `FinalDestination#resolve` are - `Oneboxer#external_onebox` - our onebox helper `fetch_html_doc`, which is used in amazon, standard embed and youtube - these folks already go through `Oneboxer#external_onebox` which already blocks correctly	2022-01-31 15:35:12 +08:00
Roman Rizzi	53abcd825d	FIX: Canonical URLs may be relative (#14825 ) FinalDestination's follow_canonical mode used for embedded topics should work when canonical URLs are relative, as specified in [RFC 6596](https://datatracker.ietf.org/doc/html/rfc6596)	2021-11-05 14:20:14 -03:00
jbrw	978a005a42	FIX: resolve responses of 103 should be retried using small_get (#14773 ) If the initial `get`/`head` response within `resolve` returns a status code of `103`, attempt to fetch the same URL with the alternative `small_get` method.	2021-10-29 14:51:56 -04:00
Roman Rizzi	4c2d5158c5	FIX: Follow the canonical URL when importing a remote topic. (#14489 ) FinalDestination now supports the `follow_canonical` option, which will perform an initial GET request, parse the canonical link if present, and perform a HEAD request to it. We use this mode during embeds to avoid treating URLs with different query parameters as different topics.	2021-10-01 12:48:21 -03:00
jbrw	2f28ba318c	FEATURE: Onebox can match engines based on the content_type (#13876 ) * FEATURE: Onebox can match engines based on the content_type `FinalDestination` now returns the `content_type` of a resolved URL. `Oneboxer` passes this value to `Onebox` itself. Onebox engines can now specify a `matches_content_type` regex of content_types that the engine can handle, regardless of the URL. `ImageOnebox` will match URLs with a content type of `image/png`, `jpg`, `gif`, `bmp`, `tif`, etc. This will allow images that exist at a URL without a file type extension to be correctly rendered, assuming a valid `content_type` is returned.	2021-07-30 13:36:30 -04:00
jbrw	09d23a37a5	DEV: Add default `Accept-Language` to FinalDestination requests (#13817 ) Not specifying an `Accept-Language` should be equivalent to specifying an `Accept-Language` of `*`, however some webservers seem to prefer it if we are explicit about being able to handle a response of content in any language.	2021-07-22 15:49:59 +10:00
Joffrey JAFFEUX	e50b7e9111	SECURITY: ensures timeouts are correctly used on connect (#13455 )	2021-06-21 17:34:01 +02:00
jbrw	19182b1386	DEV: Oneboxer wildcard subdomains (#13015 ) * DEV: Allow wildcards in Oneboxer optional domain Site Settings Allows a wildcard to be used as a subdomain on Oneboxer-related SiteSettings, e.g.: - `force_get_hosts` - `cache_onebox_response_body_domains` - `force_custom_user_agent_hosts` * DEV: fix typos * FIX: Try doing a GET after receiving a 500 error from a HEAD By default we try to do a `HEAD` request. If this results in a 500 error response, we should try to do a `GET` * DEV: `force_get_hosts` should be a hidden setting * DEV: Oneboxer Strategies Have an alternative oneboxing ‘strategy’ (i.e., set of options) to use when an attempt to generate a Onebox fails. Keep track of any non-default strategies that were used on a particular host, and use that strategy for that host in the future. Initially, the alternate strategy (`force_get_and_ua`) forces the FinalDestination step of Oneboxing to do a `GET` rather than `HEAD`, and forces a custom user agent. * DEV: change stubbed return code The stubbed status code needs to be a value not recognized by FinalDestination	2021-05-13 15:48:35 -04:00
Joffrey JAFFEUX	64dda7112d	FIX: correctly use timeouts in `FileHelper` and `FinalDestination` (#12921 ) Previous refactors have lost usage of read_timeout in `FileHelper.download` and `FinalDestination` was incorrectly using `Net::HTTP.start` by setting `open_timeout` in the block instead of directly during the invocation. Couldn't figure how to write a good test for this without slowing the spec.	2021-05-03 09:21:11 +02:00
jbrw	68d0916eb5	FEATURE: Oneboxer cache response body (#12562 ) * FEATURE: Cache successful HTTP GET requests during Oneboxing Some oneboxes may fail when making excessive and/or odd requests against the target domains. This change provides a simple mechanism to cache the results of successful GET requests as part of the oneboxing process, with the goal of reducing repeated requests and ultimately improving the rate of successful oneboxing. To enable: Set `SiteSetting.cache_onebox_response_body` to `true` Add the domains you’re interested in caching to `SiteSetting.cache_onebox_response_body_domains` e.g. `example.com\|example.org\|example.net` Optionally set `SiteSetting.cache_onebox_user_agent` to a user agent string of your choice to use when making requests against domains in the above list. * FIX: Swap order of duration and value in redis call The correct order for `setex` arguments is `key`, `duration`, and `value`. Duration and value had been flipped, however the code would not have thrown an error because we were caching the value of `1.day.to_i` for a period of 1 second… The intention appears to be to set a value of 1 (purely as a flag) for a period of 1 day.	2021-03-31 13:19:34 -04:00
jbrw	331236d6d7	Onebox improved error handling and support for Instagram Access Tokens (#11253 ) * FEATURE: display error if Oneboxing fails due to HTTP error - display warning if onebox URL is unresolvable - display warning if attributes are missing * FEATURE: Use new Instagram oEmbed endpoint if access token is configured Instagram requires an Access Token to access their oEmbed endpoint. The requirements (from https://developers.facebook.com/docs/instagram/oembed/) are as follows: - a Facebook Developer account, which you can create at developers.facebook.com - a registered Facebook app - the oEmbed Product added to the app - an Access Token - The Facebook app must be in Live Mode The generated Access Token, once added to SiteSetting.facebook_app_access_token, will be passed to onebox. Onebox can then use this token to access the oEmbed endpoint to generate a onebox for Instagram. * DEV: update user agent string * DEV: don’t do HEAD requests against news.yahoo.com * DEV: Bump onebox version from 2.1.5 to 2.1.6 * DEV: Avoid re-reading templates * DEV: Tweaks to onebox mustache templates * DEV: simplified error message for missing onebox data * Apply suggestions from code review Co-authored-by: Gerhard Schlager <mail@gerhard-schlager.at>	2020-11-18 12:55:16 -05:00
Krzysztof Kotlarek	e0d9232259	FIX: use allowlist and blocklist terminology (#10209 ) This PR renames whitelist to allowlist and blacklist to blocklist.	2020-07-27 10:23:54 +10:00
Martin Brennan	edbc356593	FIX: Replace deprecated URI.encode, URI.escape, URI.unescape and URI.unencode (#8528 ) The following methods have long been deprecated in ruby due to flaws in their implementation per http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-core/29293?29179-31097: URI.escape URI.unescape URI.encode URI.unencode escape/encode are just aliases for one another. This PR uses the Addressable gem to replace these methods with its own encode, unencode, and encode_component methods where appropriate. I have put all references to Addressable::URI here into the UrlHelper to keep them corralled in one place to make changes to this implementation easier. Addressable is now also an explicit gem dependency.	2019-12-12 12:49:21 +10:00
Joffrey JAFFEUX	0d3d2c43a0	DEV: s/\$redis/Discourse\.redis (#8431 ) This commit also adds a rubocop rule to prevent global variables.	2019-12-03 10:05:53 +01:00
Arpit Jalan	00c406520e	FEATURE: allow FinalDestination to use custom user agent for specific hosts	2019-11-07 14:47:51 +05:30
Arpit Jalan	e90aac11cb	fix the build	2019-08-07 16:39:58 +05:30
Arpit Jalan	b0e781e2d4	FIX: do not follow redirect on same host with path /login or /session	2019-08-07 16:26:55 +05:30
Sam Saffron	7429700389	FIX: ensure we can download maxmind without redis or db config This also corrects FileHelper.download so it supports "follow_redirect" correctly (it used to always follow 1 redirect) and adds a `validate_url` param that will bypass all uri validation if set to false (default is true)	2019-05-28 10:28:57 +10:00
Sam Saffron	30990006a9	DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging	2019-05-13 09:31:32 +08:00
Gerhard Schlager	92df6890df	FIX: GET request didn't use headers	2019-03-08 21:36:49 +01:00
Sam	cfddfa6de2	SECURITY: bypass long GET requests In some rare cases we would check URLs with very large payloads this ensures we always bypass and do not read entire payloads	2019-02-27 14:51:28 +11:00
Arpit Jalan	1ab91f0474	FIX: preserve github fragment URL	2018-12-19 12:34:47 +05:30
Guo Xiang Tan	8dc1463ab3	Enable `Lint/ShadowingOuterLocalVariable` for Rubocop.	2018-09-04 10:16:42 +08:00
Bianca Nenciu	b6963b8ffb	FIX: Ignore OneBox blacklisted domains.	2018-08-27 20:40:55 +02:00
Régis Hanol	de92913bf4	FIX: store the topic links using the cooked upload url	2018-08-14 12:23:32 +02:00
Robin Ward	7058205f70	FIX: Broken specs	2018-07-24 12:00:34 -04:00
Robin Ward	236243f38a	SECURITY: Consider `0.0.0.0` a private IP	2018-07-24 11:16:27 -04:00
Guo Xiang Tan	d43895e2a0	Don't log 404s for `FinalDestination`. * We can't do anything about 404s	2018-05-25 10:11:16 +08:00
Guo Xiang Tan	142571bba0	Remove use of `rescue nil`. * `rescue nil` is a really bad pattern to use in our code base. We should rescue errors that we expect the code to throw and not rescue everything because we're unsure of what errors the code would throw. This would reduce the amount of pain we face when debugging why something isn't working as expected. I've been bitten countless times by errors being swallowed as a result during debugging sessions.	2018-04-02 13:52:51 +08:00
Guo Xiang Tan	ee69d58a59	FIX: Tests could get stuck in an infinite loop if they fail to resolve the IP of a hostname.	2018-03-28 14:49:05 +08:00
Gerhard Schlager	4a54c09e46	FIX: Retry with GET request when HEAD fails with error 400	2018-02-27 12:07:16 +01:00
Régis Hanol	0559a4736a	FIX: don't double request when downloading a file	2018-02-24 12:35:57 +01:00
Gerhard Schlager	b6277e208b	FIX: Cookies header didn't have the right format	2018-02-19 12:46:57 +01:00
Sam	fa5880e04f	PERF: ability to crawl for titles without extra HEAD req Also, introduces a much more aggressive timeout for title crawling and introduces gzip to body that is crawled	2018-01-29 15:40:12 +11:00
Gerhard Schlager	e30851e45a	Move escape_uri method to a more suitable place	2017-12-12 20:17:46 +01:00
Régis Hanol	de037da731	FIX: FinalDestination's small_get method wasn't using proper request headers	2017-11-17 17:24:35 +01:00
Régis Hanol	aebcd56300	FIX: try a GET for error code 406	2017-11-17 16:59:51 +01:00
Régis Hanol	221ff24418	SQL != Ruby	2017-11-17 16:12:20 +01:00
Régis Hanol	a0fc8bd924	don't log 404s to gravatar.com	2017-11-17 15:38:26 +01:00
Sam	3ac7d041ae	UX: generic onebox treats all square images as avatars and renders them smaller	2017-11-13 11:21:19 +11:00
Gerhard Schlager	d1f257d275	FinalDestination should only log when verbose is enabled	2017-10-31 17:16:59 +01:00
Gerhard Schlager	8c27f28dcb	add more logging to FinalDestination	2017-10-31 12:26:35 +01:00
Sam Saffron	8185b8cb06	FEATURE: cache https redirects per hostname If a hostname does an https redirect we cache that so next lookup does not incur it. Also, only rate limit per ip once per final destination Raise final destination protection to 1000 ip lookups an hour	2017-10-17 16:22:54 +11:00

73 Commits