Commit Graph

40 Commits

Author SHA1 Message Date
Roman Rizzi 53abcd825d
FIX: Canonical URLs may be relative (#14825)
FinalDestination's follow_canonical mode used for embedded topics should work when canonical URLs are relative, as specified in [RFC 6596](https://datatracker.ietf.org/doc/html/rfc6596)
2021-11-05 14:20:14 -03:00
Roman Rizzi 4c2d5158c5
FIX: Follow the canonical URL when importing a remote topic. (#14489)
FinalDestination now supports the `follow_canonical` option, which will perform an initial GET request, parse the canonical link if present, and perform a HEAD request to it.

We use this mode during embeds to avoid treating URLs with different query parameters as different topics.
2021-10-01 12:48:21 -03:00
jbrw 2f28ba318c
FEATURE: Onebox can match engines based on the content_type (#13876)
* FEATURE: Onebox can match engines based on the content_type

`FinalDestination` now returns the `content_type` of a resolved URL.

`Oneboxer` passes this value to `Onebox` itself. Onebox engines can now specify a `matches_content_type` regex of content_types that the engine can handle, regardless of the URL.

`ImageOnebox` will match URLs with a content type of `image/png`, `jpg`, `gif`, `bmp`, `tif`, etc.

This will allow images that exist at a URL without a file type extension to be correctly rendered, assuming a valid `content_type` is returned.
2021-07-30 13:36:30 -04:00
jbrw 19182b1386
DEV: Oneboxer wildcard subdomains (#13015)
* DEV: Allow wildcards in Oneboxer optional domain Site Settings

Allows a wildcard to be used as a subdomain on Oneboxer-related SiteSettings, e.g.:

- `force_get_hosts`
- `cache_onebox_response_body_domains`
- `force_custom_user_agent_hosts`

* DEV: fix typos

* FIX: Try doing a GET after receiving a 500 error from a HEAD

By default we try to do a `HEAD` requests. If this results in a 500 error response, we should try to do a `GET`

* DEV: `force_get_hosts` should be a hidden setting

* DEV: Oneboxer Strategies

Have an alternative oneboxing ‘strategy’ (i.e., set of options) to use when an attempt to generate a Onebox fails. Keep track of any non-default strategies that were used on a particular host, and use that strategy for that host in the future.

Initially, the alternate strategy (`force_get_and_ua`) forces the FinalDestination step of Oneboxing to do a `GET` rather than `HEAD`, and forces a custom user agent.

* DEV: change stubbed return code

The stubbed status code needs to be a value not recognized by FinalDestination
2021-05-13 15:48:35 -04:00
jbrw da9b837da0
DEV: More robust processing of URLs (#11361)
* DEV: More robust processing of URLs

The previous `UrlHelper.encode_component(CGI.unescapeHTML(UrlHelper.unencode(uri))` method would naively process URLs, which could result in a badly formed response.

`Addressable::URI.normalized_encode(uri)` appears to deal with these edge-cases in a more robust way.

* DEV: onebox should use UrlHelper

* DEV: fix spec

* DEV: Escape output when rendering local links
2020-12-03 17:16:01 -05:00
Krzysztof Kotlarek e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Arpit Jalan b0e781e2d4 FIX: do not follow redirect on same host with path /login or /session 2019-08-07 16:26:55 +05:30
Sam Saffron 4ea21fa2d0 DEV: use #frozen_string_literal: true on all spec
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.

Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
2019-04-30 10:27:42 +10:00
Arpit Jalan 1ab91f0474 FIX: preserve github fragment URL 2018-12-19 12:34:47 +05:30
Bianca Nenciu b6963b8ffb FIX: Ignore OneBox blacklisted domains. 2018-08-27 20:40:55 +02:00
Robin Ward 7058205f70 FIX: Broken specs 2018-07-24 12:00:34 -04:00
Robin Ward 236243f38a SECURITY: Consider `0.0.0.0` a private IP 2018-07-24 11:16:27 -04:00
Guo Xiang Tan 142571bba0 Remove use of `rescue nil`.
* `rescue nil` is a really bad pattern to use in our code base.
  We should rescue errors that we expect the code to throw and
  not rescue everything because we're unsure of what errors the
  code would throw. This would reduce the amount of pain we face
  when debugging why something isn't working as expexted. I've
  been bitten countless of times by errors being swallowed as a
  result during debugging sessions.
2018-04-02 13:52:51 +08:00
Régis Hanol 0559a4736a FIX: don't double request when downloading a file 2018-02-24 12:35:57 +01:00
Gerhard Schlager b6277e208b FIX: Cookies header didn't have the right format 2018-02-19 12:46:57 +01:00
Sam fa5880e04f PERF: ability to crawl for titles without extra HEAD req
Also, introduces a much more aggressive timeout for title crawling
and introduces gzip to body that is crawled
2018-01-29 15:40:12 +11:00
Sam 1dd2b51059 remove redundent stubs 2017-10-18 12:10:30 +11:00
Sam Saffron 8185b8cb06 FEATURE: cache https redirects per hostname
If a hostname does an https redirect we cache that so next
lookup does not incur it.

Also, only rate limit per ip once per final destination

Raise final destination protection to 1000 ip lookups an hour
2017-10-17 16:22:54 +11:00
Sam 70bb2aa426 FEATURE: allow specifying s3 config via globals
This refactors handling of s3 so it can be specified via GlobalSetting

This means that in a multisite environment you can configure s3 uploads
without actual sites knowing credentials in s3

It is a critical setting for situations where assets are mirrored to s3.
2017-10-06 16:20:01 +11:00
Guo Xiang Tan 5324c01209 FIX: Don't raise an error if reading from URL timeout. 2017-09-27 14:53:22 +08:00
Guo Xiang Tan 367fb1c524 FIX: Onebox fails on encoded URL.
https://meta.discourse.org/t/onebox-breaks-if-theres-chinese-text-in-url/67364
2017-09-26 18:34:54 +08:00
Joffrey JAFFEUX 6cd8203686 FIX: allows onebox to force GET hosts returning wrong headers on HEAD 2017-08-08 11:44:27 +02:00
Arpit Jalan b059a0f789 extract url escaping to a dedicated class method and improved tests 2017-07-29 22:16:51 +05:30
Arpit Jalan 1fe553873c FIX: preserve fragment identifier when escaping url 2017-07-29 17:22:45 +05:30
Guo Xiang Tan b534778f46 FIX: Escape URL before attempting to resolve it. 2017-07-18 10:04:24 +09:00
Robin Ward db485ae0da FIX: Support for skipping redirects on certain domains (like steam) 2017-06-26 15:38:43 -04:00
Robin Ward 009f0921dc FEATURE: Whitelist hosts for internal crawling 2017-06-13 12:59:54 -04:00
Robin Ward a3729b51eb FIX: Always allow the host the forum is hosted on 2017-06-12 13:22:51 -04:00
Robin Ward 53b95f009f FIX: If HEAD is not supported, try GET. Also set cookies 2017-06-06 13:53:49 -04:00
Guo Xiang Tan 56f98de7b2 Use webmock to stub external web requests. 2017-05-26 15:19:09 +08:00
Guo Xiang Tan f8f1548fd4 Revert "FIX: Use Excon to do its own stubbing"
This reverts commit 80af54460a.
2017-05-26 13:04:25 +08:00
Robin Ward 3b0cbf7013 FIX: Always allow downloads from CDN 2017-05-23 16:32:54 -04:00
Robin Ward b81e7be9a1 FEATURE: Rate limit how often we'll crawl a destination IP 2017-05-23 15:03:04 -04:00
Robin Ward 36e477750c FIX: Use same code path for downloading images 2017-05-23 14:51:30 -04:00
Robin Ward e5e7a15a85 SECURITY: Never crawl by IP 2017-05-23 13:07:18 -04:00
Robin Ward 93a5fc62bf FEATURE: A site setting to prevent crawling on private IP blocks 2017-05-23 11:56:06 -04:00
Robin Ward 80af54460a FIX: Use Excon to do its own stubbing 2017-05-22 18:19:20 -04:00
Robin Ward b51126dd5e FIX: Reset the WebMock after before every test 2017-05-22 17:52:31 -04:00
Robin Ward 4c690f7089 Use `FinalDestination` to ensure public redirects for onebox 2017-05-22 16:42:49 -04:00
Robin Ward b23fc2bf84 Helper to find the final destination for a URL 2017-05-22 15:52:41 -04:00