Commit Graph

17 Commits

Author SHA1 Message Date
Osama Sayegh 7bd3986b21
FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131)
We have a couple of site setting, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.

When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt` which leaves the site owners no options if a crawler is crawling the site too aggressively.

This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged in user, Discourse will check the User Agent string and if it contains one of the values of the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of requests made within the same interval will get a 429 response. 

The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit lots if not all of anonymous traffic if the setting is not used appropriately. So to protect against this scenario, we've added a couple of new validations to the setting when it's changed:

1) each value added to setting must 3 characters or longer
2) each value cannot be a substring of tokens found in popular browser User Agent. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
2021-11-30 12:55:25 +03:00
Sam 758e160862
FEATURE: explicitly ban outlier traffic sources in robots.txt (#11553)
Googlebot handles no-index headers very elegantly. It advises to leave as many routes as possible open and uses headers for high fidelity rules regarding indexes.

Discourse adds special `x-robot-tags` noindex headers to users, badges, groups, search and tag routes.

Following up on b52143feff we now have it so Googlebot gets special handling.

Rest of the crawlers get a far more aggressive disallow list to protect against excessive crawling.
2020-12-23 08:51:14 +11:00
Joshua Rosenfeld b12afa9435
Fix spec (#10539) 2020-08-26 17:31:02 -04:00
Krzysztof Kotlarek e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Joshua Rosenfeld f60dc7f5b4
FIX: Broken specs
`/u/` is no longer in robots.txt, so don't test for it
2020-06-25 14:30:57 -04:00
Sam Saffron bb4e8899c4
FEATURE: let Google index pages so it can remove them
Google insists on indexing pages so it can figure out if they
can be removed from the index.

see: https://support.google.com/webmasters/answer/6332384?hl=en

This change ensures the we have special behavior for Googlebot
where we allow indexing, but block the actual indexing via
X-Robots-Tag
2020-05-11 12:15:18 +10:00
Jarek Radosz 781e3f5e10
DEV: Use `response.parsed_body` in specs (#9615)
Most of it was autofixed with rubocop-discourse 2.1.1.
2020-05-07 17:04:12 +02:00
Sam Saffron e7cf4579a8 DEV: improve usability of subfolder specs
Previously people were not consistent about mocking which left internals in
a fragile state when running subfolder specs.

This introduces a simple helper `set_subfolder` which you can use to set
the subfolder for the spec. It takes care of proper configuration of subfolder
and teardown.

```
# usage
set_subfolder "/my_amazing_subfolder"
```

You should no longer stub base_uri or global_settings
2019-11-15 16:48:24 +11:00
Sam Saffron 5feb342914 Revert "FEATURE: add Noindex to robots.txt for disallowed routes"
This reverts commit d84256a876.

This is not supported by Google and causes robots.txt to be flagged as
invalid

Removing Noindex
2019-07-30 11:33:38 +10:00
Osama Sayegh 6515ff19e5
FEATURE: Allow customization of robots.txt (#7884)
* FEATURE: Allow customization of robots.txt

This allows admins to customize/override the content of the robots.txt
file at /admin/customize/robots. That page is not linked to anywhere in
the UI -- admins have to manually type the URL to access that page.

* use Ember.computed.not

* Jeff feedback

* Feedback

* Remove unused import
2019-07-15 20:47:44 +03:00
Sam Saffron 4ea21fa2d0 DEV: use #frozen_string_literal: true on all spec
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.

Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
2019-04-30 10:27:42 +10:00
Sam d84256a876 FEATURE: add Noindex to robots.txt for disallowed routes
This strips pages out of indexes that should not exist see:

https://meta.discourse.org/t/pages-listed-in-the-robots-txt-are-crawled-and-indexed-by-google/100309/11?u=sam
2018-11-02 16:39:47 +11:00
Robin Ward 3d7dbdedc0 FEATURE: An API to help sites build robots.txt files programatically
This is mainly useful for subfolder sites, who need to expose their
robots.txt contents to a parent site.
2018-04-16 15:43:20 -04:00
Régis Hanol df7970a6f6 prefix the robots.txt rules with the directory when using subfolder 2018-04-11 22:05:02 +02:00
Sam 3a7b696703 FEATURE: allow for setting crawl delay per user agent
Also moved to default crawl delay bing so no more than a req every 5 seconds is allowed

New site settings:

"slow_down_crawler_user_agents" - list of crawlers that will be slowed down
"slow_down_crawler_rate" - how many seconds to wait between requests

Not enforced server side yet
2018-04-06 10:15:23 +10:00
Neil Lalonde ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Guo Xiang Tan 77d4c4d8dc Fix all the errors to get our tests green on Rails 5.1. 2017-09-25 13:48:58 +08:00