Commit Graph

24 Commits

Author SHA1 Message Date
Osama Sayegh 7bd3986b21
FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131)
We have a couple of site setting, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.

When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt` which leaves the site owners no options if a crawler is crawling the site too aggressively.

This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged in user, Discourse will check the User Agent string and if it contains one of the values of the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of requests made within the same interval will get a 429 response. 

The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit lots if not all of anonymous traffic if the setting is not used appropriately. So to protect against this scenario, we've added a couple of new validations to the setting when it's changed:

1) each value added to setting must 3 characters or longer
2) each value cannot be a substring of tokens found in popular browser User Agent. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
2021-11-30 12:55:25 +03:00
ByteHamster 36ec09a07b
FIX: Do not block `uploads` path in robots.txt (#12349)
The `/u` rule also matches the `/uploads` path, which prevents Twitter from showing the site logo in its link previews.
2021-03-11 09:36:49 -05:00
Vinoth Kannan e3d8e828b8
FEATURE: allow search engines to index tag pages. (#12248)
Previously, we blocked search engines in tag pages since they may get marked as a duplicate content.

* DEV: block tag inner pages from search engines crawling.
2021-03-09 23:55:57 +05:30
Sam 758e160862
FEATURE: explicitly ban outlier traffic sources in robots.txt (#11553)
Googlebot handles no-index headers very elegantly. It advises to leave as many routes as possible open and uses headers for high fidelity rules regarding indexes.

Discourse adds special `x-robot-tags` noindex headers to users, badges, groups, search and tag routes.

Following up on b52143feff we now have it so Googlebot gets special handling.

Rest of the crawlers get a far more aggressive disallow list to protect against excessive crawling.
2020-12-23 08:51:14 +11:00
Daniel Waterworth 721ee36425
Replace `base_uri` with `base_path` (#10879)
DEV: Replace instances of Discourse.base_uri with Discourse.base_path

This is clearer because the base_uri is actually just a path prefix. This continues the work started in 555f467.
2020-10-09 12:51:24 +01:00
Joshua Rosenfeld 1e6d125db6
FIX: Remove additional paths from robots.txt
admin path is protected by guardian and thus inaccessible to crawlers

Follow up to b52143f
2020-08-26 16:52:22 -04:00
Krzysztof Kotlarek e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Joshua Rosenfeld b52143feff
FIX: Remove paths from robots.txt in favor of noindex header
Google no longer supports the use of robots.txt to block indexing.
See https://support.google.com/webmasters/answer/6062608 and
https://support.google.com/webmasters/answer/93710

Previous commits have added the `noindex` header to appropriate pages,
now we need to remove the paths from robots.txt so the pages can be
crawled.

Follow up to:
13f229808a
b6765aac4b
676be3a853
07b728c5e5
c94e6a9a66
2020-06-25 13:55:06 -04:00
Osama Sayegh 6515ff19e5
FEATURE: Allow customization of robots.txt (#7884)
* FEATURE: Allow customization of robots.txt

This allows admins to customize/override the content of the robots.txt
file at /admin/customize/robots. That page is not linked to anywhere in
the UI -- admins have to manually type the URL to access that page.

* use Ember.computed.not

* Jeff feedback

* Feedback

* Remove unused import
2019-07-15 20:47:44 +03:00
Sam Saffron 30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Sam baa72d18f8 FIX: simplify so we ban all auth paths
previously plugins that have auth paths were not disallowed and robots
tend to call them
2018-08-16 19:16:47 +10:00
Sam Saffron 030e322a39 FEATURE: block top level /my/ routes 2018-06-12 19:47:45 +10:00
Robin Ward 3d7dbdedc0 FEATURE: An API to help sites build robots.txt files programatically
This is mainly useful for subfolder sites, who need to expose their
robots.txt contents to a parent site.
2018-04-16 15:43:20 -04:00
Régis Hanol df7970a6f6 prefix the robots.txt rules with the directory when using subfolder 2018-04-11 22:05:02 +02:00
Sam 3a7b696703 FEATURE: allow for setting crawl delay per user agent
Also moved to default crawl delay bing so no more than a req every 5 seconds is allowed

New site settings:

"slow_down_crawler_user_agents" - list of crawlers that will be slowed down
"slow_down_crawler_rate" - how many seconds to wait between requests

Not enforced server side yet
2018-04-06 10:15:23 +10:00
Neil Lalonde ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Guo Xiang Tan 77d4c4d8dc Fix all the errors to get our tests green on Rails 5.1. 2017-09-25 13:48:58 +08:00
Régis Hanol d75cc67d86 FIX: robots.txt should be accessible even when login is required 2015-10-15 11:42:41 +02:00
Sam e5888cf090 PERF: avoid preloading json in cases where it is not needed
(uploads / avatars / non GET requests)
2015-05-20 17:12:16 +10:00
Neil Lalonde a86b35c873 Remove the access_password site setting 2013-06-25 15:05:25 -04:00
Robin Ward d2596c3c4c Remove unusued site_settings, show checkbox in UI for boolean values, remove restrict_access
boolean to avoid locking yourself out by setting access_password to empty string. Minor
UI tweaks.
2013-03-01 14:27:41 -05:00
Gosha Arinich cafc75b238 remove trailing whitespaces ❤️ 2013-02-26 07:31:35 +03:00
Sam Saffron 1c12c91d0c forgot to skip a filter 2013-02-11 17:14:36 +11:00
Sam Saffron c50a9e4d01 added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-11 11:02:57 +11:00