We have a couple of site settings, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.
When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt`, which leaves site owners with no options if a crawler is crawling the site too aggressively.
This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by an anonymous (not logged-in) user, Discourse checks the User Agent string. If it contains one of the values in the `slow_down_crawler_user_agents` list, Discourse allows only 1 request every N seconds for that User Agent, where N is the value of the `slow_down_crawler_rate` setting. Any other requests made within the same interval get a 429 response.
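The rule described above can be sketched roughly as follows. This is a minimal in-memory illustration, not Discourse's actual implementation: the `CrawlerRateLimiter` class, its `status_for` method, and the injectable `clock` are hypothetical names, and both the case-insensitive matching and the choice to key the limit on the matched setting value are assumptions.

```ruby
# Hypothetical sketch; not the code added by this PR.
class CrawlerRateLimiter
  def initialize(slow_down_agents:, rate_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    # Assumption: matching is case-insensitive.
    @agents = slow_down_agents.map(&:downcase)
    @rate_seconds = rate_seconds
    @clock = clock
    # Assumption: the limit is tracked per matched setting value.
    @last_allowed = {}
  end

  # Returns 200 if the request may proceed, 429 if it is rate limited.
  def status_for(user_agent:, logged_in:)
    # Only anonymous requests are checked.
    return 200 if logged_in

    ua = user_agent.to_s.downcase
    token = @agents.find { |agent| ua.include?(agent) }
    return 200 if token.nil?

    now = @clock.call
    last = @last_allowed[token]
    # Reject anything that arrives less than N seconds after the last allowed request.
    return 429 if last && (now - last) < @rate_seconds

    @last_allowed[token] = now
    200
  end
end
```

With `rate_seconds: 5`, a crawler whose User Agent contains a listed value would get one successful response, then 429 responses until 5 seconds have passed since that response.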
The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR, since it could rate limit much, if not all, of the anonymous traffic if the setting is not used appropriately. To protect against this scenario, we've added a couple of new validations that run when the setting is changed:
1) each value added to the setting must be 3 characters or longer
2) each value cannot be a substring of a token found in popular browser User Agent strings. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
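The two validations above could look something like the sketch below. The constant and method names are hypothetical, the input is assumed to be an array of already-split values, and the comparison is assumed to be case-insensitive; only the token list is taken from the description.

```ruby
# Hypothetical sketch of the two checks; names are illustrative only.
PROHIBITED_USER_AGENT_TOKENS = [
  "apple", "windows", "linux", "ubuntu", "gecko", "firefox", "chrome",
  "safari", "applewebkit", "webkit", "mozilla", "macintosh", "khtml",
  "intel", "osx", "os x", "iphone", "ipad", "mac"
].freeze

# Returns a list of error messages; an empty list means every value is acceptable.
def validate_slow_down_agents(values)
  values.each_with_object([]) do |value, errors|
    candidate = value.downcase # assumption: case-insensitive comparison
    if candidate.length < 3
      errors << "#{value}: must be 3 characters or longer"
    elsif PROHIBITED_USER_AGENT_TOKENS.any? { |token| token.include?(candidate) }
      # The value is a substring of (or equal to) a common browser token.
      errors << "#{value}: is part of a common browser User Agent token"
    end
  end
end
```

Under these assumptions, a value such as `fox` is rejected because it is a substring of `firefox`, while a crawler-specific value that appears in none of the tokens passes.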
Previously, we blocked search engines from tag pages, since those pages may get marked as duplicate content.
* DEV: block tag inner pages from search engines crawling.
Googlebot handles noindex headers very elegantly. Google advises leaving as many routes as possible open and using headers for high-fidelity indexing rules.
Discourse adds special `X-Robots-Tag` noindex headers to users, badges, groups, search and tag routes.
Following up on b52143feff8c32f2 we now have it so Googlebot gets special handling.
All other crawlers get a far more aggressive disallow list to protect against excessive crawling.
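As a rough illustration of the split described above, a robots.txt with a dedicated Googlebot group and a stricter catch-all group could be generated like this. The path lists and method names are hypothetical placeholders, not the lists Discourse actually ships.

```ruby
# Hypothetical path lists, for illustration only.
GOOGLEBOT_DISALLOWED = ["/admin/"].freeze
DEFAULT_DISALLOWED = ["/admin/", "/u/", "/badges/", "/g/", "/search", "/tag/"].freeze

# Builds one robots.txt group: a User-agent line followed by Disallow lines.
def robots_section(user_agent, paths)
  (["User-agent: #{user_agent}"] + paths.map { |path| "Disallow: #{path}" }).join("\n")
end

# Googlebot gets its own short group; every other crawler falls back to "*".
def robots_txt
  [
    robots_section("Googlebot", GOOGLEBOT_DISALLOWED),
    robots_section("*", DEFAULT_DISALLOWED)
  ].join("\n\n") + "\n"
end
```

A crawler follows the group that most specifically matches its name, so Googlebot would use its own group and rely on the noindex headers for the routes left open.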
DEV: Replace instances of Discourse.base_uri with Discourse.base_path
This is clearer because the base_uri is actually just a path prefix. This continues the work started in 555f467.
Google no longer supports the use of robots.txt to block indexing.
See https://support.google.com/webmasters/answer/6062608 and
https://support.google.com/webmasters/answer/93710
Previous commits have added the `noindex` header to appropriate pages,
now we need to remove the paths from robots.txt so the pages can be
crawled.
Follow up to:
13f229808a22db9e1032832a313ab701b66614c8
b6765aac4b532c026418a7ffd9effd0741ab8a37
676be3a853454a33cf627c3d570feb37d3bb0bfd
07b728c5e557c9aae91c51f3eaac5c32d479f2a2
c94e6a9a66757ea48d99e3ee8d880523871cb6f4
* FEATURE: Allow customization of robots.txt
This allows admins to customize/override the content of the robots.txt
file at /admin/customize/robots. That page is not linked to anywhere in
the UI -- admins have to manually type the URL to access that page.
* use Ember.computed.not
* Jeff feedback
* Feedback
* Remove unused import
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.
Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
Also moved Bing to a default crawl delay, so that no more than one request every 5 seconds is allowed
New site settings:
"slow_down_crawler_user_agents" - list of crawlers that will be slowed down
"slow_down_crawler_rate" - how many seconds to wait between requests
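For illustration, the two settings could translate into robots.txt groups along these lines. This is a sketch with a hypothetical method name, assuming the user agents arrive as an array and the rate as a number of seconds.

```ruby
# Hypothetical sketch: emit one robots.txt group per slowed-down crawler.
def crawl_delay_rules(user_agents, rate_seconds)
  user_agents
    .map { |agent| "User-agent: #{agent}\nCrawl-delay: #{rate_seconds}\n" }
    .join("\n")
end

puts crawl_delay_rules(["bingbot"], 5)
```

Because `Crawl-delay` is only a hint in robots.txt, a crawler is free to ignore it, which is consistent with the note that nothing is enforced server side.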
Not enforced server side yet