discourse

Commit Graph

Author	SHA1	Message	Date
Osama Sayegh	7bd3986b21	FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131) We have a couple of site settings, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down. When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt`, which leaves site owners with no options if a crawler is crawling the site too aggressively. This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged-in user, Discourse will check the User Agent string and if it contains one of the values of the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of the requests made within the same interval will get a 429 response. The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit much, if not all, of anonymous traffic if the setting is not used appropriately. So to protect against this scenario, we've added a couple of new validations to the setting when it's changed: 1) each value added to the setting must be 3 characters or longer 2) each value cannot be a substring of tokens found in popular browser User Agent strings. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.	2021-11-30 12:55:25 +03:00
Sam	758e160862	FEATURE: explicitly ban outlier traffic sources in robots.txt (#11553) Googlebot handles no-index headers very elegantly. It advises to leave as many routes as possible open and uses headers for high fidelity rules regarding indexes. Discourse adds special `x-robot-tags` noindex headers to users, badges, groups, search and tag routes. Following up on `b52143feff` we now have it so Googlebot gets special handling. Rest of the crawlers get a far more aggressive disallow list to protect against excessive crawling.	2020-12-23 08:51:14 +11:00
Joshua Rosenfeld	b12afa9435	Fix spec (#10539)	2020-08-26 17:31:02 -04:00
Krzysztof Kotlarek	e0d9232259	FIX: use allowlist and blocklist terminology (#10209) This PR renames whitelist to allowlist and blacklist to blocklist.	2020-07-27 10:23:54 +10:00
Joshua Rosenfeld	f60dc7f5b4	FIX: Broken specs `/u/` is no longer in robots.txt, so don't test for it	2020-06-25 14:30:57 -04:00
Sam Saffron	bb4e8899c4	FEATURE: let Google index pages so it can remove them Google insists on indexing pages so it can figure out if they can be removed from the index. see: https://support.google.com/webmasters/answer/6332384?hl=en This change ensures that we have special behavior for Googlebot where we allow indexing, but block the actual indexing via X-Robots-Tag	2020-05-11 12:15:18 +10:00
Jarek Radosz	781e3f5e10	DEV: Use `response.parsed_body` in specs (#9615) Most of it was autofixed with rubocop-discourse 2.1.1.	2020-05-07 17:04:12 +02:00
Sam Saffron	e7cf4579a8	DEV: improve usability of subfolder specs Previously people were not consistent about mocking which left internals in a fragile state when running subfolder specs. This introduces a simple helper `set_subfolder` which you can use to set the subfolder for the spec. It takes care of proper configuration of subfolder and teardown. Usage: `set_subfolder "/my_amazing_subfolder"` You should no longer stub base_uri or global_settings	2019-11-15 16:48:24 +11:00
Sam Saffron	5feb342914	Revert "FEATURE: add Noindex to robots.txt for disallowed routes" This reverts commit `d84256a876`. This is not supported by Google and causes robots.txt to be flagged as invalid Removing Noindex	2019-07-30 11:33:38 +10:00
Osama Sayegh	6515ff19e5	FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import	2019-07-15 20:47:44 +03:00
Sam Saffron	4ea21fa2d0	DEV: use #frozen_string_literal: true on all specs This change both speeds up specs (fewer strings to allocate) and helps catch cases where methods in Discourse are mutating inputs. Overall we will be migrating everything to use #frozen_string_literal: true. It will take a while, but this is the first and safest move in this direction	2019-04-30 10:27:42 +10:00
Sam	d84256a876	FEATURE: add Noindex to robots.txt for disallowed routes This strips pages out of indexes that should not exist see: https://meta.discourse.org/t/pages-listed-in-the-robots-txt-are-crawled-and-indexed-by-google/100309/11?u=sam	2018-11-02 16:39:47 +11:00
Robin Ward	3d7dbdedc0	FEATURE: An API to help sites build robots.txt files programmatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site.	2018-04-16 15:43:20 -04:00
Régis Hanol	df7970a6f6	prefix the robots.txt rules with the directory when using subfolder	2018-04-11 22:05:02 +02:00
Sam	3a7b696703	FEATURE: allow for setting crawl delay per user agent Also set a default crawl delay for bing so that no more than one request every 5 seconds is allowed. New site settings: "slow_down_crawler_user_agents" - list of crawlers that will be slowed down "slow_down_crawler_rate" - how many seconds to wait between requests Not enforced server side yet	2018-04-06 10:15:23 +10:00
Neil Lalonde	ced7e9a691	FEATURE: control which web crawlers can access using a whitelist or blacklist	2018-03-22 15:41:02 -04:00
Guo Xiang Tan	77d4c4d8dc	Fix all the errors to get our tests green on Rails 5.1.	2017-09-25 13:48:58 +08:00

17 Commits
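
The behavior described in commit `7bd3986b21` (two validations on `slow_down_crawler_user_agents`, then one allowed request per `slow_down_crawler_rate` seconds for a matching User Agent, with 429 for the rest) could be sketched roughly as below. This is an in-memory illustration only, not Discourse's implementation: `valid_crawler_value?`, `CrawlerLimiter`, `status_for`, the case-insensitive matching, and tracking by matched setting value are all assumptions made for the example.

```ruby
# Hypothetical sketch; none of these names come from the Discourse codebase.

# Prohibited values listed in the commit message.
PROHIBITED_TOKENS = %w[
  apple windows linux ubuntu gecko firefox chrome safari applewebkit webkit
  mozilla macintosh khtml intel osx iphone ipad mac
].push("os x").freeze

# True when a setting value passes both validations from the commit message:
# 1) at least 3 characters, 2) not a substring of a common browser token.
# Case-insensitive comparison is an assumption of this sketch.
def valid_crawler_value?(value)
  v = value.to_s.downcase
  return false if v.length < 3
  PROHIBITED_TOKENS.none? { |token| token.include?(v) }
end

# Allows one anonymous request per `rate` seconds for each configured value.
class CrawlerLimiter
  def initialize(user_agents:, rate:)
    @user_agents = user_agents.map(&:downcase)
    @rate = rate
    @last_allowed = {} # setting value => time of the last allowed request
  end

  # Returns an HTTP status code: 200 when allowed, 429 when rate limited.
  def status_for(user_agent, now: Time.now.to_f, logged_in: false)
    return 200 if logged_in # only non-logged-in requests are checked

    ua = user_agent.to_s.downcase
    token = @user_agents.find { |t| ua.include?(t) }
    return 200 if token.nil? # User Agent matches no configured value

    last = @last_allowed[token]
    return 429 if last && now - last < @rate

    @last_allowed[token] = now
    200
  end
end

limiter = CrawlerLimiter.new(user_agents: ["bingbot"], rate: 5)
limiter.status_for("Mozilla/5.0 (compatible; bingbot/2.0)", now: 0.0) # => 200
limiter.status_for("Mozilla/5.0 (compatible; bingbot/2.0)", now: 3.0) # => 429
```

A production version would need shared storage across server processes rather than a per-process hash; the sketch keeps state in memory only to stay self-contained.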