discourse

Commit Graph

Author	SHA1	Message	Date
Osama Sayegh	7bd3986b21	FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131) We have a couple of site settings, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down. When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt`, which leaves site owners with no options if a crawler is crawling the site too aggressively. This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged-in user, Discourse will check the User Agent string and if it contains one of the values of the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of the requests made within the same interval will get a 429 response. The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit much, if not all, of anonymous traffic if the setting is not used appropriately. So to protect against this scenario, we've added a couple of new validations to the setting when it's changed: 1) each value added to the setting must be 3 characters or longer 2) each value cannot be a substring of tokens found in popular browser User Agent strings. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.	2021-11-30 12:55:25 +03:00
Daniel Waterworth	721ee36425	Replace `base_uri` with `base_path` (#10879) DEV: Replace instances of Discourse.base_uri with Discourse.base_path This is clearer because the base_uri is actually just a path prefix. This continues the work started in `555f467`.	2020-10-09 12:51:24 +01:00
Sam Saffron	bb4e8899c4	FEATURE: let Google index pages so it can remove them Google insists on indexing pages so it can figure out if they can be removed from the index. see: https://support.google.com/webmasters/answer/6332384?hl=en This change ensures that we have special behavior for Googlebot where we allow indexing, but block the actual indexing via X-Robots-Tag	2020-05-11 12:15:18 +10:00
Sam Saffron	5feb342914	Revert "FEATURE: add Noindex to robots.txt for disallowed routes" This reverts commit `d84256a876`. This is not supported by Google and causes robots.txt to be flagged as invalid Removing Noindex	2019-07-30 11:33:38 +10:00
Sam	d84256a876	FEATURE: add Noindex to robots.txt for disallowed routes This strips pages out of indexes that should not exist see: https://meta.discourse.org/t/pages-listed-in-the-robots-txt-are-crawled-and-indexed-by-google/100309/11?u=sam	2018-11-02 16:39:47 +11:00
Robin Ward	3d7dbdedc0	FEATURE: An API to help sites build robots.txt files programmatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site.	2018-04-16 15:43:20 -04:00
Sam	223379e21a	per spec we need to repeat disallow paths per agent	2018-04-16 15:38:10 +10:00
Régis Hanol	1a9271dd2f	add a warning in robots.txt when using subfolder	2018-04-12 00:00:15 +02:00
Régis Hanol	df7970a6f6	prefix the robots.txt rules with the directory when using subfolder	2018-04-11 22:05:02 +02:00
Sam	489c22d93c	FEATURE: Disallow tags and categories rss feeds This stops crawlers from hitting tags and category rss feeds to discover new content, instead they should focus on latest/posts if they need to consume something regular	2018-04-11 14:36:10 +10:00
Sam	f40f10240c	FEATURE: remove topic rss from robots Crawlers love hitting the rss feeds (confirmed that both Google and Bing do) Experimenting with the impact of blocking these feeds and forcing crawlers to hit the content directly. It is better if they hit the actual page to start with as opposed to 1. Hit RSS feed 2. Find new content 3. Hit post link 4. Get canonical 5. Hit canonical Lots of pointless work. We do not know for sure what impact this will have on newsreader apps, we will listen for feedback.	2018-04-11 11:57:52 +10:00
Sam	3a7b696703	FEATURE: allow for setting crawl delay per user agent Also moved to default crawl delay bing so no more than a req every 5 seconds is allowed New site settings: "slow_down_crawler_user_agents" - list of crawlers that will be slowed down "slow_down_crawler_rate" - how many seconds to wait between requests Not enforced server side yet	2018-04-06 10:15:23 +10:00
Arpit Jalan	5e4dd20795	Revert "Prevent robots from indexing uploads" This reverts commit `0fd622e5d1`.	2018-04-02 21:29:29 +05:30
Neil Lalonde	ced7e9a691	FEATURE: control which web crawlers can access using a whitelist or blacklist	2018-03-22 15:41:02 -04:00
Dan Nicholson	0fd622e5d1	Prevent robots from indexing uploads Although most user uploads are probably harmless, it's possible someone has (either maliciously or not) uploaded sensitive information. Prevent robots from indexing the uploads route.	2018-03-09 05:51:55 -06:00
Sam	e19ae6c55e	FEATURE: disallow groups from being indexed	2018-03-02 13:38:30 +11:00
Robin Ward	0776340b29	SECURITY: Prevent robots from indexing more routes These routes could contain sensitive material and should never be indexed for content.	2018-02-04 13:24:36 -05:00
Guo Xiang Tan	77d4c4d8dc	Fix all the errors to get our tests green on Rails 5.1.	2017-09-25 13:48:58 +08:00
Robin Ward	14410b71fb	Convert server side paths to use `/u/`	2017-03-30 10:23:24 -04:00
Vinoth Kannan	08c14dd689	new: server plugin outlet for indexable robots.txt	2017-02-13 17:31:10 +05:30
Neil Lalonde	ae671355da	FIX: add /tags routes to robots.txt	2017-02-03 11:57:00 -05:00
Sam	54645261aa	better disallow search ... this could get ugly	2015-04-02 17:08:00 +11:00
Robin Ward	e66c53a4a7	Add /badges to robots.txt for now, we don't have a crawlable view so it's better to exclude it.	2014-10-30 14:32:42 -04:00
Neil Lalonde	8267a451b2	Disallow /users/ in robots.txt	2014-05-23 10:28:26 -04:00
Neil Lalonde	9c4dc9a966	Block browser-update.js in robots.txt. Move noscript block above everything else in application layout.	2014-02-14 15:33:00 -05:00
Sam	7ad00f426c	FEATURE REMOVAL: persona login see: https://meta.discourse.org/t/pulling-persona-out-of-discourse-core/12613	2014-02-11 16:56:48 +11:00
Neil Lalonde	88d9f3a786	Disallow auth callbacks in robots.txt	2014-01-14 10:42:22 -05:00
Sam Saffron	c50a9e4d01	added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false	2013-02-11 11:02:57 +11:00

28 Commits
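
The behavior described in commit 7bd3986b21 (the two setting validations and the one-request-every-N-seconds limit) can be sketched roughly as follows. This is an illustrative Python sketch, not Discourse's actual Ruby implementation: the names `validate_crawler_value` and `CrawlerRateLimiter` are hypothetical, case-insensitive matching is an assumption, and the limiter measures time since the last allowed request as a simplification of the "interval" mentioned in the commit message.

```python
import time
from typing import Callable, Dict, List

# Prohibited values as listed in the commit message for 7bd3986b21.
PROHIBITED_TOKENS = [
    "apple", "windows", "linux", "ubuntu", "gecko", "firefox", "chrome",
    "safari", "applewebkit", "webkit", "mozilla", "macintosh", "khtml",
    "intel", "osx", "os x", "iphone", "ipad", "mac",
]


def validate_crawler_value(value: str) -> bool:
    """Return True if a value may be added to the crawler list.

    Rule 1: the value must be 3 characters or longer.
    Rule 2: the value must not be a substring of any prohibited token.
    Lowercasing before comparing is an assumption of this sketch.
    """
    v = value.strip().lower()
    if len(v) < 3:
        return False
    return not any(v in token for token in PROHIBITED_TOKENS)


class CrawlerRateLimiter:
    """Hypothetical limiter: at most 1 request per `rate_seconds` per matched value."""

    def __init__(
        self,
        user_agents: List[str],
        rate_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        for ua in user_agents:
            if not validate_crawler_value(ua):
                raise ValueError(f"unsafe crawler value: {ua!r}")
        self.user_agents = [ua.strip().lower() for ua in user_agents]
        self.rate_seconds = rate_seconds
        self.clock = clock
        # Timestamp of the last allowed request, keyed by matched setting value.
        self.last_allowed: Dict[str, float] = {}

    def status_for(self, user_agent: str, logged_in: bool = False) -> int:
        """Return the HTTP status this sketch would answer with (200 or 429)."""
        if logged_in:
            return 200  # only requests from non-logged-in users are checked
        ua = user_agent.lower()
        matched = next((v for v in self.user_agents if v in ua), None)
        if matched is None:
            return 200  # User Agent is not on the slow-down list
        now = self.clock()
        last = self.last_allowed.get(matched)
        if last is not None and now - last < self.rate_seconds:
            return 429  # too soon after the last allowed request
        self.last_allowed[matched] = now
        return 200
```

With `CrawlerRateLimiter(["bingbot"], 5)`, a second anonymous request whose User Agent contains `bingbot` within 5 seconds of an allowed one would get 429, while ordinary browser traffic is unaffected. The validation step is what keeps a value such as `web` or `chrome` from throttling most anonymous visitors.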