26 Commits

Author SHA1 Message Date
Osama Sayegh 7bd3986b21
FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131)
We have a couple of site settings, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.

When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt`, which leaves site owners with no options if a crawler is crawling the site too aggressively.

This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged-in user, Discourse checks the User Agent string; if it contains one of the values in the `slow_down_crawler_user_agents` list, Discourse allows only 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting), and any other requests made within the same interval get a 429 response.
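
The behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the PR: the class name `CrawlerRateLimiter`, the injectable `clock`, the in-memory dictionary, case-insensitive matching, and keying the limit on the matched setting value are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Optional


class CrawlerRateLimiter:
    """Hypothetical sketch: allow 1 request every N seconds per slowed-down crawler."""

    def __init__(
        self,
        slow_agents: List[str],
        rate_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Stand-ins for `slow_down_crawler_user_agents` and `slow_down_crawler_rate`.
        self.slow_agents = [agent.lower() for agent in slow_agents]
        self.rate_seconds = rate_seconds
        self.clock = clock
        # Time of the last allowed request, keyed by the matched setting value.
        self._last_allowed: Dict[str, float] = {}

    def matched_agent(self, user_agent: str) -> Optional[str]:
        """Return the first configured value contained in the User Agent string."""
        ua = user_agent.lower()  # assumption: matching ignores case
        for agent in self.slow_agents:
            if agent in ua:
                return agent
        return None

    def status_for(self, user_agent: str, logged_in: bool = False) -> int:
        """Return the HTTP status this sketch would answer with (200 or 429)."""
        if logged_in:
            return 200  # only requests from non-logged-in users are checked
        agent = self.matched_agent(user_agent)
        if agent is None:
            return 200  # not a crawler on the slow-down list
        now = self.clock()
        last = self._last_allowed.get(agent)
        if last is not None and now - last < self.rate_seconds:
            return 429  # a request was already allowed within the interval
        self._last_allowed[agent] = now
        return 200
```

A real deployment would need shared storage for the counters rather than a per-process dictionary; the sketch only shows the allow-one-then-reject logic.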

The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR, since it could rate limit much, if not all, of the anonymous traffic if the setting is not used appropriately. To protect against this scenario, we've added a couple of new validations to the setting when it's changed:

1) each value added to the setting must be 3 characters or longer
2) each value cannot be a substring of tokens found in popular browser User Agent strings. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
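
The two validations listed above could look something like the following sketch. The function name `validate_slow_down_agents`, the error messages, and the trimming and lowercasing of values are assumptions for illustration; only the minimum length and the token list come from the commit message.

```python
from typing import List

# Tokens taken from the prohibited list in the commit message.
PROHIBITED_TOKENS = (
    "apple", "windows", "linux", "ubuntu", "gecko", "firefox", "chrome",
    "safari", "applewebkit", "webkit", "mozilla", "macintosh", "khtml",
    "intel", "osx", "os x", "iphone", "ipad", "mac",
)

MIN_LENGTH = 3


def validate_slow_down_agents(values: List[str]) -> List[str]:
    """Return one error message per rejected value (hypothetical helper)."""
    errors = []
    for value in values:
        candidate = value.strip().lower()  # assumption: compare case-insensitively
        if len(candidate) < MIN_LENGTH:
            errors.append(f"'{value}' must be {MIN_LENGTH} characters or longer")
            continue
        # Reject a value that is a substring of any common browser token,
        # since it would match ordinary browser User Agent strings.
        for token in PROHIBITED_TOKENS:
            if candidate in token:
                errors.append(f"'{value}' is a substring of '{token}'")
                break
    return errors
```

For example, `zilla` would be rejected because it is contained in `mozilla`, while a crawler-specific value such as `bingbot` passes both checks.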
2021-11-30 12:55:25 +03:00
Daniel Waterworth 721ee36425
Replace `base_uri` with `base_path` (#10879)
DEV: Replace instances of Discourse.base_uri with Discourse.base_path

This is clearer because the base_uri is actually just a path prefix. This continues the work started in 555f467.
2020-10-09 12:51:24 +01:00
Sam Saffron 5feb342914 Revert "FEATURE: add Noindex to robots.txt for disallowed routes"
This reverts commit d84256a876.

This is not supported by Google and causes robots.txt to be flagged as
invalid

Removing Noindex
2019-07-30 11:33:38 +10:00
Sam d84256a876 FEATURE: add Noindex to robots.txt for disallowed routes
This strips out of indexes pages that should not be there. See:

https://meta.discourse.org/t/pages-listed-in-the-robots-txt-are-crawled-and-indexed-by-google/100309/11?u=sam
2018-11-02 16:39:47 +11:00
Robin Ward 3d7dbdedc0 FEATURE: An API to help sites build robots.txt files programmatically
This is mainly useful for subfolder sites, which need to expose their
robots.txt contents to a parent site.
2018-04-16 15:43:20 -04:00
Sam 223379e21a per spec we need to repeat disallow paths per agent 2018-04-16 15:38:10 +10:00
Régis Hanol 1a9271dd2f add a warning in robots.txt when using subfolder 2018-04-12 00:00:15 +02:00
Régis Hanol df7970a6f6 prefix the robots.txt rules with the directory when using subfolder 2018-04-11 22:05:02 +02:00
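
The three commits above (repeating disallow paths per agent, and prefixing rules with the directory for subfolder installs) can be pictured with a small sketch. The function `build_robots_txt` and its parameters are hypothetical and are not taken from the Discourse source; the example only shows the shape of the generated file.

```python
from typing import List


def build_robots_txt(
    agents: List[str],
    disallowed_paths: List[str],
    base_path: str = "",
) -> str:
    """Build a robots.txt body with the Disallow rules repeated for each agent.

    `base_path` stands in for the subfolder prefix (for example "/forum");
    leave it empty for a site served from the root.
    """
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        # Each agent gets its own full copy of the rules, so the paths
        # are repeated rather than shared across groups.
        for path in disallowed_paths:
            lines.append(f"Disallow: {base_path}{path}")
        lines.append("")  # blank line between agent groups
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_robots_txt(["Googlebot", "bingbot"], ["/admin/", "/auth/"], "/forum"))
```

Crawlers only read `/robots.txt` at the root of a host, which is presumably why a subfolder install warns the site owner and why the prefixed rules need to be copied into the parent site's file.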
Sam 489c22d93c FEATURE: Disallow tags and categories rss feeds
This stops crawlers from hitting tag and category RSS feeds to discover
new content; instead, they should focus on latest/posts if they need to
consume something regular
2018-04-11 14:36:10 +10:00
Sam f40f10240c FEATURE: remove topic rss from robots
Crawlers love hitting the RSS feeds (confirmed that both Google and Bing do).

Experimenting with the impact of blocking these feeds and forcing crawlers to hit
the content directly. It is better if they hit the actual page to start with, as opposed to:

1. Hit RSS feed
2. Find new content
3. Hit post link
4. Get canonical
5. Hit canonical

Lots of pointless work.

We do not know for sure what impact this will have on newsreader apps,
we will listen for feedback.
2018-04-11 11:57:52 +10:00
Sam 3a7b696703 FEATURE: allow for setting crawl delay per user agent
Also added a default crawl delay for Bing, so no more than one request every 5 seconds is allowed

New site settings:

"slow_down_crawler_user_agents" - list of crawlers that will be slowed down
"slow_down_crawler_rate" - how many seconds to wait between requests

Not enforced server side yet
2018-04-06 10:15:23 +10:00
Arpit Jalan 5e4dd20795 Revert "Prevent robots from indexing uploads"
This reverts commit 0fd622e5d1.
2018-04-02 21:29:29 +05:30
Neil Lalonde ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Dan Nicholson 0fd622e5d1 Prevent robots from indexing uploads
Although most user uploads are probably harmless, it's possible someone
has (either maliciously or not) uploaded sensitive information. Prevent
robots from indexing the uploads route.
2018-03-09 05:51:55 -06:00
Sam e19ae6c55e FEATURE: disallow groups from being indexed 2018-03-02 13:38:30 +11:00
Robin Ward 0776340b29 SECURITY: Prevent robots from indexing more routes
These routes could contain sensitive material and should never be
indexed for content.
2018-02-04 13:24:36 -05:00
Robin Ward 14410b71fb Convert server side paths to use `/u/` 2017-03-30 10:23:24 -04:00
Vinoth Kannan 08c14dd689 new: server plugin outlet for indexable robots.txt 2017-02-13 17:31:10 +05:30
Neil Lalonde ae671355da FIX: add /tags routes to robots.txt 2017-02-03 11:57:00 -05:00
Sam 54645261aa better disallow search ... this could get ugly 2015-04-02 17:08:00 +11:00
Robin Ward e66c53a4a7 Add /badges to robots.txt for now, we don't have a crawlable view so
it's better to exclude it.
2014-10-30 14:32:42 -04:00
Neil Lalonde 8267a451b2 Disallow /users/ in robots.txt 2014-05-23 10:28:26 -04:00
Neil Lalonde 9c4dc9a966 Block browser-update.js in robots.txt. Move noscript block above everything else in application layout. 2014-02-14 15:33:00 -05:00
Sam 7ad00f426c FEATURE REMOVAL: persona login
see: https://meta.discourse.org/t/pulling-persona-out-of-discourse-core/12613
2014-02-11 16:56:48 +11:00
Neil Lalonde 88d9f3a786 Disallow auth callbacks in robots.txt 2014-01-14 10:42:22 -05:00
Sam Saffron c50a9e4d01 added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-11 11:02:57 +11:00