discourse/spec
Osama Sayegh 7bd3986b21
FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131)
We have a couple of site settings, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.

When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt`, which leaves site owners with no options if a crawler is crawling the site too aggressively.

This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged-in user, Discourse will check the User Agent string, and if it contains one of the values in the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting); the remaining requests made within the same interval will get a 429 response.
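
The behaviour described above can be sketched roughly as follows. This is an illustrative, in-memory sketch only, not the code in this PR: the `CrawlerThrottle` class, its method names, the injectable clock, the case-insensitive matching, and the choice to key the limit on the matched list value are all assumptions made for the example.

```ruby
# Hypothetical sketch of "1 request every N seconds per matched crawler".
# Not Discourse's implementation; all names here are made up for illustration.
class CrawlerThrottle
  DEFAULT_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  # user_agents:  values from a slow_down_crawler_user_agents-style list
  # rate_seconds: N, the value of a slow_down_crawler_rate-style setting
  # clock:        callable returning the current time in seconds (injectable for tests)
  def initialize(user_agents:, rate_seconds:, clock: DEFAULT_CLOCK)
    @user_agents = user_agents.map(&:downcase) # assumption: case-insensitive match
    @rate_seconds = rate_seconds
    @clock = clock
    @last_allowed = {} # matched list value => time of the last allowed request
  end

  # Returns the HTTP status the request would get: 200 (allowed) or 429 (throttled).
  def status_for(user_agent:, logged_in:)
    return 200 if logged_in # only non-logged-in requests are checked

    ua = user_agent.to_s.downcase
    match = @user_agents.find { |value| ua.include?(value) }
    return 200 if match.nil? # User Agent is not on the list

    now = @clock.call
    last = @last_allowed[match]
    return 429 if last && (now - last) < @rate_seconds

    @last_allowed[match] = now
    200
  end
end
```

With `rate_seconds: 5`, a second anonymous request from a listed crawler within the same 5-second interval would get a 429 in this sketch, while logged-in users and unlisted User Agents are never throttled.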

The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit much, if not all, of the anonymous traffic if the setting is not used appropriately. To protect against this scenario, we've added a couple of new validations that run when the setting is changed:

1) each value added to the setting must be 3 characters or longer
2) each value cannot be a substring of tokens found in popular browser User Agent strings. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
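
The two validations above could look something like the sketch below. It is a hedged illustration, not the validator added by this PR: the constant and method names and the returned error symbols are hypothetical, and trimming whitespace and ignoring case are assumptions. Only the token list and the two rules come from the description above.

```ruby
# Hypothetical sketch of the two validations; names and error symbols are made up.
PROHIBITED_BROWSER_TOKENS = [
  "apple", "windows", "linux", "ubuntu", "gecko", "firefox", "chrome",
  "safari", "applewebkit", "webkit", "mozilla", "macintosh", "khtml",
  "intel", "osx", "os x", "iphone", "ipad", "mac"
].freeze

MIN_CRAWLER_VALUE_LENGTH = 3

# Returns nil when the value is acceptable, otherwise a symbol naming the failed rule.
def crawler_value_error(value)
  normalized = value.to_s.strip.downcase # assumption: trim and ignore case

  # Rule 1: each value must be 3 characters or longer.
  return :too_short if normalized.length < MIN_CRAWLER_VALUE_LENGTH

  # Rule 2: the value must not be a substring of any popular browser token.
  if PROHIBITED_BROWSER_TOKENS.any? { |token| token.include?(normalized) }
    return :matches_browser_token
  end

  nil
end

# Collects the failing values of a whole list, mapped to the rule they broke.
def crawler_list_errors(values)
  values.each_with_object({}) do |value, errors|
    error = crawler_value_error(value)
    errors[value] = error if error
  end
end
```

Under these assumptions, a value such as `moz` is rejected because it is a substring of `mozilla`, while a crawler-specific value such as `badbot` passes both rules.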
2021-11-30 12:55:25 +03:00
..
components FIX: Use CDN URL for internal onebox avatars (#15077) 2021-11-25 12:07:34 +00:00
fabricators DEV: Hash tokens stored from email_tokens (#14493) 2021-11-25 09:34:39 +02:00
fixtures FEATURE: Allow theme settings to request refresh (#15037) 2021-11-22 13:16:56 +01:00
helpers DEV: Remove xlink hrefs (#15059) 2021-11-25 15:22:43 +11:00
import_export FEATURE: Rake task to export groups (#9450) 2020-04-17 14:59:54 -07:00
initializers FEATURE: A low priority filter for the review queue. (#12822) 2021-04-23 15:34:24 -03:00
integration SECURITY: Ensure _forum_session cookies cannot be reused between sites (#14950) 2021-11-15 15:50:12 +00:00
integrity DEV: Fix a flaky Onceoff spec (#13314) 2021-06-07 20:38:31 +02:00
jobs DEV: Hash tokens stored from email_tokens (#14493) 2021-11-25 09:34:39 +02:00
lib FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131) 2021-11-30 12:55:25 +03:00
mailers DEV: Hash tokens stored from email_tokens (#14493) 2021-11-25 09:34:39 +02:00
models FEATURE: Display pending posts on user’s page 2021-11-29 10:26:33 +01:00
multisite FEATURE: Apply rate limits per user instead of IP for trusted users (#14706) 2021-11-17 23:27:30 +03:00
requests FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131) 2021-11-30 12:55:25 +03:00
script/import_scripts DEV: If disabled do not change setting after import (#12142) 2021-02-19 09:33:35 -07:00
serializers FEATURE: Display pending posts on user’s page 2021-11-29 10:26:33 +01:00
services FIX: Use CDN URL for internal onebox avatars (#15077) 2021-11-25 12:07:34 +00:00
support FEATURE: Apply rate limits per user instead of IP for trusted users (#14706) 2021-11-17 23:27:30 +03:00
tasks FIX: remove migrate_from_s3 task that silently corrupts data (#11703) 2021-01-17 22:33:29 +01:00
views/omniauth_callbacks FEATURE: Use full page redirection for all external auth methods (#8092) 2019-10-08 12:10:43 +01:00
rails_helper.rb DEV: Load fabricators for plugins automatically. (#15106) 2021-11-30 15:55:45 +11:00
swagger_helper.rb DEV: Refactor the api docs for the user endpoint (#14377) 2021-09-20 10:04:57 -06:00