discourse

Commit Graph

Author	SHA1	Message	Date
Osama Sayegh	7bd3986b21	FEATURE: Replace `Crawl-delay` directive with proper rate limiting (#15131) We have a couple of site setting, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down. When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt` which leaves the site owners no options if a crawler is crawling the site too aggressively. This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged in user, Discourse will check the User Agent string and if it contains one of the values of the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of requests made within the same interval will get a 429 response. The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit lots if not all of anonymous traffic if the setting is not used appropriately. So to protect against this scenario, we've added a couple of new validations to the setting when it's changed: 1) each value added to setting must 3 characters or longer 2) each value cannot be a substring of tokens found in popular browser User Agent. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.	2021-11-30 12:55:25 +03:00
ByteHamster	36ec09a07b	FIX: Do not block `uploads` path in robots.txt (#12349) The `/u` rule also matches the `/uploads` path, which prevents Twitter from showing the site logo in its link previews.	2021-03-11 09:36:49 -05:00
Vinoth Kannan	e3d8e828b8	FEATURE: allow search engines to index tag pages. (#12248) Previously, we blocked search engines in tag pages since they may get marked as a duplicate content. * DEV: block tag inner pages from search engines crawling.	2021-03-09 23:55:57 +05:30
Sam	758e160862	FEATURE: explicitly ban outlier traffic sources in robots.txt (#11553) Googlebot handles no-index headers very elegantly. It advises to leave as many routes as possible open and uses headers for high fidelity rules regarding indexes. Discourse adds special `x-robot-tags` noindex headers to users, badges, groups, search and tag routes. Following up on `b52143feff` we now have it so Googlebot gets special handling. Rest of the crawlers get a far more aggressive disallow list to protect against excessive crawling.	2020-12-23 08:51:14 +11:00
Daniel Waterworth	721ee36425	Replace `base_uri` with `base_path` (#10879) DEV: Replace instances of Discourse.base_uri with Discourse.base_path This is clearer because the base_uri is actually just a path prefix. This continues the work started in `555f467`.	2020-10-09 12:51:24 +01:00
Joshua Rosenfeld	1e6d125db6	FIX: Remove additional paths from robots.txt admin path is protected by guardian and thus inaccessible to crawlers Follow up to `b52143f`	2020-08-26 16:52:22 -04:00
Krzysztof Kotlarek	e0d9232259	FIX: use allowlist and blocklist terminology (#10209) This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.	2020-07-27 10:23:54 +10:00
Joshua Rosenfeld	b52143feff	FIX: Remove paths from robots.txt in favor of noindex header Google no longer supports the use of robots.txt to block indexing. See https://support.google.com/webmasters/answer/6062608 and https://support.google.com/webmasters/answer/93710 Previous commits have added the `noindex` header to appropriate pages, now we need to remove the paths from robots.txt so the pages can be crawled. Follow up to: `13f229808a` `b6765aac4b` `676be3a853` `07b728c5e5` `c94e6a9a66`	2020-06-25 13:55:06 -04:00
Osama Sayegh	6515ff19e5	FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import	2019-07-15 20:47:44 +03:00
Sam Saffron	30990006a9	DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging	2019-05-13 09:31:32 +08:00
Sam	baa72d18f8	FIX: simplify so we ban all auth paths previously plugins that have auth paths were not disallowed and robots tend to call them	2018-08-16 19:16:47 +10:00
Sam Saffron	030e322a39	FEATURE: block top level /my/ routes	2018-06-12 19:47:45 +10:00
Robin Ward	3d7dbdedc0	FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site.	2018-04-16 15:43:20 -04:00
Régis Hanol	df7970a6f6	prefix the robots.txt rules with the directory when using subfolder	2018-04-11 22:05:02 +02:00
Sam	3a7b696703	FEATURE: allow for setting crawl delay per user agent Also moved to default crawl delay bing so no more than a req every 5 seconds is allowed New site settings: "slow_down_crawler_user_agents" - list of crawlers that will be slowed down "slow_down_crawler_rate" - how many seconds to wait between requests Not enforced server side yet	2018-04-06 10:15:23 +10:00
Neil Lalonde	ced7e9a691	FEATURE: control which web crawlers can access using a whitelist or blacklist	2018-03-22 15:41:02 -04:00
Guo Xiang Tan	77d4c4d8dc	Fix all the errors to get our tests green on Rails 5.1.	2017-09-25 13:48:58 +08:00
Régis Hanol	d75cc67d86	FIX: robots.txt should be accessible even when login is required	2015-10-15 11:42:41 +02:00
Sam	e5888cf090	PERF: avoid preloading json in cases where it is not needed (uploads / avatars / non GET requests)	2015-05-20 17:12:16 +10:00
Neil Lalonde	a86b35c873	Remove the access_password site setting	2013-06-25 15:05:25 -04:00
Robin Ward	d2596c3c4c	Remove unusued site_settings, show checkbox in UI for boolean values, remove restrict_access boolean to avoid locking yourself out by setting access_password to empty string. Minor UI tweaks.	2013-03-01 14:27:41 -05:00
Gosha Arinich	cafc75b238	remove trailing whitespaces ❤️	2013-02-26 07:31:35 +03:00
Sam Saffron	1c12c91d0c	forgot to skip a filter	2013-02-11 17:14:36 +11:00
Sam Saffron	c50a9e4d01	added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false	2013-02-11 11:02:57 +11:00

24 Commits
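
The behaviour described in commit 7bd3986b21 (one request every N seconds per matched crawler, 429 for the rest, plus validation of the setting's values) can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not Discourse's actual Ruby implementation: the names `CrawlerRateLimiter`, `status_for` and `validate_crawler_values` are hypothetical, the in-memory dictionary stands in for whatever shared store a real deployment would use, and case-insensitive matching is an assumption. Only the prohibited token list and the two validation rules come from the commit message.

```python
import time

# Prohibited values as listed in the commit message for 7bd3986b21.
PROHIBITED_TOKENS = [
    "apple", "windows", "linux", "ubuntu", "gecko", "firefox", "chrome",
    "safari", "applewebkit", "webkit", "mozilla", "macintosh", "khtml",
    "intel", "osx", "os x", "iphone", "ipad", "mac",
]


def validate_crawler_values(values):
    """Return a list of error strings for a proposed setting value list.

    Hypothetical helper: mirrors the two rules from the commit message.
    """
    errors = []
    for value in values:
        lowered = value.lower()  # assumption: comparison ignores case
        if len(lowered) < 3:
            errors.append(f"{value!r} is shorter than 3 characters")
        elif any(lowered in token for token in PROHIBITED_TOKENS):
            errors.append(f"{value!r} is a substring of a common browser token")
    return errors


class CrawlerRateLimiter:
    """Allow one anonymous request every `rate_seconds` per listed crawler."""

    def __init__(self, user_agents, rate_seconds, clock=time.monotonic):
        self.user_agents = [ua.lower() for ua in user_agents]
        self.rate_seconds = rate_seconds
        self.clock = clock  # injectable so the sketch can be tested
        self._last_allowed = {}  # setting value -> time of last allowed request

    def status_for(self, user_agent, logged_in=False):
        """Return the HTTP status this sketch would give: 200 or 429."""
        if logged_in:
            return 200  # only non-logged-in requests are checked
        ua = user_agent.lower()
        for value in self.user_agents:
            if value in ua:
                now = self.clock()
                last = self._last_allowed.get(value)
                if last is not None and now - last < self.rate_seconds:
                    return 429  # still inside the interval
                self._last_allowed[value] = now
                return 200
        return 200  # User Agent not in the slow-down list
```

In this sketch the limit is keyed on the matched setting value rather than on the full User Agent string, so every client whose User Agent contains that value shares one allowance. That is one reading of "for that User Agent" in the commit message and should be treated as an assumption.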