13 Commits

Author SHA1 Message Date
Sam f66efc601d FIX: cubot android devices were detected as crawlers 2018-06-21 10:56:46 +10:00
Neil Lalonde ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Sam d7657d8e47 correct specs, ensure crawler layout only applies to html 2018-01-16 16:28:11 +11:00
Sam 7b562d2f46 FEATURE: much improved and simplified crawler detection
- phase one: does it match 'trident|webkit|gecko|chrome|safari|msie|opera'?
    yes: then it is possibly a browser

- phase two: does it match 'rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor'?
    yes: then it is probably a crawler

Based off: https://gist.github.com/SamSaffron/6cfad7ea3e6df321ffb7a84f93720a53
2018-01-16 15:41:45 +11:00
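
The two phases described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the commit: the two patterns are copied from the message, but the constant and method names are hypothetical, case-insensitive matching is an assumption, and treating an agent that fails phase one as a crawler is implied by the message rather than stated.

```ruby
# Patterns are taken verbatim from the commit message; the constant and
# method names are hypothetical, and case-insensitive matching is assumed.
POSSIBLE_BROWSER = /trident|webkit|gecko|chrome|safari|msie|opera/i
PROBABLE_CRAWLER = /rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor/i

def crawler?(user_agent)
  ua = user_agent.to_s

  # Phase one: an agent with no browser engine token is assumed not to be
  # a browser (an inference from the message, not something it states).
  return true unless ua.match?(POSSIBLE_BROWSER)

  # Phase two: a browser-like agent that also contains a crawler keyword
  # is probably a crawler.
  ua.match?(PROBABLE_CRAWLER)
end
```

Under these assumptions, `crawler?("curl/7.58.0")` is true because phase one fails, while a typical desktop Chrome user agent passes phase one, contains none of the phase two keywords, and is treated as a browser. Broad substrings such as `bot` can also match device names, which is the kind of false positive the later cubot fix in this log addresses.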
Sam f6fdc1ebe8 FEATURE: flexible crawler detection
You can use the crawler user agents site setting to amend what user agents
are considered crawlers based on a string match in the user agent

Also improves performance of crawler detection slightly
2017-09-29 12:31:50 +10:00
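
A setting-driven string match of the kind this commit describes might look like the sketch below. It is not the code from the commit: the helper name, the pipe-delimited setting format, and the sample values are all assumptions made for illustration. Compiling the list into a single regular expression once, rather than scanning it entry by entry on every request, is one plausible way to get the small performance gain the message mentions.

```ruby
# Hypothetical helper: turns a pipe-delimited setting value (an assumed
# format) into one regular expression that matches any listed substring.
def build_crawler_matcher(setting_value)
  agents = setting_value.to_s.split("|").map(&:strip).reject(&:empty?)
  # Regexp.union escapes plain strings, so entries match literally.
  Regexp.union(agents)
end

# Sample value is illustrative only, not the real default of the setting.
matcher = build_crawler_matcher("Googlebot|HTTrack|YandexBot")
```

With this sketch, `matcher.match?(user_agent)` is true whenever the user agent contains one of the listed strings, and an empty setting yields a pattern that matches nothing.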
mcmcclur a307ad6517 Update crawler_detection.rb
Add HTTrack to the list of detected crawlers so that Discourse will serve vanilla HTML per https://meta.discourse.org/t/a-basic-discourse-archival-tool/62614/25
2017-05-16 11:17:05 -04:00
Robin Ward 2a4006fe0c Add `YandexBot` to our list of crawlers 2016-07-26 13:21:37 -04:00
Jeff Atwood bbb1348118 add Swiftbot to crawler regex 2015-05-02 03:18:58 -07:00
Erick Guan 026cdd8fc3 FEATURE: add 360Spider UA to allow 360 crawl Discourse sites 2015-03-16 22:58:33 +08:00
Jeff Atwood ceef06e771 add support for "Save Page Now" archive.org/web 2015-01-06 01:05:45 -08:00
riking 37dbc4b5e6 Add archive.org to crawler list to serve no-js to 2014-11-02 16:51:23 -08:00
Vikhyat Korrapati e3702ecb30 Improved crawler detection: add Twitterbot, Facebook, curl, Bing, Baidu. 2014-03-16 19:30:20 +05:30
Robin Ward c4b5455c21 REFACTOR: Rename `GooglebotDetection` to `CrawlerDetection` because we
will likely whitelist more crawlers in the future.
2014-02-20 16:07:02 -05:00