Commit Graph

19 Commits

Author SHA1 Message Date
David Taylor 6417173082
DEV: Apply syntax_tree formatting to `lib/*` 2023-01-09 12:10:19 +00:00
Dan Ungureanu 4e46732346
FEATURE: Implement browser update in crawler view (#12448)
The browser-update script does not work correctly in some very old browsers
because the contents of <noscript> are not accessible from JavaScript.
For these browsers, the server can display the crawler page and add the
browser update notice.

Simply loading the browser-update script in the crawler view is not a
solution because that means all crawlers will also see it.
2021-03-22 19:41:42 +02:00
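A minimal sketch of the idea in that commit, assuming old browsers are identified server-side by user agent; the regex and method name below are illustrative assumptions, not the actual Discourse implementation:

```ruby
# frozen_string_literal: true

# Hypothetical sketch: serve the crawler (no-JS) layout plus a browser
# update notice to very old browsers, since a client-side <noscript>
# hint cannot reach them reliably. The regex is an assumption, not a
# real Discourse site setting.
OLD_BROWSERS = /MSIE [1-9]\.|Trident\/[1-6]\./

def show_browser_update?(user_agent)
  !!(user_agent =~ OLD_BROWSERS)
end

show_browser_update?("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)") # => true
show_browser_update?("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/96 Safari/537.36") # => false
```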
Krzysztof Kotlarek e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This PR renames whitelist to allowlist and blacklist to blocklist.
2020-07-27 10:23:54 +10:00
Dan Ungureanu 3ed6a0e904
FIX: Detect Wayback Machine using user agent (#9777) 2020-05-14 21:10:07 +10:00
Maja Komel 42809f4d69 FIX: use crawler layout when saving url in Wayback Machine (#7667) 2019-06-03 12:13:32 +10:00
Sam Saffron 30990006a9 DEV: enable frozen string literal on all files
This reduces the chance of errors where consumers of strings mutate inputs,
and it reduces the memory usage of the app.

The test suite passes now, but there may be some issues left, so we will run
a few sites on a branch prior to merging.
2019-05-13 09:31:32 +08:00
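For reference, the mechanism behind that change is Ruby's frozen-string-literal magic comment; a minimal illustration (not taken from the Discourse codebase):

```ruby
# frozen_string_literal: true

# With the magic comment above, every string literal in this file is
# frozen: accidental in-place mutation raises instead of silently
# changing shared state, and the VM can deduplicate identical literals,
# which is where the memory saving comes from.
greeting = "hello"

begin
  greeting << " world"        # in-place mutation of a frozen literal
rescue FrozenError => e
  puts "caught: #{e.message}" # => caught: can't modify frozen String: "hello"
end

puts greeting.frozen? # => true
```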
Sam f66efc601d FIX: Cubot Android devices were detected as crawlers 2018-06-21 10:56:46 +10:00
Neil Lalonde ced7e9a691 FEATURE: control which web crawlers can access the site using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Sam d7657d8e47 correct specs, ensure crawler layout only applies to html 2018-01-16 16:28:11 +11:00
Sam 7b562d2f46 FEATURE: much improved and simplified crawler detection
- Phase one: does it match 'trident|webkit|gecko|chrome|safari|msie|opera'?
    If yes, it is possibly a browser.

- Phase two: does it match 'rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor'?
    If yes, it is probably a crawler (see the sketch after this entry).

Based off: https://gist.github.com/SamSaffron/6cfad7ea3e6df321ffb7a84f93720a53
2018-01-16 15:41:45 +11:00
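A minimal sketch of the two-phase check described in that commit; the module and method names are assumptions for illustration, not the actual Discourse `CrawlerDetection` code:

```ruby
# frozen_string_literal: true

# Illustrative only: phase one asks whether the UA even looks like a
# browser engine; phase two asks whether it carries crawler hints.
module CrawlerSketch
  BROWSER_ENGINES = /trident|webkit|gecko|chrome|safari|msie|opera/i
  CRAWLER_HINTS   = /rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor/i

  def self.crawler?(user_agent)
    return true if user_agent.nil?                    # no UA at all: treat as a crawler
    return true unless user_agent =~ BROWSER_ENGINES  # does not even claim to be a browser
    !!(user_agent =~ CRAWLER_HINTS)                   # browser-like, but carries crawler hints
  end
end

CrawlerSketch.crawler?("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") # => true
CrawlerSketch.crawler?("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/96 Safari/537.36") # => false
```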
Sam f6fdc1ebe8 FEATURE: flexible crawler detection
You can use the `crawler user agents` site setting to amend which user agents
are considered crawlers, based on a string match against the user agent.

This also slightly improves the performance of crawler detection.
2017-09-29 12:31:50 +10:00
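A rough sketch of how such a setting could drive detection, assuming it is stored as a pipe-delimited string; the setting format and function name are assumptions, not the actual Discourse implementation:

```ruby
# frozen_string_literal: true

# Illustrative only: assumes the "crawler user agents" site setting is a
# pipe-delimited string such as "rss|bot|spider|crawler|wayback".
def crawler_from_setting?(user_agent, setting)
  return false if user_agent.nil? || setting.to_s.empty?
  # Compile the setting into a single case-insensitive regex so detection
  # is one match rather than many substring scans.
  pattern = Regexp.new(Regexp.union(setting.split("|").map(&:strip)).source, Regexp::IGNORECASE)
  !!(user_agent =~ pattern)
end

crawler_from_setting?("WaybackArchiver/1.0", "rss|bot|spider|crawler|wayback") # => true
crawler_from_setting?("Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0", "rss|bot|spider|crawler|wayback") # => false
```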
mcmcclur a307ad6517 Update crawler_detection.rb
Add HTTrack to the list of detected crawlers so that Discourse will serve vanilla HTML per https://meta.discourse.org/t/a-basic-discourse-archival-tool/62614/25
2017-05-16 11:17:05 -04:00
Robin Ward 2a4006fe0c Add `YandexBot` to our list of crawlers 2016-07-26 13:21:37 -04:00
Jeff Atwood bbb1348118 add Swiftbot to crawler regex 2015-05-02 03:18:58 -07:00
Erick Guan 026cdd8fc3 FEATURE: add 360Spider UA to allow 360 crawl Discourse sites 2015-03-16 22:58:33 +08:00
Jeff Atwood ceef06e771 add support for "Save Page Now" archive.org/web 2015-01-06 01:05:45 -08:00
riking 37dbc4b5e6 Add archive.org to the crawler list so it is served the no-js version 2014-11-02 16:51:23 -08:00
Vikhyat Korrapati e3702ecb30 Improved crawler detection: add Twitterbot, Facebook, curl, Bing, Baidu. 2014-03-16 19:30:20 +05:30
Robin Ward c4b5455c21 REFACTOR: Rename `GooglebotDetection` to `CrawlerDetection` because we
will likely whitelist more crawlers in the future.
2014-02-20 16:07:02 -05:00