discourse/lib/retrieve_title.rb

# frozen_string_literal: true

module RetrieveTitle
  CRAWL_TIMEOUT = 1

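  # Returns the page title for `url`, or nil when the fetch fails.
  # Failures are logged, except in the test environment, where they are re-raised.
  def self.crawl(url)
    fetch_title(url)
  rescue Exception => ex
    raise if Rails.env.test?
    Rails.logger.error(ex)
    nil
  end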

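  # Extracts a page title from `html`, preferring <title> over og:title.
  # Returns nil when no usable title is found.
  def self.extract_title(html, encoding = nil)
    title = nil
    # A <title> tag has opened but not yet closed, so the HTML read so far
    # is incomplete; report no title until more data arrives.
    if html =~ /<title>/ && html !~ /<\/title>/
      return nil
    end
    if doc = Nokogiri::HTML5(html, nil, encoding)
      title = doc.at('title')&.inner_text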

      # A horrible hack: YouTube populates the title via `document.title`, so
      # the static <title> is just "YouTube". For any site other than YouTube
      # this wouldn't be worth it.
      if title == "YouTube" && html =~ /document\.title *= *"(.*)";/
        title = Regexp.last_match[1].sub(/ - YouTube$/, '')
      end

      if !title && node = doc.at('meta[property="og:title"]')
        title = node['content']
      end
    end

    if title.present?
      title.gsub!(/\n/, ' ')
      title.gsub!(/ +/, ' ')
      title.strip!
      return title
    end
    nil
  end

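  # Note: a bare `private` does not apply to methods defined with `def self.`,
  # so the helpers below remain publicly callable.
  private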

  # Returns how many kilobytes of the response to read before giving up.
  def self.max_chunk_size(uri)
    # Exceptions for sites that place the title very late in the document.
    return 500 if uri.host =~ /(^|\.)amazon\.(com|ca|co\.uk|es|fr|de|it|com\.au|com\.br|cn|in|co\.jp|com\.mx)$/
    return 300 if uri.host =~ /(^|\.)youtube\.com$/ || uri.host =~ /(^|\.)youtu\.be$/
    return 50 if uri.host =~ /(^|\.)github\.com$/

    # Default is 20 KB.
    20
  end

  # Fetch the beginning of an HTML document at a URL
  def self.fetch_title(url)
    fd = FinalDestination.new(url, timeout: CRAWL_TIMEOUT, stop_at_blocked_pages: true)

    current = nil
    title = nil
    encoding = nil

    fd.get do |response, chunk, uri|
      unless Net::HTTPRedirection === response
        if current
          current << chunk
        else
          current = chunk
        end

        if !encoding && content_type = response['content-type']&.strip&.downcase
          if content_type =~ /charset="?([a-z0-9_-]+)"?/
            encoding = Regexp.last_match(1)
            if !Encoding.list.map(&:name).map(&:downcase).include?(encoding)
              encoding = nil
            end
          end
        end

        # Stop streaming once a title is found or the size limit is exceeded.
        max_size = max_chunk_size(uri) * 1024
        title = extract_title(current, encoding)
        throw :done if title || max_size < current.length
      end
    end
    title
  end
end
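
The string-handling steps in the module above (the per-host download budget, charset detection from the Content-Type header, and title whitespace cleanup) can be tried without Rails or Nokogiri. What follows is a minimal sketch under that assumption: `TitleHelpersSketch`, `budget_bytes`, `charset_from`, and `normalize_title` are hypothetical names introduced for illustration and are not part of Discourse.

```ruby
require "uri"

# Hypothetical stand-alone module; it only mirrors the logic shown above.
module TitleHelpersSketch
  # Per-host download budgets in kilobytes, copied from max_chunk_size.
  HOST_BUDGETS_KB = [
    [/(^|\.)amazon\.(com|ca|co\.uk|es|fr|de|it|com\.au|com\.br|cn|in|co\.jp|com\.mx)$/, 500],
    [/(^|\.)youtube\.com$/, 300],
    [/(^|\.)youtu\.be$/, 300],
    [/(^|\.)github\.com$/, 50],
  ].freeze
  DEFAULT_BUDGET_KB = 20

  # Returns the download budget in bytes for the host of `url`.
  def self.budget_bytes(url)
    host = URI.parse(url).host.to_s
    _, kb = HOST_BUDGETS_KB.find { |pattern, _| host.match?(pattern) }
    (kb || DEFAULT_BUDGET_KB) * 1024
  end

  # Returns the charset named in a Content-Type header, or nil when the header
  # is missing, has no charset, or names an encoding Ruby does not list.
  def self.charset_from(content_type)
    return nil unless content_type
    match = content_type.strip.downcase.match(/charset="?([a-z0-9_-]+)"?/)
    return nil unless match
    name = match[1]
    Encoding.list.map { |e| e.name.downcase }.include?(name) ? name : nil
  end

  # Replaces newlines with spaces, collapses runs of spaces, and trims.
  # Returns nil for nil or blank input.
  def self.normalize_title(raw)
    return nil if raw.nil? || raw.strip.empty?
    raw.gsub(/\n/, " ").gsub(/ +/, " ").strip
  end
end
```

For example, `TitleHelpersSketch.budget_bytes("https://youtu.be/abc")` returns 307200 (300 KB), while a host that is not listed falls back to 20480 (20 KB).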

Commit history from the blame view (oldest first):

2017-07-21 15:29:04 -04:00  FEATURE: Whitelists for inline oneboxing

2017-08-02 14:27:21 -04:00  FEATURE: option to enable inline oneboxes for all domains
    Also, change to prefer title over open graph which is often way too sparse

2017-09-28 09:29:50 -04:00  FIX: Hack our title retriever so that it parses YouTube URLs

2018-01-28 23:36:52 -05:00  PERF: ability to crawl for titles without extra HEAD req
    Also, introduces a much more aggressive timeout for title crawling and
    introduces gzip to body that is crawled

2018-06-07 01:28:18 -04:00  Make rubocop happy again.

2019-05-02 18:17:27 -04:00  DEV: enable frozen string literal on all files
    This reduces chances of errors where consumers of strings mutate inputs
    and reduces memory usage of the app. Test suite passes now, but there may
    be some stuff left, so we will run a few sites on a branch prior to merging

2019-11-14 15:10:51 -05:00  DEV: Apply Rubocop redundant return style

2021-01-04 14:32:08 -05:00  FIX: Inline Onebox should use encoding from Content-Type header when present (#11625)
    * FIX: Inline onebox should use encoding from Content-Type header when present
    * Use Regexp.last_match(1)
    Signed-off-by: OsamaSayegh <asooomaasoooma90@gmail.com>

2021-06-24 10:23:39 -04:00  FIX: follow redirects for inline/mini onebox (#13512)

2021-09-03 03:45:58 -04:00  FIX: increase chunk size to fetch title tag correctly (#14144)

2022-02-09 16:53:27 -05:00  FIX: inline onebox for github (#15859)
    Increase size of downloaded HTML for Github when getting title for inline Onebox.

2022-03-11 01:18:12 -05:00  FIX: Apply onebox blocked domain checks on every redirect (#16150)
    The `blocked onebox domains` setting lets site owners change what sites
    are allowed to be oneboxed. When a link is entered into a post, Discourse
    checks the domain of the link against that setting and blocks the onebox
    if the domain is blocked. But if there's a chain of redirects, then only
    the final destination website is checked against the site setting.

    This commit amends that behavior so that every website in the redirect
    chain is checked against the site setting, and if anything is blocked the
    original link doesn't onebox at all in the post. The `Discourse-No-Onebox`
    header is also checked in every response and the onebox is blocked if the
    header is set to "1".

    Additionally, Discourse will now include the `Discourse-No-Onebox` header
    with every response if the site requires login to access content. This is
    done to signal to a Discourse instance that it shouldn't attempt to onebox
    other Discourse instances if they're login-only. Non-Discourse websites
    can also include that header if they don't wish to have Discourse onebox
    their content.

    Internal ticket: t59305.

2022-03-11 15:53:10 -05:00  FIX: return nil when RetrieveTitle.crawl fails (#16167)
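
The streaming pattern in `fetch_title` (append each chunk, retry the title extraction, and stop early with `throw :done`) can be illustrated in plain Ruby. This is a rough sketch, not Discourse code: `StreamingTitleSketch`, `each_chunk_sketch`, `title_from`, and `scan` are hypothetical names, `each_chunk_sketch` merely stands in for `FinalDestination#get`, and a regular expression stands in for the Nokogiri parse.

```ruby
# Hypothetical sketch of the accumulate-and-stop loop; names are illustrative.
module StreamingTitleSketch
  # Returns the <title> text once both the opening and closing tags have
  # arrived; returns nil while the title is missing or still incomplete.
  def self.title_from(html)
    return nil if html =~ /<title>/ && html !~ /<\/title>/
    match = html.match(/<title[^>]*>(.*?)<\/title>/m)
    return nil unless match
    text = match[1].gsub(/\s+/, " ").strip
    text.empty? ? nil : text
  end

  # Feeds chunks to the block and stops as soon as the block throws :done.
  def self.each_chunk_sketch(chunks)
    catch(:done) do
      chunks.each { |chunk| yield chunk }
    end
  end

  # Accumulates chunks until a title is found or the size limit is exceeded.
  # Returns [title, number_of_chunks_read].
  def self.scan(chunks, max_size:)
    buffer = +""
    title = nil
    read = 0
    each_chunk_sketch(chunks) do |chunk|
      buffer << chunk
      read += 1
      title = title_from(buffer)
      throw :done if title || max_size < buffer.length
    end
    [title, read]
  end
end
```

With the chunks `["<html><head><tit", "le>Hello ", "World</title></head>", "<body>never read</body>"]` and a generous `max_size`, `scan` returns the title after the third chunk and never reads the fourth, which is the point of checking after every chunk instead of downloading the whole page.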