discourse/lib/retrieve_title.rb

# frozen_string_literal: true

module RetrieveTitle
  CRAWL_TIMEOUT = 1

  def self.crawl(url)
    fetch_title(url)
  rescue StandardError
    # Connection errors and timeouts are swallowed; the caller receives nil.
  end

  def self.extract_title(html)
    title = nil
    if doc = Nokogiri::HTML(html)

      title = doc.at('title')&.inner_text

      # A horrible hack: YouTube serves a generic <title> and populates the real
      # one from JavaScript via `document.title`. This would not be worth doing
      # for any site other than YouTube.
      if title == "YouTube" && html =~ /document\.title *= *"(.*)";/
        title = Regexp.last_match[1].sub(/ - YouTube$/, '')
      end

      if !title && node = doc.at('meta[property="og:title"]')
        title = node['content']
      end
    end

    if title.present?
      title.gsub!(/\n/, ' ')
      title.gsub!(/ +/, ' ')
      title.strip!
      return title
    end
    nil
  end

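  # NOTE: `private` has no effect on methods defined with `def self.`, so the
  # helpers below are internal by convention only.
  private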

  def self.max_chunk_size(uri)

    # Sizes are in kilobytes; the caller multiplies the result by 1024.
    # Amazon and YouTube emit the title very late in the document, so they get
    # a larger limit. Per-site exceptions are bad, but these are large sites.
    return 500 if uri.host =~ /amazon\.(com|ca|co\.uk|es|fr|de|it|com\.au|com\.br|cn|in|co\.jp|com\.mx)$/
    return 300 if uri.host =~ /youtube\.com$/ || uri.host =~ /youtu\.be$/

    # Default is 10 KB.
    10
  end

  # Fetch the beginning of an HTML document at a URL
  def self.fetch_title(url)
    fd = FinalDestination.new(url, timeout: CRAWL_TIMEOUT)

    current = nil
    title = nil

    fd.get do |_response, chunk, uri|

      if current
        current << chunk
      else
        current = chunk
      end

      max_size = max_chunk_size(uri) * 1024
      title = extract_title(current)
      # Stop streaming once a title is found or the size limit is exceeded.
      throw :done if title || max_size < current.length
    end
    title
  end
end