# frozen_string_literal: true

module DiscourseAi
  module Tokenizer
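    # Abstract base class: subclasses only implement .tokenizer and inherit the
    # shared tokenize/size/truncate/can_expand_tokens? helpers below.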
    class BasicTokenizer
      class << self
        def tokenizer
          raise NotImplementedError
        end

        def tokenize(text)
          tokenizer.encode(text).tokens
        end

        def size(text)
          tokenize(text).size
        end

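        # Truncates text so it encodes to at most max_length tokens.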
        def truncate(text, max_length)
          # Fast track the common case where the text is already short enough.
          return text if text.size < max_length

          tokenizer.decode(tokenizer.encode(text).ids.take(max_length))
        end

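        # Whether addition can be appended to text while keeping the combined
        # token count under max_length.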
        def can_expand_tokens?(text, addition, max_length)
          return true if text.size + addition.size < max_length

          tokenizer.encode(text).ids.length + tokenizer.encode(addition).ids.length < max_length
        end
      end
    end

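    # Each tokenizer below is loaded from a definition file bundled with the
    # plugin and memoized in a class variable on first use.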
    class BertTokenizer < BasicTokenizer
      def self.tokenizer
        @@tokenizer ||=
          Tokenizers.from_file("./plugins/discourse-ai/tokenizers/bert-base-uncased.json")
      end
    end

    class AnthropicTokenizer < BasicTokenizer
      def self.tokenizer
        @@tokenizer ||=
          Tokenizers.from_file("./plugins/discourse-ai/tokenizers/claude-v1-tokenization.json")
      end
    end

    class AllMpnetBaseV2Tokenizer < BasicTokenizer
      def self.tokenizer
        @@tokenizer ||=
          Tokenizers.from_file("./plugins/discourse-ai/tokenizers/all-mpnet-base-v2.json")
      end
    end

    class Llama2Tokenizer < BasicTokenizer
      def self.tokenizer
        @@tokenizer ||=
          Tokenizers.from_file("./plugins/discourse-ai/tokenizers/llama-2-70b-chat-hf.json")
      end
    end

    class MultilingualE5LargeTokenizer < BasicTokenizer
      def self.tokenizer
        @@tokenizer ||=
          Tokenizers.from_file("./plugins/discourse-ai/tokenizers/multilingual-e5-large.json")
      end
    end

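    # OpenAI models are tokenized with tiktoken, whose encode returns plain
    # token ids, so the helpers that deal with ids are overridden here.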
    class OpenAiTokenizer < BasicTokenizer
      class << self
        def tokenizer
          @@tokenizer ||= Tiktoken.get_encoding("cl100k_base")
        end

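        # tiktoken's encode already returns ids, so tokenize is the id array itself.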
        def tokenize(text)
          tokenizer.encode(text)
        end

        def truncate(text, max_length)
          # Fast track the common case where the text is already short enough.
          return text if text.size < max_length

          tokenizer.decode(tokenize(text).take(max_length))
        rescue Tiktoken::UnicodeError
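          # Truncation can split a multi-byte character, making the decoded text
          # invalid; drop one more token and retry.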
          max_length = max_length - 1
          retry
        end

        def can_expand_tokens?(text, addition, max_length)
          return true if text.size + addition.size < max_length

          tokenizer.encode(text).length + tokenizer.encode(addition).length < max_length
        end
      end
    end
  end
end