discourse-ai/bge_m3_tokenizer.rb at a1eb1ba18028f8bcad0363778b56d3dd60ff43e7 - discourse-ai - iSharkFly SRC

Discource-C/discourse-ai

mirror of https://github.com/discourse/discourse-ai.git synced 2025-02-06 19:48:15 +00:00

Rafael dos Santos Silva eb93b21769

FEATURE: Add BGE-M3 embeddings support (#569)

BAAI/bge-m3 is an interesting multilingual model with a context size of
8192 tokens. Even with a 16x larger context, it is only 4x slower to
compute its embeddings in the worst-case scenario.

Also includes a minor refactor of the rake task, including setting model
and concurrency levels when running the backfill task.

2024-04-10 17:24:01 -03:00

12 lines

258 B

Ruby


# frozen_string_literal: true

module DiscourseAi
  module Tokenizer
    class BgeM3Tokenizer < BasicTokenizer
      # Load the BGE-M3 tokenizer definition once and reuse it on later calls.
      def self.tokenizer
        @@tokenizer ||= Tokenizers.from_file("./plugins/discourse-ai/tokenizers/bge-m3.json")
      end
    end
  end
end
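
The class above only supplies a memoized `tokenizer`; everything else is inherited from `BasicTokenizer`, which is not shown on this page. The following is a minimal, self-contained sketch of that pattern, not the plugin's actual code. `FakeWhitespaceTokenizer`, `SketchBasicTokenizer`, `SketchBgeM3Tokenizer`, `MAX_TOKENS`, and `load_count` are all hypothetical names introduced for illustration. The helper methods `size` and `truncate` are an assumption about what a base class might build on top of `tokenizer`, and the whitespace splitter stands in for the object that `Tokenizers.from_file` would return, so the example runs without the `tokenizers` gem or the `bge-m3.json` file.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for the object returned by Tokenizers.from_file.
# It splits on whitespace instead of using the real BGE-M3 vocabulary.
class FakeWhitespaceTokenizer
  Encoded = Struct.new(:tokens)

  def encode(text)
    Encoded.new(text.split)
  end

  def decode(tokens)
    tokens.join(" ")
  end
end

# Assumed shape of a base class: helpers that rely only on `tokenizer`.
class SketchBasicTokenizer
  def self.tokenizer
    raise NotImplementedError, "subclasses must provide a tokenizer"
  end

  # Number of tokens the underlying tokenizer produces for the text.
  def self.size(text)
    tokenizer.encode(text).tokens.size
  end

  # Return the text unchanged if it fits, otherwise keep the first
  # max_length tokens and decode them back into a string.
  def self.truncate(text, max_length)
    tokens = tokenizer.encode(text).tokens
    return text if tokens.size <= max_length

    tokenizer.decode(tokens.take(max_length))
  end
end

class SketchBgeM3Tokenizer < SketchBasicTokenizer
  # Context size quoted in the commit message.
  MAX_TOKENS = 8192

  # Counts how many times the (fake) tokenizer is actually built.
  @@load_count = 0

  def self.load_count
    @@load_count
  end

  # Same memoization idiom as the real class: build once, reuse afterwards.
  def self.tokenizer
    @@tokenizer ||= begin
      @@load_count += 1
      FakeWhitespaceTokenizer.new
    end
  end
end

text = "one two three four five"
puts SketchBgeM3Tokenizer.size(text)        # prints 5
puts SketchBgeM3Tokenizer.truncate(text, 3) # prints "one two three"
puts SketchBgeM3Tokenizer.load_count        # prints 1
```

The last line shows the point of the `||=` idiom: the tokenizer is constructed on first use and shared by every later call, which matters when the real definition file is large and slow to parse.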