discourse-ai/lib/tokenizer
Rafael dos Santos Silva eb93b21769
FEATURE: Add BGE-M3 embeddings support (#569)
BAAI/bge-m3 is an interesting model: it is multilingual and has a
context size of 8192. Even with a 16x larger context, it is only 4x slower
to compute its embeddings in the worst-case scenario.

Also includes a minor refactor of the rake task, which now allows setting
the model and concurrency level when running the backfill task.
2024-04-10 17:24:01 -03:00
all_mpnet_base_v2_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
anthropic_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
basic_tokenizer.rb FIX: Handle unicode on tokenizer (#515) 2024-03-14 17:33:30 -03:00
bert_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
bge_large_en_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
bge_m3_tokenizer.rb FEATURE: Add BGE-M3 embeddings support (#569) 2024-04-10 17:24:01 -03:00
llama2_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
mixtral_tokenizer.rb Mixtral (#376) 2023-12-26 14:49:55 -03:00
multilingual_e5_large_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
open_ai_tokenizer.rb FIX: Handle unicode on tokenizer (#515) 2024-03-14 17:33:30 -03:00