discourse-ai/lib/tokenizer
Rafael dos Santos Silva 3b8f900486
FIX: Handle unicode on tokenizer (#515)

Our fast-track code broke when strings contained characters that take up more
tokens than they do characters in UTF-8.

Admins can set `DISCOURSE_AI_STRICT_TOKEN_COUNTING: true` in app.yml to force strict token counting, at the cost of slower processing.
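
To illustrate the failure mode described above, here is a minimal sketch, not the plugin's actual code. `ByteTokenizer`, `naive_truncate`, `truncate`, and the `strict:` keyword are hypothetical names introduced for this example. The toy tokenizer emits one token per UTF-8 byte, so any multi-byte character costs more tokens than characters, which is enough to defeat a fast track that compares character count against the token limit.

```ruby
# Hypothetical toy tokenizer: one token per UTF-8 byte. Real tokenizers
# differ, but they share the property that a single character can cost
# more than one token.
module ByteTokenizer
  def self.encode(text)
    text.bytes
  end

  def self.decode(tokens)
    # Cutting mid-character leaves invalid bytes; scrub drops them.
    tokens.pack("C*").force_encoding(Encoding::UTF_8).scrub("")
  end
end

# Naive fast track: skips tokenizing when the character count fits,
# wrongly assuming a character never costs more than one token.
def naive_truncate(text, max_tokens)
  return text if text.size <= max_tokens
  ByteTokenizer.decode(ByteTokenizer.encode(text).take(max_tokens))
end

# Safer variant (an assumption, valid for this byte-level toy tokenizer):
# the fast track compares bytes, an upper bound on the token count here.
# strict: true always tokenizes, which is slower but exact.
def truncate(text, max_tokens, strict: false)
  return text if !strict && text.bytesize <= max_tokens
  ByteTokenizer.decode(ByteTokenizer.encode(text).take(max_tokens))
end

text = "日本語" # 3 characters, 9 bytes, so 9 tokens for the toy tokenizer

over_budget = naive_truncate(text, 5) # returned untouched, still 9 tokens
within_budget = truncate(text, 5)     # cut down to fit the 5-token budget

puts ByteTokenizer.encode(over_budget).size
puts ByteTokenizer.encode(within_budget).size
```

The byte-count bound only holds for tokenizers that never produce more tokens than input bytes; where that cannot be assumed, always tokenizing (the `strict: true` path in this sketch) is the conservative choice.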


Co-authored-by: wozulong <sidle.pax_0e@icloud.com>
2024-03-14 17:33:30 -03:00
all_mpnet_base_v2_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
anthropic_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
basic_tokenizer.rb FIX: Handle unicode on tokenizer (#515) 2024-03-14 17:33:30 -03:00
bert_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
bge_large_en_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
llama2_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
mixtral_tokenizer.rb Mixtral (#376) 2023-12-26 14:49:55 -03:00
multilingual_e5_large_tokenizer.rb DEV: port directory structure to Zeitwerk (#319) 2023-11-29 15:17:46 +11:00
open_ai_tokenizer.rb FIX: Handle unicode on tokenizer (#515) 2024-03-14 17:33:30 -03:00