discourse-ai/tokenizers
Rafael dos Santos Silva eb93b21769
FEATURE: Add BGE-M3 embeddings support (#569)
BAAI/bge-m3 is an interesting multilingual model with a context size of
8192 tokens. Even with a 16x larger context, it is only 4x slower to
compute its embeddings in the worst-case scenario.

Also includes a minor refactor of the rake task, including setting model
and concurrency levels when running the backfill task.
2024-04-10 17:24:01 -03:00
..
| Name | Last commit | Date |
| --- | --- | --- |
| Apache License | Refinements to embeddings and tokenizers (#61) | 2023-05-15 15:10:42 -03:00 |
| MIT License | Refinements to embeddings and tokenizers (#61) | 2023-05-15 15:10:42 -03:00 |
| README.md | FEATURE: Add BGE-M3 embeddings support (#569) | 2024-04-10 17:24:01 -03:00 |
| all-mpnet-base-v2.json | FIX: Disable truncation and padding in all-mpnet-base-v2 tokenizer (#105) | 2023-07-13 21:09:46 -03:00 |
| bert-base-uncased.json | FEATURE: Add a basic tokenizer API (#37) | 2023-04-19 11:55:59 -03:00 |
| bge-large-en.json | FEATURE: Bge-large-en embeddings via Cloudflare Workers AI API (#241) | 2023-10-04 13:47:51 -03:00 |
| bge-m3.json | FEATURE: Add BGE-M3 embeddings support (#569) | 2024-04-10 17:24:01 -03:00 |
| claude-v1-tokenization.json | Refinements to embeddings and tokenizers (#61) | 2023-05-15 15:10:42 -03:00 |
| llama-2-70b-chat-hf.json | FEATURE: Llama2 for summarization (#116) | 2023-07-27 13:55:32 -03:00 |
| mixtral.json | Mixtral (#376) | 2023-12-26 14:49:55 -03:00 |
| multilingual-e5-large.json | FEATURE: Support for locally infered embeddings in 100 languages (#115) | 2023-07-27 15:50:03 -03:00 |

README.md

bert-base-uncased.json

Licensed under Apache License

claude-v1-tokenization.json

Licensed under MIT License

all-mpnet-base-v2.json

Licensed under Apache License

llama-2-70b-chat-hf

Licensed under LLAMA 2 COMMUNITY LICENSE AGREEMENT

multilingual-e5-large

Licensed under MIT License

bge-large-en

Licensed under MIT License

mixtral

Licensed under Apache 2.0 License

bge-m3

Licensed under MIT License