11 Commits

Author SHA1 Message Date
Roman Rizzi 4ba74511c2
FIX: Make sure limits are updated and applied on each step (#1002) 2024-12-05 10:31:39 -03:00
Rafael dos Santos Silva 0d3e6b2726
FIX: Fix ordering of random post embeddings backfill (#965)
* FIX: Fix ordering of random post embeddings backfill

* fix annotations

---------

Co-authored-by: Roman Rizzi <rizziromanalejandro@gmail.com>
2024-11-27 17:01:54 -03:00
Roman Rizzi ef07fcb308
FIX: Skip records without content to classify (#960) 2024-11-26 15:54:20 -03:00
Roman Rizzi ddf2bf7034
DEV: Backfill embeddings concurrently. (#941)
We are adding a new method for generating and storing embeddings in bulk, which relies on `Concurrent::Promises::Future`. Generating an embedding consists of three steps:

1. Prepare text
2. HTTP call to retrieve the vector
3. Save to DB

Each one is executed independently on whatever thread the pool gives us.

We use a custom thread pool instead of the global executor because we want to control how many threads we spawn and so limit concurrency. This also avoids firing thousands of HTTP requests at once when working with large batches.
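
A minimal sketch of the bounded-concurrency idea. The commit relies on `Concurrent::Promises::Future` from concurrent-ruby; this sketch swaps in plain Ruby threads and a `Queue` so it runs with the standard library alone, and it runs the three steps back to back inside one worker rather than scheduling each step separately. The helper names (`prepare_text`, `fetch_vector`, `save_embedding`, `backfill_concurrently`) are hypothetical, and the HTTP call is replaced by a fake vector derived from a digest.

```ruby
require "digest"

# Step 1 (hypothetical): normalize the text before embedding it.
def prepare_text(record)
  record[:raw].strip.downcase
end

# Step 2 (hypothetical): stand-in for the HTTP call that returns a vector.
def fetch_vector(text)
  Digest::SHA256.digest(text).bytes.first(4).map { |b| b / 255.0 }
end

# Step 3 (hypothetical): stand-in for the DB write, guarded by a mutex.
def save_embedding(store, mutex, record, vector)
  mutex.synchronize { store[record[:id]] = vector }
end

# A fixed number of workers drains the queue, so at most pool_size
# "HTTP calls" are in flight regardless of how large the batch is.
def backfill_concurrently(records, store, pool_size: 4)
  queue = Queue.new
  records.each { |record| queue << record }
  queue.close # pop returns nil once the closed queue is empty

  mutex = Mutex.new
  workers = Array.new(pool_size) do
    Thread.new do
      while (record = queue.pop)
        text = prepare_text(record)
        vector = fetch_vector(text)
        save_embedding(store, mutex, record, vector)
      end
    end
  end
  workers.each(&:join)
  store
end

records = (1..20).map { |i| { id: i, raw: "  Post number #{i} " } }
store = backfill_concurrently(records, {}, pool_size: 3)
puts store.size
```

The pool size is the only knob that bounds concurrency here, which is the property the commit message is after when it avoids the global executor.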
2024-11-26 14:12:32 -03:00
Rafael dos Santos Silva 791fad1e6a
FEATURE: Index embeddings using bit vectors (#824)
On very large sites, the rare cache misses for Related Topics can take around 200ms, which affects our p99 metric on the topic page. In order to mitigate this impact, we now have several tools at our disposal.

The first is to migrate the index embedding type from halfvec to bit and change the related topic query to leverage the new bit index by switching the search algorithm from inner product to Hamming distance. This will reduce our index sizes by 90%, greatly reducing the impact of embeddings on our storage. By making the related query a bit smarter, we can have zero impact on recall by using the index to over-capture N*2 results, then re-ordering those N*2 using the full halfvec vectors and taking the top N. The expected impact is to go from 200ms to <20ms for cache misses and from a 2.5GB index to a 250MB index on a large site.
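
The over-capture and re-order step can be sketched in plain Ruby. This is an in-memory illustration only, not the SQL the plugin runs: it assumes vectors are binarized by sign (one bit per dimension), and the names `to_bits`, `hamming`, `inner_product`, and `related` are hypothetical.

```ruby
# Binarize a vector: bit i is set when component i is positive (assumed rule).
def to_bits(vector)
  vector.each_with_index.reduce(0) { |bits, (x, i)| x > 0 ? bits | (1 << i) : bits }
end

# Hamming distance between two bit patterns stored as integers.
def hamming(a, b)
  (a ^ b).to_s(2).count("1")
end

def inner_product(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Take the N*2 nearest candidates by Hamming distance on the cheap bit
# vectors, then re-rank them by inner product on the full vectors and
# keep the top N.
def related(query, corpus, n)
  query_bits = to_bits(query)
  candidates = corpus.min_by(n * 2) { |_id, vec| hamming(query_bits, to_bits(vec)) }
  candidates.max_by(n) { |_id, vec| inner_product(query, vec) }.map(&:first)
end

query = [0.9, 0.1, -0.3, 0.4]
corpus = {
  a: [0.8, 0.2, -0.1, 0.5],
  b: [-0.7, 0.3, 0.2, -0.6],
  c: [0.5, -0.4, -0.2, 0.1],
  d: [0.1, 0.9, 0.6, 0.2],
}
p related(query, corpus, 2)
```

The coarse pass only has to get the true top N somewhere inside the N*2 candidates; the exact ordering comes from the full-precision pass.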

Another tool is migrating our index type from IVFFLAT to HNSW, which can improve cache-miss performance even further, eventually putting us in under-5ms territory.

Co-authored-by: Roman Rizzi <roman@discourse.org>
2024-10-14 13:26:03 -03:00
Sam 584753cf60
FIX: we were never reindexing old content (#786)
* FIX: we were never reindexing old content

Embedding backfill contains logic that searches for changes to old
content and then backfills them.

Unfortunately, it unconditionally excluded every topic that already had
an embedding, so no backfill ever happened.

This change adds a test and ensures we backfill.

* Over-select results; this makes it more likely that we find
AI results when filtered
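
A minimal sketch of the reindexing bug described in the first item of this commit, under stated assumptions: the record shape (`Doc` with `updated_at` and `embedded_at`) and both predicates are hypothetical and only illustrate the difference between excluding every embedded record and excluding only the up-to-date ones.

```ruby
# Hypothetical record: when the content last changed and when it was embedded.
Doc = Struct.new(:id, :updated_at, :embedded_at)

# Buggy selection: anything that already has an embedding is skipped,
# so changed content is never picked up again.
def buggy_needs_embedding?(doc)
  doc.embedded_at.nil?
end

# Intended selection: also pick up records whose embedding is older
# than the latest content change.
def needs_embedding?(doc)
  doc.embedded_at.nil? || doc.embedded_at < doc.updated_at
end

now = Time.at(1_700_000_000)
docs = [
  Doc.new(1, now - 100, nil),       # never embedded
  Doc.new(2, now - 100, now - 50),  # embedding is newer than the content
  Doc.new(3, now - 10, now - 50),   # content changed after it was embedded
]

buggy_ids = docs.select { |d| buggy_needs_embedding?(d) }.map(&:id)
fixed_ids = docs.select { |d| needs_embedding?(d) }.map(&:id)
puts "buggy: #{buggy_ids.inspect}, fixed: #{fixed_ids.inspect}"
```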
2024-08-30 14:37:55 +10:00
Rafael dos Santos Silva fd6fcfdb61
DEV: Increase embeddings backfill job frequency (#453)
The idea is to increase the frequency so we can run with smaller batch sizes.
Big batches cause problems when running backups, so it's better to have shorter but
more frequent jobs.
2024-01-31 15:09:39 -03:00
Sam dcafc8032f
FIX: improve embedding generation (#452)
1. On failure we were queuing a job to generate embeddings, but it had the wrong params. This is now fixed and covered by a test.
2. Backfill embeddings in order of bumped_at, so the newest content is embedded first; this is covered by a test.
3. Add a safeguard for the hidden site setting so that an embedding job run only allows batches of up to 50k.

Previously, old embeddings were updated in a random order; this change updates them in a consistent order.
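
Items 2 and 3 can be sketched together. This is an illustration under stated assumptions, not the plugin's query: `TopicRow`, `effective_limit`, and `backfill_batch` are hypothetical names, and ties on bumped_at are broken by id purely to keep the order deterministic.

```ruby
MAX_BATCH_SIZE = 50_000 # the 50k safeguard from item 3

# Hypothetical in-memory stand-in for a topic row.
TopicRow = Struct.new(:id, :bumped_at)

# Clamp whatever the setting asks for to the safeguard.
def effective_limit(requested_batch_size)
  [requested_batch_size, MAX_BATCH_SIZE].min
end

# Newest bumped_at first, so recent content is embedded before old content.
def backfill_batch(topics, requested_batch_size)
  topics
    .sort_by { |t| [-t.bumped_at.to_f, t.id] }
    .first(effective_limit(requested_batch_size))
end

now = Time.at(1_700_000_000)
topics = [
  TopicRow.new(1, now - 300),
  TopicRow.new(2, now),
  TopicRow.new(3, now - 60),
]
p backfill_batch(topics, 2).map(&:id)
```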
2024-01-31 10:38:47 -03:00
Rafael dos Santos Silva 04bc402aae
FEATURE: Setting to control per post embeddings (#439)
* FEATURE: Setting to control per post embeddings
2024-01-23 22:09:27 -03:00
Rafael dos Santos Silva 140359c2ef
FEATURE: Per post embeddings (#387) 2023-12-29 12:28:45 -03:00
Sam 6ddc17fd61
DEV: port directory structure to Zeitwerk (#319)
Prior to this change we relied on explicit loading for files in Discourse AI.

This had a few downsides:

- Busywork whenever you add a file (an extra `require_relative`)
- We were not keeping to conventions internally ... some places used OpenAI, others OpenAi
- The autoloader did not work, which led to lots of broken full-application reloads when developing.

This moves all of DiscourseAI into a Zeitwerk-compatible structure.

It also leaves some minimal amount of manual loading (automation - which is loading into an existing namespace that may or may not be there)

To avoid needing /lib/discourse_ai/... we mount a namespace, so we are able to keep /lib pointed at ::DiscourseAi

Various files were renamed to get around Zeitwerk rules and minimize usage of custom inflections

Though we can get custom inflections to work, it is not worth it: it would require a Discourse core patch, which means we would create a hard dependency.
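
The OpenAI versus OpenAi point comes down to how file names map to constant names. A minimal sketch, assuming the default rule simply capitalizes each underscore-separated part of the file name; `default_camelize` and `camelize_with_overrides` are hypothetical helpers written for illustration, not Zeitwerk's API.

```ruby
# Assumed default rule: split on underscores, capitalize each part, join.
def default_camelize(basename)
  basename.split("_").map(&:capitalize).join
end

# A custom inflection is an explicit per-file override of that rule.
def camelize_with_overrides(basename, overrides)
  overrides.fetch(basename) { default_camelize(basename) }
end

puts default_camelize("open_ai")
puts camelize_with_overrides("open_ai", { "open_ai" => "OpenAI" })
```

Under that rule a file named open_ai.rb is expected to define OpenAi, so keeping the OpenAI spelling would need an override for every such file, which is the cost the commit chose to avoid by renaming.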
2023-11-29 15:17:46 +11:00