Rafael dos Santos Silva 5e3f4e1b78
FEATURE: Embeddings to main db (#99)
* FEATURE: Embeddings to main db

This commit moves our embeddings store from an external configurable PostgreSQL
instance back into the main database. This is done to simplify the setup.

There is a migration that will try to import the external embeddings into
the main DB if it is configured and there are rows.

It removes support from embeddings models that aren't all_mpnet_base_v2 or OpenAI
text_embedding_ada_002. However it will now be easier to add new models.

It also now takes into account:
  - topic title
  - topic category
  - topic tags
  - replies (as much as the model allows)

We introduce an interface so we can eventually support multiple strategies
for handling long topics.

This PR severely damages the semantic search performance, but this is a
temporary until we can get adapt HyDE to make semantic search use the same
embeddings we have for semantic related with good performance.

Here we also have some ground work to add post level embeddings, but this
will be added in a future PR.

Please note that this PR will also block Discourse from booting / updating if 
this plugin is installed and the pgvector extension isn't available on the 
PostgreSQL instance Discourse uses.
2023-07-13 12:41:36 -03:00

65 lines
1.9 KiB
Ruby

# frozen_string_literal: true
module DiscourseAi
module Embeddings
class Manager
attr_reader :target, :model, :strategy
def initialize(target)
@target = target
@model =
DiscourseAi::Embeddings::Models::Base.subclasses.find do
_1.name == SiteSetting.ai_embeddings_model
end
@strategy = DiscourseAi::Embeddings::Strategies::Truncation.new(@target, @model)
end
def generate!
@strategy.process!
# TODO bail here if we already have an embedding with matching version and digest
@embeddings = @model.generate_embeddings(@strategy.processed_target)
persist!
end
def persist!
begin
DB.exec(
<<~SQL,
INSERT INTO ai_topic_embeddings_#{table_suffix} (topic_id, model_version, strategy_version, digest, embeddings, created_at, updated_at)
VALUES (:topic_id, :model_version, :strategy_version, :digest, '[:embeddings]', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)
ON CONFLICT (topic_id)
DO UPDATE SET
model_version = :model_version,
strategy_version = :strategy_version,
digest = :digest,
embeddings = '[:embeddings]',
updated_at = CURRENT_TIMESTAMP
SQL
topic_id: @target.id,
model_version: @model.version,
strategy_version: @strategy.version,
digest: @strategy.digest,
embeddings: @embeddings,
)
rescue PG::Error => e
Rails.logger.error(
"Error #{e} persisting embedding for topic #{topic.id} and model #{model.name}",
)
end
end
def table_suffix
"#{@model.id}_#{@strategy.id}"
end
def topic_embeddings_table
"ai_topic_embeddings_#{table_suffix}"
end
end
end
end