discourse-ai/lib/modules/embeddings/semantic_search.rb
Rafael dos Santos Silva 2c0f535bab
FEATURE: HyDE-powered semantic search. (#136)
* FEATURE: HyDE-powered semantic search.

It relies on the new outlet added on discourse/discourse#23390 to display semantic search results in an unobtrusive way.

We'll use a HyDE-backed approach for semantic search, which consists on generating an hypothetical document from a given keywords, which gets transformed into a vector and used in a asymmetric similarity topic search.

This PR also reorganizes the internals to have less moving parts, maintaining one hierarchy of DAOish classes for vector-related operations like transformations and querying.

Completions and vectors created by HyDE will remain cached on Redis for now, but we could later use Postgres instead.

* Missing translation and rate limiting

---------

Co-authored-by: Roman Rizzi <rizziromanalejandro@gmail.com>
2023-09-05 11:08:23 -03:00

69 lines
2.0 KiB
Ruby

# frozen_string_literal: true
module DiscourseAi
module Embeddings
class SemanticSearch
def self.clear_cache_for(query)
digest = OpenSSL::Digest::SHA1.hexdigest(query)
Discourse.cache.delete("hyde-doc-#{digest}")
Discourse.cache.delete("hyde-doc-embedding-#{digest}")
end
def initialize(guardian)
@guardian = guardian
end
def cached_query?(query)
digest = OpenSSL::Digest::SHA1.hexdigest(query)
Discourse.cache.read("hyde-doc-embedding-#{digest}").present?
end
def search_for_topics(query, page = 1)
max_results_per_page = 50
limit = [Search.per_filter, max_results_per_page].min + 1
offset = (page - 1) * limit
strategy = DiscourseAi::Embeddings::Strategies::Truncation.new
vector_rep =
DiscourseAi::Embeddings::VectorRepresentations::Base.current_representation(strategy)
digest = OpenSSL::Digest::SHA1.hexdigest(query)
hypothetical_post =
Discourse
.cache
.fetch("hyde-doc-#{digest}", expires_in: 1.week) do
hyde_generator = DiscourseAi::Embeddings::HydeGenerators::Base.current_hyde_model.new
hyde_generator.hypothetical_post_from(query)
end
hypothetical_post_embedding =
Discourse
.cache
.fetch("hyde-doc-embedding-#{digest}", expires_in: 1.week) do
vector_rep.vector_from(hypothetical_post)
end
candidate_topic_ids =
vector_rep.asymmetric_topics_similarity_search(
hypothetical_post_embedding,
limit: limit,
offset: offset,
)
::Post
.where(post_type: ::Topic.visible_post_types(guardian.user))
.public_posts
.where("topics.visible")
.where(topic_id: candidate_topic_ids, post_number: 1)
.order("array_position(ARRAY#{candidate_topic_ids}, topic_id)")
end
private
attr_reader :guardian
end
end
end