discourse-ai/lib/modules/embeddings/semantic_search.rb

# frozen_string_literal: true

module DiscourseAi
  module Embeddings
    class SemanticSearch
      def self.clear_cache_for(query)
        digest = OpenSSL::Digest::SHA1.hexdigest(query)

        Discourse.cache.delete("hyde-doc-#{digest}")
        Discourse.cache.delete("hyde-doc-embedding-#{digest}")
      end

      def initialize(guardian)
        @guardian = guardian
      end

      def cached_query?(query)
        digest = OpenSSL::Digest::SHA1.hexdigest(query)
        Discourse.cache.read("hyde-doc-embedding-#{digest}").present?
      end

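      def search_for_topics(query, page = 1)
        max_results_per_page = 50
        # The page size is capped at max_results_per_page, plus one extra row.
        # Note that the offset is computed from this padded limit.
        limit = [Search.per_filter, max_results_per_page].min + 1
        offset = (page - 1) * limit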

        strategy = DiscourseAi::Embeddings::Strategies::Truncation.new
        vector_rep =
          DiscourseAi::Embeddings::VectorRepresentations::Base.current_representation(strategy)

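        # Both the generated post and its embedding are cached for a week under
        # keys derived from the SHA1 digest of the raw query string.
        digest = OpenSSL::Digest::SHA1.hexdigest(query)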

        hypothetical_post =
          Discourse
            .cache
            .fetch("hyde-doc-#{digest}", expires_in: 1.week) do
              hyde_generator = DiscourseAi::Embeddings::HydeGenerators::Base.current_hyde_model.new
              hyde_generator.hypothetical_post_from(query)
            end

        hypothetical_post_embedding =
          Discourse
            .cache
            .fetch("hyde-doc-embedding-#{digest}", expires_in: 1.week) do
              vector_rep.vector_from(hypothetical_post)
            end

        candidate_topic_ids =
          vector_rep.asymmetric_topics_similarity_search(
            hypothetical_post_embedding,
            limit: limit,
            offset: offset,
          )

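        # An empty candidate list would interpolate to `ARRAY[]` below, which
        # PostgreSQL rejects because it cannot infer the element type.
        return ::Post.none if candidate_topic_ids.blank?

        ::Post
          .where(post_type: ::Topic.visible_post_types(guardian.user))
          .public_posts
          .where("topics.visible")
          .where(topic_id: candidate_topic_ids, post_number: 1)
          .order("array_position(ARRAY#{candidate_topic_ids}, topic_id)")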
      end

      private

      attr_reader :guardian
    end
  end
end
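
The final `.order(...)` call in `search_for_topics` restores the similarity ranking, because `WHERE topic_id IN (...)` alone returns rows in no guaranteed order. Below is a minimal sketch of that idea in plain Ruby, with no database involved. The helper names `order_fragment` and `order_like_array_position` are hypothetical and exist only for this illustration; only the interpolation pattern is taken from the class above. The sketch also shows that an empty list interpolates to `ARRAY[]`, which PostgreSQL cannot type on its own.

```ruby
# Hypothetical helpers for illustration; not part of the plugin.

# Builds the ORDER BY fragment using the same string interpolation as the
# query in search_for_topics.
def order_fragment(candidate_topic_ids)
  "array_position(ARRAY#{candidate_topic_ids}, topic_id)"
end

# Pure-Ruby counterpart of that ordering: sort rows by the position of their
# topic_id in the ranked candidate list (assumes the ids are unique).
def order_like_array_position(rows, candidate_topic_ids)
  position = candidate_topic_ids.each_with_index.to_h
  rows.sort_by { |row| position.fetch(row[:topic_id]) }
end

ranked_ids = [42, 7, 19]
rows = [{ topic_id: 7 }, { topic_id: 19 }, { topic_id: 42 }]

puts order_fragment(ranked_ids)
# prints: array_position(ARRAY[42, 7, 19], topic_id)

puts order_like_array_position(rows, ranked_ids).map { |row| row[:topic_id] }.inspect
# prints: [42, 7, 19]

puts order_fragment([])
# prints: array_position(ARRAY[], topic_id)
```

The interpolation is only safe here because the ids are integers produced by the similarity search, not user input.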

Commits touching this file (oldest first):

2023-03-31 15:29:56 -03:00
FEATURE: Semantic asymmetric full-page search (#34)
Depends on discourse/discourse#20915. Hooks into the full-page-search component using an experimental API and performs an asymmetric similarity search using our embeddings database.

2023-07-13 12:41:36 -03:00
FEATURE: Embeddings to main db (#99)
This commit moves our embeddings store from an external configurable PostgreSQL instance back into the main database. This is done to simplify the setup. There is a migration that will try to import the external embeddings into the main DB if it is configured and there are rows.
It removes support for embeddings models that aren't all_mpnet_base_v2 or OpenAI text_embedding_ada_002. However, it will now be easier to add new models.
It also now takes into account:
- topic title
- topic category
- topic tags
- replies (as much as the model allows)
We introduce an interface so we can eventually support multiple strategies for handling long topics.
This PR severely damages semantic search performance, but this is temporary until we can adapt HyDE to make semantic search use the same embeddings we have for semantic related with good performance.
Here we also have some groundwork to add post-level embeddings, but this will be added in a future PR.
Please note that this PR will also block Discourse from booting / updating if this plugin is installed and the pgvector extension isn't available on the PostgreSQL instance Discourse uses.

2023-09-05 11:08:23 -03:00
FEATURE: HyDE-powered semantic search. (#136)
It relies on the new outlet added in discourse/discourse#23390 to display semantic search results in an unobtrusive way.
We'll use a HyDE-backed approach for semantic search, which consists of generating a hypothetical document from the given keywords, which gets transformed into a vector and used in an asymmetric similarity topic search.
This PR also reorganizes the internals to have fewer moving parts, maintaining one hierarchy of DAO-ish classes for vector-related operations like transformations and querying.
Completions and vectors created by HyDE will remain cached in Redis for now, but we could later use Postgres instead.
* Missing translation and rate limiting
Co-authored-by: Roman Rizzi <rizziromanalejandro@gmail.com>
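
The HyDE flow described in the commit message for #136 (generate a hypothetical document from the keywords, embed it, then run a similarity search, caching both intermediate results) can be sketched without Discourse or an LLM. Everything below is an assumption made for illustration: `MemoryCache` stands in for `Discourse.cache`, `toy_hypothetical_post` stands in for the model call, and `toy_embedding` is a word-count vector over a fixed vocabulary rather than a real embedding model. Only the cache key names mirror the class above. The standard library `Digest::SHA1` is used in place of `OpenSSL::Digest::SHA1`; both compute the same hex digest.

```ruby
require "digest"

# Hypothetical in-memory stand-in for Discourse.cache (no expiry handling).
class MemoryCache
  def initialize
    @store = {}
  end

  # Returns the cached value, or stores and returns the block's result.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @store[key] = yield
  end

  def key?(key)
    @store.key?(key)
  end
end

# Toy vocabulary and embedding: per-term word counts. Illustration only.
VOCAB = %w[install plugin theme backup restore].freeze

def toy_embedding(text)
  words = text.downcase.scan(/[a-z]+/)
  VOCAB.map { |term| words.count(term).to_f }
end

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Stand-in for the LLM call that writes a hypothetical answer post.
def toy_hypothetical_post(query)
  "To #{query}, first take a backup, then restore it on the new server."
end

# topics is a Hash of topic id => text. Returns the ids ranked by similarity
# to the embedding of the hypothetical post, not of the raw query.
def hyde_search(query, topics, cache, limit: 2)
  digest = Digest::SHA1.hexdigest(query)
  post = cache.fetch("hyde-doc-#{digest}") { toy_hypothetical_post(query) }
  embedding = cache.fetch("hyde-doc-embedding-#{digest}") { toy_embedding(post) }

  topics
    .map { |id, text| [id, cosine(embedding, toy_embedding(text))] }
    .sort_by { |id, score| [-score, id] }
    .first(limit)
    .map(&:first)
end

topics = {
  1 => "How to install a plugin",
  2 => "Backup and restore guide",
  3 => "Theme colors",
}

puts hyde_search("migrate my forum", topics, MemoryCache.new).inspect
# prints: [2, 1]
```

In this toy setup the query "migrate my forum" shares no vocabulary with any topic, so embedding it directly would give a zero vector. The hypothetical post mentions "backup" and "restore", which is what lets topic 2 rank first.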