discourse-ai/lib/tasks/modules/embeddings/database.rake

# frozen_string_literal: true

desc "Backfill embeddings for all topics"
task "ai:embeddings:backfill", [:start_topic] => [:environment] do |_, args|
  # Restrict the backfill to topics in publicly readable categories.
  public_categories = Category.where(read_restricted: false).pluck(:id)

  strategy = DiscourseAi::Embeddings::Strategies::Truncation.new
  vector_rep =
    DiscourseAi::Embeddings::VectorRepresentations::Base.find_vector_representation.new(strategy)
  table_name = vector_rep.table_name

  Topic
    .joins("LEFT JOIN #{table_name} ON #{table_name}.topic_id = topics.id")
    .where("#{table_name}.topic_id IS NULL")
    .where("topics.id >= ?", args[:start_topic].to_i) # nil.to_i is 0, so no fallback is needed
    .where("category_id IN (?)", public_categories)
    .where(deleted_at: nil)
    .order("topics.id ASC")
    .find_each do |t|
      print "."
      vector_rep.generate_topic_representation_from(t)
    end
end

desc "Creates indexes for embeddings"
task "ai:embeddings:index", [:work_mem] => [:environment] do |_, args|
  # Using extension maintainer's recommendation for ivfflat indexes
  # Results are not as good as without indexes, but it's much faster
  # Disk usage is ~1x the size of the table, so this doubles table total size
  count = Topic.count
  lists = count < 1_000_000 ? count / 1000 : Math.sqrt(count).to_i
  probes = count < 1_000_000 ? lists / 10 : Math.sqrt(lists).to_i

  # Same namespace as the backfill task above (VectorRepresentations, not Vectors).
  vector_representation_klass =
    DiscourseAi::Embeddings::VectorRepresentations::Base.find_vector_representation
  strategy = DiscourseAi::Embeddings::Strategies::Truncation.new

  DB.exec("SET work_mem TO '#{args[:work_mem] || "1GB"}';")
  vector_representation_klass.new(strategy).create_index(lists, probes)
  DB.exec("RESET work_mem;")
  DB.exec("SET ivfflat.probes = #{probes};")
end
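
The `lists` and `probes` arithmetic in the index task can be tried in isolation. The sketch below is a standalone illustration, not part of the plugin: `ivfflat_params` is a hypothetical helper name, and the lower bound of 1 on both values is an added assumption (on a forum with fewer than 1,000 topics the integer division in the task yields 0).

```ruby
# Hypothetical helper, not part of discourse-ai: mirrors the lists/probes
# arithmetic used by the "ai:embeddings:index" task.
def ivfflat_params(count)
  if count < 1_000_000
    lists = count / 1000          # integer division, as in the task
    probes = lists / 10
  else
    lists = Math.sqrt(count).to_i
    probes = Math.sqrt(lists).to_i
  end

  # Assumption: clamp to at least 1 so small forums do not end up with 0.
  { lists: [lists, 1].max, probes: [probes, 1].max }
end

[500, 250_000, 4_000_000].each do |count|
  params = ivfflat_params(count)
  puts "#{count} topics: lists=#{params[:lists]} probes=#{params[:probes]}"
end
```

With 250,000 topics the unclamped formula gives 250 lists and 25 probes; the clamp only changes the result for very small topic counts.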