Roman Rizzi 1f1c94e5c6
FEATURE: AI Bot RAG support. (#537)
This PR lets you associate uploads with an AI persona; we'll split them into fragments and generate embeddings from those. When building the system prompt for a bot reply, we'll run a similarity search followed by re-ranking (when a re-ranker is available). This lets us find the most relevant fragments from the body of knowledge you associated with the persona, resulting in better, more informed responses.

For now, we'll only allow plain-text files, but this will change in the future.
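As a rough illustration of the retrieval step, here is a minimal sketch of ranking fragments by cosine similarity against a query embedding. All names here (`cosine_similarity`, `top_fragments`, the fragment hashes) are invented for the example; the plugin's actual implementation searches embeddings stored in Postgres.

```ruby
# Cosine similarity between two embedding vectors (illustrative helper).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Rank stored fragments by similarity to the query embedding, keep the top k.
def top_fragments(query_embedding, fragments, k: 3)
  fragments
    .map { |frag| [frag, cosine_similarity(query_embedding, frag[:embedding])] }
    .max_by(k) { |_frag, score| score }
    .map { |frag, _score| frag[:text] }
end

fragments = [
  { text: "Ruby uses GC", embedding: [0.9, 0.1, 0.0] },
  { text: "Rails routing", embedding: [0.1, 0.9, 0.0] },
  { text: "Postgres indexes", embedding: [0.0, 0.2, 0.9] },
]

puts top_fragments([1.0, 0.0, 0.0], fragments, k: 2).inspect
# => ["Ruby uses GC", "Rails routing"]
```

In the real feature, the top-ranked fragments would then be passed through the re-ranker (if configured) before being spliced into the system prompt.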

Commits:

* FEATURE: RAG embeddings for the AI Bot

This first commit introduces a UI where admins can upload text files, which we'll store, split into fragments,
and generate embeddings from. In a follow-up commit, we'll use those embeddings to give the bot additional
information during conversations.

* Basic asymmetric similarity search to provide guidance in system prompt

* Fix tests and lint

* Apply reranker to fragments

* Uploads filter, CSS adjustments, and file validations

* Add placeholder for rag fragments

* Update annotations
2024-04-01 13:43:34 -03:00
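The "split into fragments" step from the first commit can be sketched as a simple overlapping word window. This is illustrative only; `split_into_fragments` and its parameters are invented for the example and are not the plugin's real chunker.

```ruby
# Chop a document into overlapping word-window fragments before embedding.
# Overlap preserves context that would otherwise be cut at fragment borders.
def split_into_fragments(text, window: 5, overlap: 2)
  raise ArgumentError, "overlap must be smaller than window" if overlap >= window

  words = text.split
  fragments = []
  step = window - overlap

  (0...words.size).step(step) do |i|
    fragments << words[i, window].join(" ")
    break if i + window >= words.size
  end

  fragments
end

puts split_into_fragments("a b c d e f g h", window: 4, overlap: 1).inspect
# => ["a b c d", "d e f g", "g h"]
```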


# frozen_string_literal: true
module DiscourseAi
  module Embeddings
    module Strategies
      # Builds the text we generate embeddings from by concatenating the
      # target's content and truncating it to the tokenizer's maximum length.
      class Truncation
        def id
          1
        end

        def version
          1
        end

        # Returns the embedding input for a Topic, Post, or RagDocumentFragment,
        # truncated to at most +max_length+ tokens.
        def prepare_text_from(target, tokenizer, max_length)
          case target
          when Topic
            topic_truncation(target, tokenizer, max_length)
          when Post
            post_truncation(target, tokenizer, max_length)
          when RagDocumentFragment
            tokenizer.truncate(target.fragment, max_length)
          else
            raise ArgumentError, "Invalid target type: #{target.class}"
          end
        end

        private

        # Title, category, and tags give the embedding extra context about the topic.
        def topic_information(topic)
          info = +""

          if topic&.title.present?
            info << topic.title
            info << "\n\n"
          end

          if topic&.category&.name.present?
            info << topic.category.name
            info << "\n\n"
          end

          if SiteSetting.tagging_enabled && topic&.tags.present?
            info << topic.tags.pluck(:name).join(", ")
            info << "\n\n"
          end

          info
        end

        def topic_truncation(topic, tokenizer, max_length)
          text = +topic_information(topic)

          if topic&.topic_embed&.embed_content_cache&.present?
            text << Nokogiri::HTML5.fragment(topic.topic_embed.embed_content_cache).text
            text << "\n\n"
          end

          topic.posts.find_each do |post|
            text << Nokogiri::HTML5.fragment(post.cooked).text
            # TODO: keep a running token count instead of re-tokenizing the
            # accumulated text on every iteration.
            break if tokenizer.size(text) >= max_length
            text << "\n\n"
          end

          tokenizer.truncate(text, max_length)
        end

        def post_truncation(post, tokenizer, max_length)
          text = +topic_information(post.topic)

          if post.is_first_post? && post.topic&.topic_embed&.embed_content_cache&.present?
            text << Nokogiri::HTML5.fragment(post.topic.topic_embed.embed_content_cache).text
          else
            text << Nokogiri::HTML5.fragment(post.cooked).text
          end

          tokenizer.truncate(text, max_length)
        end
      end
    end
  end
end
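To see how the accumulate-then-truncate loop in `topic_truncation` behaves, here is a minimal stand-alone sketch with a fake word-counting tokenizer standing in for the plugin's real tokenizer. All names in this sketch (`WordTokenizer`, `truncate_posts`) are hypothetical.

```ruby
# Hypothetical word-based tokenizer: "tokens" are whitespace-separated words.
class WordTokenizer
  def size(text)
    text.split.size
  end

  def truncate(text, max_length)
    text.split.first(max_length).join(" ")
  end
end

# Mirrors the loop in topic_truncation: append post bodies until the token
# budget is reached, then truncate the accumulated text to the budget.
def truncate_posts(posts, tokenizer, max_tokens)
  text = +""
  posts.each do |post|
    text << post
    break if tokenizer.size(text) >= max_tokens
    text << "\n\n"
  end
  tokenizer.truncate(text, max_tokens)
end

posts = ["first post body here", "second post body", "third post"]
puts truncate_posts(posts, WordTokenizer.new, 6)
# => first post body here second post
```

Note the loop intentionally overshoots by at most one post before breaking; the final `truncate` call trims the text back to the budget.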