FEATURE: AI Bot RAG support. (#537)
This PR lets you associate uploads with an AI persona. We split each upload into fragments and generate embeddings for them. When building the system prompt for a bot reply, we run a similarity search followed by re-ranking (if a reranker is available). This finds the most relevant fragments from the body of knowledge associated with the persona, resulting in better-informed responses.
For now, only plain-text files are allowed, but this will change in the future.
Commits:
* FEATURE: RAG embeddings for the AI Bot
This first commit introduces a UI where admins can upload text files, which we store, split into fragments,
and generate embeddings for. A later commit uses those embeddings to give the bot additional information during
conversations.
* Basic asymmetric similarity search to provide guidance in system prompt
* Fix tests and lint
* Apply reranker to fragments
* Uploads filter, css adjustments and file validations
* Add placeholder for rag fragments
* Update annotations
2024-04-01 12:43:34 -04:00
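The retrieval flow described in the PR summary (similarity search, then optional re-ranking) can be sketched roughly as below. This is a minimal illustration, not the plugin's implementation: `Fragment`, `cosine_similarity`, `relevant_fragments`, the toy three-dimensional embeddings, and the `reranker` callable are all hypothetical names and data made up for the example.

```ruby
# Hypothetical illustration only: none of these names come from the plugin.
Fragment = Struct.new(:text, :embedding)

# Cosine similarity between two equally sized numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Returns the `limit` fragments closest to the query embedding.
# `reranker` is optional: any callable that takes the candidate list and
# returns it in a new order.
def relevant_fragments(query_embedding, fragments, limit: 2, reranker: nil)
  candidates =
    fragments
      .sort_by { |f| -cosine_similarity(query_embedding, f.embedding) }
      .first(limit * 2) # over-fetch so a reranker has something to reorder
  candidates = reranker.call(candidates) if reranker
  candidates.first(limit)
end

fragments = [
  Fragment.new("Refunds are processed within 5 days.", [0.9, 0.1, 0.0]),
  Fragment.new("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
  Fragment.new("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]

top = relevant_fragments([1.0, 0.0, 0.0], fragments, limit: 2)
puts top.map(&:text)
```

With these toy vectors the two refund-related fragments rank first. Passing a `reranker` lambda changes only the final ordering of the over-fetched candidates, which mirrors the "if available" wording in the summary.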
# frozen_string_literal: true

module ::Jobs
  class DigestRagUpload < ::Jobs::Base
    CHUNK_SIZE = 1024
    CHUNK_OVERLAP = 64
    MAX_FRAGMENTS = 100_000

    # TODO(roman): Add a way to automatically recover from errors that leave uploads unindexed.
    def execute(args)
      return if (upload = Upload.find_by(id: args[:upload_id])).nil?

      target_type = args[:target_type]
      target_id = args[:target_id]

      return if !target_type || !target_id

      target = target_type.constantize.find_by(id: target_id)
      return if !target

      truncation = DiscourseAi::Embeddings::Strategies::Truncation.new
      vector_rep =
        DiscourseAi::Embeddings::VectorRepresentations::Base.current_representation(truncation)

      tokenizer = vector_rep.tokenizer
      chunk_tokens = target.rag_chunk_tokens
      overlap_tokens = target.rag_chunk_overlap_tokens

      fragment_ids = RagDocumentFragment.where(target: target, upload: upload).pluck(:id)

      # Check if this is the first time we process this upload.
      if fragment_ids.empty?
        document = get_uploaded_file(upload)
        return if document.nil?

        RagDocumentFragment.publish_status(upload, { total: 0, indexed: 0, left: 0 })

        fragment_ids = []
        idx = 0

        ActiveRecord::Base.transaction do
          chunk_document(
            file: document,
            tokenizer: tokenizer,
            chunk_tokens: chunk_tokens,
            overlap_tokens: overlap_tokens,
          ) do |chunk, metadata|
            fragment_ids << RagDocumentFragment.create!(
              target: target,
              fragment: chunk,
              fragment_number: idx + 1,
              upload: upload,
              metadata: metadata,
            ).id

            idx += 1

            if idx > MAX_FRAGMENTS
              Rails.logger.warn("Upload #{upload.id} has too many fragments, truncating.")
              break
            end
          end
        end
      end

      fragment_ids.each_slice(50) do |slice|
        Jobs.enqueue(:generate_rag_embeddings, fragment_ids: slice)
      end
    end

    private

    def chunk_document(file:, tokenizer:, chunk_tokens:, overlap_tokens:)
      buffer = +""
      current_metadata = nil
      done = false
      overlap = ""

      # read well beyond one chunk at a time (10 bytes per token); generally this will be plenty
      read_size = chunk_tokens * 10

      while buffer.present? || !done
        if buffer.length < read_size
          read = file.read(read_size)
          done = true if read.nil?

          read = Encodings.to_utf8(read) if read

          buffer << (read || "")
        end

        # the buffer now holds about read_size characters or more, unless the file is nearly exhausted
        metadata_regex = /\[\[metadata (.*?)\]\]/m

        before_metadata, new_metadata, after_metadata = buffer.split(metadata_regex)
        to_chunk = nil

        if before_metadata.present?
          to_chunk = before_metadata
        elsif after_metadata.present?
          current_metadata = new_metadata
          to_chunk = after_metadata
          buffer = buffer.split(metadata_regex, 2).last
          overlap = ""
        else
          current_metadata = new_metadata
          buffer = buffer.split(metadata_regex, 2).last
          overlap = ""
          next
        end

        chunk, split_char = first_chunk(to_chunk, tokenizer: tokenizer, chunk_tokens: chunk_tokens)
        buffer = buffer[chunk.length..-1]

        processed_chunk = overlap + chunk

        processed_chunk.strip!
        processed_chunk.gsub!(/\n[\n]+/, "\n\n")

        yield processed_chunk, current_metadata

        current_chunk_tokens = tokenizer.encode(chunk)
        overlap_token_ids = current_chunk_tokens[-overlap_tokens..-1] || current_chunk_tokens

        overlap = ""

        while overlap_token_ids.present?
          begin
            padding = split_char
            padding = " " if padding.empty?
            overlap = tokenizer.decode(overlap_token_ids) + padding
            break if overlap.encoding == Encoding::UTF_8
          rescue StandardError
            # the token window may start mid-character; drop the leading token and retry
          end
          overlap_token_ids.shift
        end

        # remove the first word, it is probably truncated
        overlap = overlap.split(/\s/, 2).last.to_s.lstrip
      end
    end

    def first_chunk(text, chunk_tokens:, tokenizer:, splitters: ["\n\n", "\n", ".", ""])
      return text, " " if tokenizer.tokenize(text).length <= chunk_tokens

      splitters = splitters.find_all { |s| text.include?(s) }.compact

      buffer = +""
      split_char = nil

      splitters.each do |splitter|
        split_char = splitter

        text
          .split(split_char)
          .each do |part|
            break if tokenizer.tokenize(buffer + split_char + part).length > chunk_tokens
            buffer << split_char
            buffer << part
          end
        break if buffer.length > 0
      end

      [buffer, split_char]
    end

    def get_uploaded_file(upload)
      store = Discourse.store
      @file ||=
        if store.external?
          # Upload#filesize could be approximate.
          # Add two extra MB to make sure that we'll be able to download the upload.
          max_filesize = upload.filesize + 2.megabytes
          store.download(upload, max_file_size_kb: max_filesize)
        else
          File.open(store.path_for(upload))
        end
    end
  end
end
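
The chunking strategy used by `chunk_document` and `first_chunk` (prefer paragraph breaks, then line breaks, then sentence ends, and carry a few trailing tokens into the next chunk) can be sketched in isolation as below. This is a simplified stand-alone illustration under stated assumptions, not the job's code: `WordTokenizer` is a stand-in that treats each whitespace-separated word as one token, `leading_chunk` and `chunk_with_overlap` are hypothetical helper names, and `[[metadata ...]]` tags and streaming reads are left out.

```ruby
# Stand-in tokenizer (an assumption for this sketch): one token per word.
module WordTokenizer
  def self.tokenize(text)
    text.split
  end
end

# Returns the longest prefix of `text` that fits in `max_tokens`, preferring
# to cut at paragraph breaks, then line breaks, then sentence ends.
def leading_chunk(text, max_tokens:, splitters: ["\n\n", "\n", ". "])
  return text if WordTokenizer.tokenize(text).length <= max_tokens

  splitters.each do |splitter|
    taken = +""
    text.split(splitter).each do |part|
      candidate = taken.empty? ? part : taken + splitter + part
      break if WordTokenizer.tokenize(candidate).length > max_tokens
      taken = candidate
    end
    return taken unless taken.empty?
  end

  # No splitter produced a piece that fits: hard cut after `max_tokens` words.
  text[/\A\s*(?:\S+\s*){1,#{max_tokens}}/]
end

# Splits `text` into chunks, prefixing each chunk with the last `overlap`
# tokens of the previous one so neighbouring chunks share some context.
def chunk_with_overlap(text, max_tokens:, overlap:)
  chunks = []
  rest = text.strip
  carry = ""

  until rest.empty?
    chunk = leading_chunk(rest, max_tokens: max_tokens)
    chunks << (carry + chunk).strip
    carry = WordTokenizer.tokenize(chunk).last(overlap).join(" ") + " "
    # Drop the consumed prefix plus any separator left at the cut point.
    rest = rest[chunk.length..].to_s.sub(/\A[\s.]+/, "")
  end

  chunks
end

sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta.\n\nTheta iota."
puts chunk_with_overlap(sample, max_tokens: 4, overlap: 1).inspect
```

As in the job, the overlap is added on top of the chunk budget, so a chunk can hold up to `max_tokens + overlap` tokens. A real tokenizer would count subword tokens rather than words, which is why the job decodes token ids back to text when building the overlap.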