FEATURE: PDF support for rag pipeline (#1118)
This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:
**1. LLM Model Association for RAG and Personas:**
- **New Database Columns:** Adds `rag_llm_model_id` to both `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- **Migration:** Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) that populates the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` from the existing `default_llm` and `question_consolidator_llm` string columns, plus a post-migration that drops those string columns (a sketch of the backfill appears after this list).
- **Model Changes:** The `AiPersona` and `AiTool` models now `belong_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- **UI Updates:** The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
- **Serialization:** The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.
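To illustrate the backfill, here is a minimal sketch of what the data migration might look like, assuming the old string columns store identifiers in a `custom:<id>` form that maps onto `llm_models.id` (the string format and Rails version are assumptions, not taken from this PR):

```ruby
# Hedged sketch of the backfill; the 'custom:<id>' string format is an assumption.
class MigratePersonaToLlmModelId < ActiveRecord::Migration[7.1]
  def up
    execute <<~SQL
      UPDATE ai_personas
      SET default_llm_id = llm_models.id
      FROM llm_models
      WHERE ai_personas.default_llm = 'custom:' || llm_models.id
    SQL

    execute <<~SQL
      UPDATE ai_personas
      SET question_consolidator_llm_id = llm_models.id
      FROM llm_models
      WHERE ai_personas.question_consolidator_llm = 'custom:' || llm_models.id
    SQL
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end
```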
**2. PDF and Image Support for RAG:**
- **Site Setting:** Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- **File Upload Validation:** The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- **PDF Processing:** Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- **Image Processing:** A new utility class, `DiscourseAi::Utils::ImageToText`, handles OCR for uploaded images and for the page images produced from PDFs.
- **RAG Digestion Job:** The `DigestRagUpload` job now handles PDF and image uploads, using `PdfToImages` and `ImageToText` to extract text and create document fragments (see the sketch after this list).
- **UI Updates:** The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.
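As a rough illustration of how digestion might chain these utilities, the sketch below converts a PDF to page images and OCRs each one; the constructor arguments and method names on `PdfToImages` and `ImageToText` are assumptions rather than the plugin's actual API:

```ruby
# Illustrative sketch only; the exact class APIs are assumed, not confirmed.
def extract_rag_text(upload, rag_llm_model)
  pages =
    if upload.extension == "pdf"
      # ImageMagick converts each PDF page to a PNG upload.
      DiscourseAi::Utils::PdfToImages.new(upload: upload).uploaded_pages
    else
      [upload]
    end

  pages.map do |page|
    # OCR each page image using the dedicated RAG LLM.
    DiscourseAi::Utils::ImageToText.new(upload: page, llm_model: rag_llm_model).extracted_text
  end.join("\n")
end
```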
**3. Refactoring and Improvements:**
- **LLM Enumeration:** The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend; a usage sketch follows this list.
- **AI Helper:** The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- **Bot and Persona Updates:** Several updates across the codebase replace the old string-based LLM association with the new model-based (`LlmModel` id) association.
- **Audit Logs:** The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- **Eval Script:** An evaluation script is included.
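For example, a serializer can expose the available models to the admin UI via the enumerator; the wrapping serializer below is hypothetical, while the call and the id/name/vision_enabled shape come from the description above:

```ruby
# Hypothetical serializer showing where values_for_serialization would be consumed.
class AiRagOptionsSerializer < ApplicationSerializer
  attributes :rag_llms

  def rag_llms
    # => e.g. [{ id: 1, name: "claude-3-5-sonnet", vision_enabled: true }, ...]
    DiscourseAi::Configuration::LlmEnumerator.values_for_serialization
  end
end
```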
**4. Testing:**
- The PR introduces a new eval system for LLMs that lets us test how functionality behaves across various LLM providers. It lives in `/evals`; a minimal invocation sketch follows.
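A minimal invocation sketch, mirroring the `DiscourseAi::Evals::Runner` class defined below; `"my-eval-case"` is a hypothetical case id from `/evals/cases`, and how the LLM wrappers are built is not shown here:

```ruby
# llms must contain objects responding to #name and #vision? that Eval#run accepts.
llm = SomeLlmWrapper.new # hypothetical stand-in for however the eval CLI selects models
DiscourseAi::Evals::Runner.new(eval_name: "my-eval-case", llms: [llm]).run!
```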
# frozen_string_literal: true

# Stdlib dependencies used below; the eval bootstrap may already load these.
require "json"
require "logger"
require "fileutils"

class DiscourseAi::Evals::Runner
  # Collects a tree of timed steps and renders it as Chrome Trace Event Format JSON,
  # which can be loaded into ui.perfetto.dev for inspection.
  class StructuredLogger
    def initialize
      @log = []
      @current_step = @log
    end

    def log(name, args: nil, start_time: nil, end_time: nil)
      start_time ||= Time.now.utc
      end_time ||= Time.now.utc
      args ||= {}
      object = { name: name, args: args, start_time: start_time, end_time: end_time }
      @current_step << object
    end

    # Nests everything logged inside the block under a named, timed step.
    def step(name, args: nil)
      start_time = Time.now.utc
      start_step = @current_step

      new_step = { type: :step, name: name, args: args || {}, log: [], start_time: start_time }

      @current_step << new_step
      @current_step = new_step[:log]
      yield new_step
      @current_step = start_step
      new_step[:end_time] = Time.now.utc
    end

    def to_trace_event_json
      trace_events = []
      process_id = 1
      thread_id = 1

      to_trace_event(@log, process_id, thread_id, trace_events)

      JSON.pretty_generate({ traceEvents: trace_events })
    end

    private

    def to_trace_event(log_items, pid, tid, trace_events, parent_start_time = nil)
      log_items.each do |item|
        if item.is_a?(Hash) && item[:type] == :step
          trace_events << {
            name: item[:name],
            cat: "default",
            ph: "B", # Begin event
            pid: pid,
            tid: tid,
            args: item[:args],
            ts: timestamp_in_microseconds(item[:start_time]),
          }

          to_trace_event(item[:log], pid, tid, trace_events, item[:start_time])

          trace_events << {
            name: item[:name],
            cat: "default",
            ph: "E", # End event
            pid: pid,
            tid: tid,
            ts: timestamp_in_microseconds(item[:end_time]),
          }
        else
          trace_events << {
            name: item[:name],
            cat: "default",
            ph: "B",
            pid: pid,
            tid: tid,
            args: item[:args],
            ts: timestamp_in_microseconds(item[:start_time] || parent_start_time || Time.now.utc),
            s: "p", # Scope: process
          }
          trace_events << {
            name: item[:name],
            cat: "default",
            ph: "E",
            pid: pid,
            tid: tid,
            ts: timestamp_in_microseconds(item[:end_time] || Time.now.utc),
            s: "p",
          }
        end
      end
    end

    def timestamp_in_microseconds(time)
      (time.to_f * 1_000_000).to_i
    end
  end

  attr_reader :llms, :cases

  def self.evals_paths
    @eval_paths ||= Dir.glob(File.join(File.join(__dir__, "../cases"), "*/*.yml"))
  end

  def self.evals
    @evals ||= evals_paths.map { |path| DiscourseAi::Evals::Eval.new(path: path) }
  end

  def self.print
    evals.each(&:print)
  end

  def initialize(eval_name:, llms:)
    @llms = llms
    @eval = self.class.evals.find { |c| c.id == eval_name }

    if !@eval
      puts "Error: Unknown evaluation '#{eval_name}'"
      exit 1
    end

    if @llms.empty?
      puts "Error: no matching LLM models found"
      exit 1
    end
  end

  def run!
    puts "Running evaluation '#{@eval.id}'"

    structured_log_filename = "#{@eval.id}-#{Time.now.strftime("%Y%m%d-%H%M%S")}.json"
    log_filename = "#{@eval.id}-#{Time.now.strftime("%Y%m%d-%H%M%S")}.log"
    logs_dir = File.join(__dir__, "../log")
    FileUtils.mkdir_p(logs_dir)

    log_path = File.expand_path(File.join(logs_dir, log_filename))
    structured_log_path = File.expand_path(File.join(logs_dir, structured_log_filename))

    logger = Logger.new(File.open(log_path, "a"))
    logger.info("Starting evaluation '#{@eval.id}'")

    # Make both loggers available to completion endpoints for audit logging.
    Thread.current[:llm_audit_log] = logger
    structured_logger = Thread.current[:llm_audit_structured_log] = StructuredLogger.new

    structured_logger.step("Evaluating #{@eval.id}", args: @eval.to_json) do
      llms.each do |llm|
        if @eval.vision && !llm.vision?
          logger.info("Skipping LLM: #{llm.name} as it does not support vision")
          next
        end

        structured_logger.step("Evaluating with LLM: #{llm.name}") do |step|
          logger.info("Evaluating with LLM: #{llm.name}")
          print "#{llm.name}: "
          results = @eval.run(llm: llm)

          results.each do |result|
            step[:args] = result
            step[:cname] = result[:result] == :pass ? :good : :bad

            if result[:result] == :fail
              puts "Failed 🔴"
              puts "Error: #{result[:message]}" if result[:message]
              # deliberately commented out: it creates a lot of noise, but it is
              # sometimes useful when debugging
              # puts "Context: #{result[:context].to_s[0..2000]}" if result[:context]
              if result[:expected_output] && result[:actual_output]
                puts "---- Expected ----\n#{result[:expected_output]}"
                puts "---- Actual ----\n#{result[:actual_output]}"
              end
              logger.error("Evaluation failed with LLM: #{llm.name}")
              logger.error("Error: #{result[:message]}") if result[:message]
              logger.error("Expected: #{result[:expected_output]}") if result[:expected_output]
              logger.error("Actual: #{result[:actual_output]}") if result[:actual_output]
              logger.error("Context: #{result[:context]}") if result[:context]
            elsif result[:result] == :pass
              puts "Passed 🟢"
              logger.info("Evaluation passed with LLM: #{llm.name}")
            else
              STDERR.puts "Error: Unknown result #{result.inspect}"
              logger.error("Unknown result: #{result.inspect}")
            end
          end
        end
      end
    end

    # structured_logger.save(structured_log_path)

    File.write(structured_log_path, structured_logger.to_trace_event_json)

    puts
    puts "Log file: #{log_path}"
    puts "Structured log file (ui.perfetto.dev): #{structured_log_path}"

    # temp code
    # puts File.read(structured_log_path)
  end
end