discourse-ai/lib/utils/pdf_to_images.rb

# frozen_string_literal: true

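# Converts each page of a PDF upload into a PNG upload using ImageMagick (`magick`).
# Pages are extracted once and then reused through UploadReference records.
class DiscourseAi::Utils::PdfToImages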
  MAX_PDF_SIZE = 100.megabytes
  # Deliberately generous: magick can be slow (mutool can be faster), and
  # 10 minutes is enough for quite large PDFs.
  MAX_CONVERT_SECONDS = 600
  BACKOFF_SECONDS = [5, 30, 60]

  attr_reader :upload, :user

  def initialize(upload:, user:)
    @upload = upload
    @user = user
    # Reuse pages from an earlier extraction if any exist; nil defers extraction to first access.
    @uploaded_pages = UploadReference.where(target: upload).map(&:upload).presence
  end

  def uploaded_pages
    @uploaded_pages ||= extract_pages
  end

  def extract_pages
    begin
      pdf_path =
        if upload.local?
          Discourse.store.path_for(upload)
        else
          Discourse.store.download_safe(upload, max_file_size_kb: MAX_PDF_SIZE)&.path
        end

      raise Discourse::InvalidParameters.new("Failed to download PDF") if pdf_path.nil?

      temp_dir = Dir.mktmpdir("discourse-pdf-#{SecureRandom.hex(8)}")
      temp_pdf = File.join(temp_dir, "source.pdf")
      FileUtils.cp(pdf_path, temp_pdf)

      # Convert PDF to individual page images
      output_pattern = File.join(temp_dir, "page-%04d.png")

      command = [
        "magick",
        "-density",
        "300",
        temp_pdf,
        "-background",
        "white",
        "-auto-orient",
        "-quality",
        "85",
        output_pattern,
      ]

      Discourse::Utils.execute_command(
        *command,
        failure_message: "Failed to convert PDF to images",
        timeout: MAX_CONVERT_SECONDS,
      )

      uploads = []
      Dir
        .glob(File.join(temp_dir, "page-*.png"))
        .sort
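        .each do |page_path|
          # Use the block form of File.open so the handle is closed after the upload is created.
          # `page_upload` avoids shadowing the `upload` reader (the source PDF).
          page_upload =
            File.open(page_path) do |file|
              UploadCreator.new(file, "page-#{File.basename(page_path)}").create_for(@user.id)
            end

          uploads << page_upload
        end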

      # Create upload references
      UploadReference.ensure_exist!(upload_ids: uploads.map(&:id), target: @upload)

      @uploaded_pages = uploads
    ensure
      FileUtils.rm_rf(temp_dir) if temp_dir
    end
  end
end
FEATURE: PDF support for rag pipeline (#1118)

This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

1. LLM Model Association for RAG and Personas:

- New Database Columns: Adds `rag_llm_model_id` to both `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- Migration: Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) to populate the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` based on the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration to remove the latter.
- Model Changes: The `AiPersona` and `AiTool` models now `belong_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- UI Updates: The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
- Serialization: The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.

2. PDF and Image Support for RAG:

- Site Setting: Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- File Upload Validation: The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- PDF Processing: Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- Image Processing: A new utility class, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
- RAG Digestion Job: The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
- UI Updates: The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.

3. Refactoring and Improvements:

- LLM Enumeration: The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
- AI Helper: The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- Bot and Persona Updates: Several updates were made across the codebase, replacing the string-based association to an LLM with the new model-based one.
- Audit Logs: The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- Eval Script: An evaluation script is included.

4. Testing:

- The PR introduces a new eval system for LLMs, which allows us to test how functionality works across various LLM providers. This lives in `/evals`.

2025-02-14 12:15:07 +11:00
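
The PDF Processing item relies on two small conventions visible in `extract_pages`: the scratch directory is always removed in an `ensure` clause, and page files are named with a zero-padded counter (`page-%04d.png`) so that a plain `.sort` returns them in page order. The following is a minimal, standard-library-only sketch of those two ideas. The helper names (`with_scratch_dir`, `sorted_page_paths`, `fake_convert`, `page_names_for`) are hypothetical and not part of discourse-ai, and `fake_convert` merely stands in for the real `magick` call by touching empty files.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: yields a scratch directory and removes it afterwards,
# even if the block raises (mirrors the ensure/rm_rf step in extract_pages).
def with_scratch_dir(prefix)
  dir = Dir.mktmpdir(prefix)
  yield dir
ensure
  FileUtils.rm_rf(dir) if dir
end

# Hypothetical helper: returns page image paths in page order. Lexicographic
# sorting matches numeric order only because the counter is zero-padded.
def sorted_page_paths(dir)
  Dir.glob(File.join(dir, "page-*.png")).sort
end

# Stand-in for the converter: creates empty files named like "page-%04d.png".
def fake_convert(dir, page_numbers)
  page_numbers.each do |n|
    FileUtils.touch(File.join(dir, format("page-%04d.png", n)))
  end
end

# Runs the fake conversion in a scratch directory and returns the sorted
# base names; the directory is gone by the time this method returns.
def page_names_for(page_numbers)
  with_scratch_dir("pdf-pages-sketch") do |dir|
    fake_convert(dir, page_numbers)
    sorted_page_paths(dir).map { |path| File.basename(path) }
  end
end
```

Calling `page_names_for([10, 2, 1])` lists page 1 first and page 10 last. Without the zero padding, `page-10.png` would sort before `page-2.png`, which is presumably why the output pattern uses `%04d`.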

FEATURE: Native PDF support (#1127)

* FEATURE: Native PDF support

  This amends it so we use PDF Reader gem to extract text from PDFs

* This means that our simple pdf eval passes at last
* fix spec
* skip test in CI
* test file support
* Update lib/utils/image_to_text.rb

  Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>

* address pr comments

---------

Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>

2025-02-18 09:22:57 +11:00