discourse-ai/lib/summarization/strategies/chat_messages.rb

# frozen_string_literal: true

module DiscourseAi
  module Summarization
    module Strategies
      class ChatMessages < Base
        def type
          AiSummary.summary_types[:complete]
        end

        def highest_target_number
          nil # Chat summaries are not persisted, so there is no target number to track.
        end

        def initialize(target, since)
          super(target)
          @since = since
        end

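        # Returns the channel's messages from the last `since` hours, oldest first,
        # as hashes of the form { id:, poster:, text: }.
        def targets_data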
          target
            .chat_messages
            .where("chat_messages.created_at > ?", since.hours.ago)
            .includes(:user)
            .order(created_at: :asc)
            .pluck(:id, :username_lower, :message)
            .map { { id: _1, poster: _2, text: _3 } }
        end

        def summary_extension_prompt(summary, contents)
          input =
            contents
              .map { |item| "(#{item[:id]} #{item[:poster]} said: #{item[:text]})" }
              .join("\n")

          prompt = DiscourseAi::Completions::Prompt.new(<<~TEXT.strip)
            You are a summarization bot tasked with expanding on an existing summary by incorporating new chat messages.
            Your goal is to seamlessly integrate the additional information into the existing summary, preserving the clarity and insights of the original while reflecting any new developments, themes, or conclusions.
            Analyze the new messages to identify key themes, participants' intentions, and any significant decisions or resolutions.
            Update the summary to include these aspects in a way that remains concise, comprehensive, and accessible to someone with no prior context of the conversation.

            ### Guidelines:

            - Merge the new information naturally with the existing summary without redundancy.
            - Only include the updated summary, WITHOUT additional commentary.
            - Don't mention the channel title. Avoid extraneous details or subjective opinions.
            - Maintain the original language of the text being summarized.
            - The same user may write several messages in a row; don't treat them as different people.
            - Aim for summaries to be extended by a reasonable amount, but strive to maintain a total length of 400 words or less, unless absolutely necessary for comprehensiveness.

          TEXT

          prompt.push(type: :user, content: <<~TEXT.strip)
            ### Context:

            This is the existing summary:

            #{summary}

            These are the new chat messages:

            #{input}

            Integrate the new messages into the existing summary.
          TEXT

          prompt
        end

        def first_summary_prompt(contents)
          content_title = target.name
          input =
            contents
              .map { |item| "(#{item[:id]} #{item[:poster]} said: #{item[:text]})" }
              .join("\n")

          prompt = DiscourseAi::Completions::Prompt.new(<<~TEXT.strip)
            You are a summarization bot designed to generate clear and insightful paragraphs that convey the main topics
            and developments from a series of chat messages within a user-selected time window.

            Analyze the messages to extract key themes, participants' intentions, and any significant conclusions or decisions.
            Your summary should be concise yet comprehensive, providing an overview that is accessible to someone with no prior context of the conversation.

            - Only include the summary, WITHOUT additional commentary.
            - Don't mention the channel title. Avoid including extraneous details or subjective opinions.
            - Maintain the original language of the text being summarized.
            - The same user may write several messages in a row; don't treat them as different people.
            - Aim for summaries to be 400 words or less.

          TEXT

          prompt.push(type: :user, content: <<~TEXT.strip)
            #{content_title.present? ? "The name of the channel is: " + content_title + ".\n" : ""}
            
            Here are the messages, inside <input></input> XML tags:

            <input>
              #{input}
            </input>

            Generate a summary of the given chat messages.
          TEXT

          prompt
        end

        private

        attr_reader :since
      end
    end
  end
end
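
To make the data flow concrete, here is a minimal sketch in plain Ruby (no Rails or DiscourseAi classes) of how hashes shaped like the output of `targets_data` could be turned into the `input` text and the user turn of `first_summary_prompt`. The `format_messages` and `user_turn` helpers and the sample messages are hypothetical, introduced only for illustration; the line format follows the interpolation used in the strategy, with the parenthesis closed.

```ruby
# Hypothetical stand-in for the rows produced by `targets_data`.
MESSAGES = [
  { id: 1, poster: "alice", text: "Deploy is scheduled for Friday." },
  { id: 2, poster: "bob", text: "I'll update the changelog." },
].freeze

# Hypothetical helper: one "(id poster said: text)" line per message.
def format_messages(contents)
  contents
    .map { |item| "(#{item[:id]} #{item[:poster]} said: #{item[:text]})" }
    .join("\n")
end

# Hypothetical helper that assembles the user turn. The squiggly heredoc
# strips the common indentation, and `strip` removes the leading blank
# line that is left behind when there is no channel name.
def user_turn(channel_name, contents)
  title_line =
    channel_name.to_s.empty? ? "" : "The name of the channel is: #{channel_name}.\n"

  <<~TEXT.strip
    #{title_line}
    Here are the messages, inside <input></input> XML tags:

    <input>
    #{format_messages(contents)}
    </input>

    Generate a summary of the given chat messages.
  TEXT
end

puts user_turn("general", MESSAGES)
```

Joining with a newline keeps one message per line, which makes the boundaries between messages unambiguous for the model.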
Blame summary (each commit listed once; the code itself is shown in full above):

REFACTOR: Support of different summarization targets/prompts. (#835)
2024-10-15 13:53:26 -03:00
Attributed: the class skeleton, `type`, `initialize`, most of `targets_data`, the `Prompt.new` and `prompt.push` calls, and the `first_summary_prompt` system prompt.

* DEV: Add summary types
* Refactor for different summary types
* Use enum for summary types
* Update lib/summarization/strategies/topic_summary.rb
* Update lib/summarization/strategies/topic_gist.rb
* Update lib/summarization/strategies/chat_messages.rb
* Fix chat_messages single prompt
* Small tweak to the chat summarization prompt

Co-authored-by: Penar Musaraj <pmusaraj@gmail.com>

FIX/REFACTOR: FoldContent revamp (#866)
2024-10-25 11:51:17 -03:00
Attributed: the `target` receiver in `targets_data`, the `input` formatting in both prompt methods, the `summary_extension_prompt` system and user text, and the channel-name line in `first_summary_prompt`.

We hit a snag with our hot topic gist strategy: the regex we used to split the content didn't work, so we cannot send the original post separately. This was important for letting the model focus on what's new in the topic.

The algorithm doesn’t give us full control over how prompts are written, and figuring out how to format the content isn't straightforward. This means we're having to use more complicated workarounds, like regex.

To tackle this, I'm suggesting we simplify the approach a bit. Let's focus on summarizing as much as we can upfront, then gradually add new content until there's nothing left to summarize.

Also, the "extend" part is mostly for models with small context windows, which shouldn't pose a problem 99% of the time with the content volume we're dealing with.

* Fix fold docs
* Use #shift instead of #pop to get the first elem, not the last

DEV: Extend truncation to all summarizable content (#884)
2024-10-31 12:17:42 -03:00
Attributed: the signatures `summary_extension_prompt(summary, contents)` and `first_summary_prompt(contents)`.

FIX: Make summaries backfill job more resilient. (#1071)
2025-01-16 09:42:53 -03:00
Attributed: `highest_target_number`.

To quickly select backfill candidates without comparing SHAs, we compare the last summarized post to the topic's highest_post_number. However, hiding or deleting a post and adding a small action will update this column, causing the job to stall and re-generate the same summary repeatedly until someone posts a regular reply. On top of this, this is not always true for topics with `best_replies`, as this last reply isn't necessarily included.

Since this is not evident at first glance and each summarization strategy picks its targets differently, I'm opting to simplify the backfill logic and how we track potential candidates.

The first step is dropping `content_range`, which serves no purpose and it's there because summary caching was supposed to work differently at the beginning. So instead, I'm replacing it with a column called `highest_target_number`, which tracks `highest_post_number` for topics and could track other things like channel's `message_count` in the future.

Now that we have this column when selecting every potential backfill candidate, we'll check if the summary is truly outdated by comparing the SHAs, and if it's not, we just update the column and move on.
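
The FoldContent revamp message (#866) describes summarizing as much content as fits upfront and then extending the summary with the remaining content until nothing is left. The following is a rough sketch of that loop, not the actual FoldContent implementation: `token_count`, `take_batch!`, `fold`, the word-based budget, and the two lambdas standing in for model calls are all hypothetical. In the strategy above, the first call would correspond to `first_summary_prompt` and later calls to `summary_extension_prompt`.

```ruby
# Hypothetical tokenizer: whitespace-separated words stand in for tokens.
def token_count(text)
  text.split.size
end

# Removes items from the FRONT of the queue (`shift`, not `pop`) while they
# fit in the budget. Always takes at least one item so the loop progresses.
def take_batch!(queue, budget)
  batch = []
  used = 0
  while (item = queue.first)
    cost = token_count(item)
    break if batch.any? && used + cost > budget
    batch << queue.shift
    used += cost
  end
  batch
end

# Summarizes the first batch, then folds each following batch into the
# running summary until the queue is empty.
def fold(items, budget:, summarize:, extend_summary:)
  queue = items.dup
  summary = summarize.call(take_batch!(queue, budget))
  summary = extend_summary.call(summary, take_batch!(queue, budget)) until queue.empty?
  summary
end

# Fake "model calls" so the sketch runs without any LLM.
summarize = ->(batch) { batch.join(" | ") }
extend_summary = ->(summary, batch) { "#{summary} + #{batch.join(" | ")}" }

puts fold(["a b", "c d", "e f"], budget: 4, summarize: summarize, extend_summary: extend_summary)
```

When everything fits in the first batch, the extension step never runs, which matches the note in the commit message that the "extend" part mostly matters for models with small context windows.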