Roman Rizzi ec97996905
FIX/REFACTOR: FoldContent revamp (#866)
* FIX/REFACTOR: FoldContent revamp

We hit a snag with our hot topic gist strategy: the regex we used to split the content didn't work, so we cannot send the original post separately. This was important for letting the model focus on what's new in the topic.

The algorithm doesn’t give us full control over how prompts are written, and figuring out how to format the content isn't straightforward. This means we're having to use more complicated workarounds, like regex.

To tackle this, I'm suggesting we simplify the approach a bit. Let's focus on summarizing as much as we can upfront, then gradually add new content until there's nothing left to summarize.

Also, the "extend" part is mostly for models with small context windows, which shouldn't pose a problem 99% of the time with the content volume we're dealing with.

* Fix fold docs

* Use #shift instead of #pop to get the first elem, not the last
2024-10-25 11:51:17 -03:00

154 lines
5.8 KiB
Ruby

# frozen_string_literal: true
module DiscourseAi
module Summarization
module Strategies
class TopicSummary < Base
def type
AiSummary.summary_types[:complete]
end
def targets_data
posts_data =
(target.has_summary? ? best_replies : pick_selection).pluck(
:post_number,
:raw,
:username,
)
posts_data.reduce([]) do |memo, (pn, raw, username)|
raw_text = raw
if pn == 1 && target.topic_embed&.embed_content_cache.present?
raw_text = target.topic_embed&.embed_content_cache
end
memo << { poster: username, id: pn, text: raw_text }
end
end
def summary_extension_prompt(summary, contents)
resource_path = "#{Discourse.base_path}/t/-/#{target.id}"
content_title = target.title
input =
contents.map { |item| "(#{item[:id]} #{item[:poster]} said: #{item[:text]})" }.join
prompt = DiscourseAi::Completions::Prompt.new(<<~TEXT)
You are an advanced summarization bot tasked with enhancing an existing summary by incorporating additional posts.
### Guidelines:
- Only include the enhanced summary, without any additional commentary.
- Understand and generate Discourse forum Markdown; including links, _italics_, **bold**.
- Maintain the original language of the text being summarized.
- Aim for summaries to be 400 words or less.
- Each new post is formatted as "<POST_NUMBER>) <USERNAME> <MESSAGE>"
- Cite specific noteworthy posts using the format [NAME](#{resource_path}/POST_NUMBER)
- Example: link to the 3rd post by sam: [sam](#{resource_path}/3)
- Example: link to the 6th post by jane: [agreed with](#{resource_path}/6)
- Example: link to the 13th post by joe: [#13](#{resource_path}/13)
- When formatting usernames either use @USERNAME or [USERNAME](#{resource_path}/POST_NUMBER)
TEXT
prompt.push(type: :user, content: <<~TEXT.strip)
### Context:
#{content_title.present? ? "The discussion title is: " + content_title + ".\n" : ""}
Here is the existing summary:
#{summary}
Here are the new posts, inside <input></input> XML tags:
<input>
#{input}
</input>
Integrate the new information to generate an enhanced concise and coherent summary.
TEXT
prompt
end
def first_summary_prompt(contents)
resource_path = "#{Discourse.base_path}/t/-/#{target.id}"
content_title = target.title
input =
contents.map { |item| "(#{item[:id]} #{item[:poster]} said: #{item[:text]} " }.join
prompt = DiscourseAi::Completions::Prompt.new(<<~TEXT.strip)
You are an advanced summarization bot that generates concise, coherent summaries of provided text.
- Only include the summary, without any additional commentary.
- You understand and generate Discourse forum Markdown; including links, _italics_, **bold**.
- Maintain the original language of the text being summarized.
- Aim for summaries to be 400 words or less.
- Each post is formatted as "<POST_NUMBER>) <USERNAME> <MESSAGE>"
- Cite specific noteworthy posts using the format [NAME](#{resource_path}/POST_NUMBER)
- Example: link to the 3rd post by sam: [sam](#{resource_path}/3)
- Example: link to the 6th post by jane: [agreed with](#{resource_path}/6)
- Example: link to the 13th post by joe: [#13](#{resource_path}/13)
- When formatting usernames either use @USERNMAE OR [USERNAME](#{resource_path}/POST_NUMBER)
TEXT
prompt.push(
type: :user,
content:
"Here are the posts inside <input></input> XML tags:\n\n<input>1) user1 said: I love Mondays 2) user2 said: I hate Mondays</input>\n\nGenerate a concise, coherent summary of the text above maintaining the original language.",
)
prompt.push(
type: :model,
content:
"Two users are sharing their feelings toward Mondays. [user1](#{resource_path}/1) hates them, while [user2](#{resource_path}/2) loves them.",
)
prompt.push(type: :user, content: <<~TEXT.strip)
#{content_title.present? ? "The discussion title is: " + content_title + ".\n" : ""}
Here are the posts, inside <input></input> XML tags:
<input>
#{input}
</input>
Generate a concise, coherent summary of the text above maintaining the original language.
TEXT
prompt
end
private
attr_reader :topic
def best_replies
Post
.summary(target.id)
.where("post_type = ?", Post.types[:regular])
.where("NOT hidden")
.joins(:user)
.order(:post_number)
end
def pick_selection
posts =
Post
.where(topic_id: target.id)
.where("post_type = ?", Post.types[:regular])
.where("NOT hidden")
.order(:post_number)
post_numbers = posts.limit(5).pluck(:post_number)
post_numbers += posts.reorder("posts.score desc").limit(50).pluck(:post_number)
post_numbers += posts.reorder("post_number desc").limit(5).pluck(:post_number)
Post
.where(topic_id: target.id)
.joins(:user)
.where("post_number in (?)", post_numbers)
.order(:post_number)
end
end
end
end
end