Considering document length in search introduced too much variance in
our search results such that it makes certain searches better but at the
same time made certain searches worst. Instead, we want to have a more
determistic way of ranking search so that it is easier to reason about
why a post is rank higher in search than another.
The long term plan to tackle repeated terms is to restrict the number of
positions for a given lexeme in our search index.
Follow up to d8c796bc4.
Note that his change increases query time by around 40% in the following
benchmark against `dev.discourse.org` but this is a tradeoff that has to be taken so that relevance
search is accurate.
```
require 'benchmark/ips'
Benchmark.ips do |x|
x.config(time: 10, warmup: 2)
x.report("current aggregate search query") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT topics.id, min(posts.post_number) post_number FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted)) GROUP BY topics.id ORDER BY MAX((
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
) DESC, topics.bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.report("current aggregate search query with proper ranking") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT subquery.topic_id id, (ARRAY_AGG(subquery.post_number ORDER BY rank DESC, bumped_at DESC))[1] post_number, MAX(subquery.rank) rank, MAX(subquery.bumped_at) bumped_at FROM (SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id", (
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
rank, topics.bumped_at bumped_at FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted))) subquery GROUP BY subquery.topic_id ORDER BY rank DESC, bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.compare!
end
```
```
Warming up --------------------------------------
current aggregate search query
1.000 i/100ms
current aggregate search query with proper ranking
1.000 i/100ms
Calculating -------------------------------------
current aggregate search query
18.040 (± 0.0%) i/s - 181.000 in 10.035241s
current aggregate search query with proper ranking
12.992 (± 0.0%) i/s - 130.000 in 10.007214s
Comparison:
current aggregate search query: 18.0 i/s
current aggregate search query with proper ranking: 13.0 i/s - 1.39x (± 0.00) slower
```
Indexing query strings in URLS produces inconsistent results in PG and
pollutes the search data for really little gain.
The following seems to work as expected...
```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
to_tsvector
------------------------------------------------------
'2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```
However, once a path is present
```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
to_tsvector
----------------------------------------------------------------------------------------------
'/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```
The lexeme contains both the path and the query string.
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
alias | lexemes
-------+---------------------
host | {www.discourse.org}
discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
to_tsvector
-----------------------
'www.discourse.org':1
```
Given the above lexeme, we will inject additional lexeme by splitting
the host on `.`. The actual tsvector stored will look something like
```
tsvector
---------------------------------------
'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
Previously, we would only take either the `MIN` or `MAX` for
`post_number` during aggregation meaning that the ranking is not
considered.
```
require 'benchmark/ips'
Benchmark.ips do |x|
x.config(time: 10, warmup: 2)
x.report("current aggregate search query") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT topics.id, min(posts.post_number) post_number FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted)) GROUP BY topics.id ORDER BY MAX((
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
) DESC, topics.bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.report("current aggregate search query with proper ranking") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT subquery.topic_id id, (ARRAY_AGG(subquery.post_number))[1] post_number, MAX(subquery.rank) rank, MAX(subquery.bumped_at) bumped_at FROM (SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id", (
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
rank, topics.bumped_at bumped_at FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted))) subquery GROUP BY subquery.topic_id ORDER BY rank DESC, bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.compare!
end
```
```
Warming up --------------------------------------
current aggregate search query
1.000 i/100ms
current aggregate search query with proper ranking
1.000 i/100ms
Calculating -------------------------------------
current aggregate search query
17.726 (± 0.0%) i/s - 178.000 in 10.045107s
current aggregate search query with proper ranking
17.802 (± 0.0%) i/s - 178.000 in 10.002230s
Comparison:
current aggregate search query with proper ranking: 17.8 i/s
current aggregate search query: 17.7 i/s - 1.00x (± 0.00) slower
```
On large topics, the cost of sending the entire post ID list back over to the database is signficant. Just have the DB recalculate the list of visible posts instead.
Category and tag hashtags used to be handled differently even though
most of the code was very similar. This design was the root cause of
multiple issues related to hashtags.
This commit reduces the number of requests (just one and debounced
better), removes the use of CSS classes which marked resolved hashtags,
simplifies a lot of the code as there is a single source of truth and
previous race condition fixes are now useless.
It also includes a very minor security fix which let unauthorized users
to guess hidden tags.
When creating a bookmark reminder that deletes the bookmark on reminder, if the user clicked on the notification and got taken to the post in the topic the bookmark icon still showed as blue with the reminder clock indicator. This was because the response JSON for reloading a topic post was not including the bookmark attributes, not even the bookmarked boolean.
We now return the correct attributes in the serializer, and if bookmarked is false we clear all the bookmark related attributes on the post for the notification to make sure nothing of the old bookmark remains in the UI.
This was only a problem if the user did not refresh the app completely inbetween setting the reminder and receiving the notification.
It's a little awkward to test constants by re-assigning them so
I've added a new parameter to `Discourse.find_compatible_resource`
which can be used by tests.
Instead of loading all of the user bookmarks using all the post IDs in a topic, load all the bookmarks for a user using the topic ID. This eliminates a costly WHERE ID IN query.
We have a couple of examples of enormous amounts of text being entered in the name column of bookmarks. This is not desirable...it is just meant to be a short note / reminder of why you bookmarked this.
This PR caps the column at 100 characters and truncates existing names in the database to 100 characters.
* FIX: Improve category hashtag lookup
This commit improves support for sub-sub-categories and does not include
the ID of the category in the slug, which fixes the composer preview.
* FIX: Sub-sub-categories can be mentioned using only two levels
* FIX: Remove support for three-level hashtags
* DEV: Simplify code
Adds a new rake task `plugin:checkout_compatible_all` and
`plugin:checkout_compatible[plugin-name]` that check out compatible plugin
versions.
Supports a .discourse-compatibility file in the root of plugins and themes that
list out a plugin's compatibility with certain discourse versions:
eg: .discourse-compatibility
```
2.5.0.beta6: some-git-hash
2.4.4.beta4: some-git-tag
2.2.0: git-reference
```
This ensures older Discourse installs are able to find and install older
versions of plugins without intervention, through the manifest only.
It iterates through the versions in descending order. If the current Discourse
version matches an item in the manifest, it checks out the listed plugin target.
If the Discourse version is greater than an item in the manifest, it checks out
the next highest version listed in the manifest.
If no versions match, it makes no change.
If any value, including nil, is passed in as an argument the default
won't be set, so we need to handle when a non-Array value is passed in
to the `generate_thumbnails!` method.