Previously we used the raw data indexed to generate blurbs even for cases
when Chinese/Korean/Japanese text was used.
This caused superfluous spaces to show up in excerpts.
Prior to this change, we had weights for very_high, high, low and
very_low. This means there were 4 weights to tweak and what weights to
use for `very_high/high` and `very_low/low` pair was hard to explain.
This change makes it such that `very_high` search priority will always
ensure that the posts are ranked at the top while `very_low` search
priority will ensure that the posts are ranked at the very bottom.
You can now use `@me` to search for posts created by yourself, this is particularly handy if you have a long username.
`@me rainbow` will find all posts you created with the word rainbow.
Also cleans up test suite so it has no warnings.
After the search term is parsed for advanced search filters, the term may
become empty. Later, the same term will be passed to Discourse.route_for
which will raise an ArgumentError.
> URI(nil)
ArgumentError: bad argument (expected URI object or URI string)
* PERF: avoid lookbehinds when indexing search
Previously we used a `EmailCook.url_regexp` this regex used lookbehinds
Unfortunately certain strings could lead to pathological behavior causing
CPU to skyrocket and regex replace to take a very very long time.
EmailCook still needs a fix, but it is less urgent cause it already splits
to single lines. That said we will correct that as well in a seperate PR.
New implementation is far more naive and relies on the extra spaces search
indexer inserts.
In c6ceda8c, a bug was introduced where an admin searching for his own
private messages will actually end up searching through all private
messages on the site.
Follow-up to c6ceda8c4e
With the addition of `PostSearchData#private_message`, a partial
index consisting of only search data from regular posts can be created.
The partial index helps to speed up searches on large sites since PG
will not have to do an index scan on the entire search data index which
has shown to be a bottle neck.
The filter noops if an incorrect username is passed. This filter is not
exposed as part of the UI but is only used when an admin transitions
from a search within a user's personal messages to the full page search.
Follow-up to 4b30799054.
Note the following query being generated where the filter for a user's
private messages is executed twice.
```sql
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id", (TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''test'':*ABCD'),
0|32
)
* (
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
) rank, topics.bumped_at topic_bumped_at FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3) AND (topics.visible) AND (topics.archetype = 'private_message' AND post_search_data.private_message) AND (posts.topic_id IN (SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 99999
UNION ALL
SELECT tg.topic_id
FROM topic_allowed_groups tg
JOIN group_users gu ON gu.user_id = 99999 AND gu.group_id = tg.group_id
)) AND (post_search_data.search_data @@ TO_TSQUERY('english', '''test'':*ABCD')) AND (posts.topic_id IN (SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 99999
UNION ALL
SELECT tg.topic_id
FROM topic_allowed_groups tg
JOIN group_users gu ON gu.user_id = 99999 AND gu.group_id = tg.group_id
)) AND ((categories.id IS NULL) OR (NOT categories.read_restricted) OR (categories.id IN (999999))) ORDER BY rank DESC, topic_bumped_at DESC
```
Renamed from `private_messages` to `personal_messages` without
deprecation because the `private_messages` advanced search filter never
worked in the first place when it was implemented.
Similar to `advanced_filter` I introduced `advanced_order`.
I needed a new option because default orders are evaluated after advanced_filter so I couldn't use it.
Also, that part is a little bit more generic
```
elsif word =~ /order:\w+/
@order = word.gsub('order:', '').to_sym
nil
```
After those changes, I can use them in plugins in this way:
```
Search.advanced_order(:votes) do |posts|
posts.reorder("COALESCE((SELECT dvvc.counter FROM discourse_voting_vote_counters dvvc WHERE dvvc.topic_id = subquery.topic_id), 0) DESC")
end
```
This changes PG text search to only match the given title against
lexemes that are formed from the title. Likewise, the given raw will
only be matched against lexemes that are formed from the post's raw.
Follow up to d8c796bc4.
Note that his change increases query time by around 40% in the following
benchmark against `dev.discourse.org` but this is a tradeoff that has to be taken so that relevance
search is accurate.
```
require 'benchmark/ips'
Benchmark.ips do |x|
x.config(time: 10, warmup: 2)
x.report("current aggregate search query") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT topics.id, min(posts.post_number) post_number FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted)) GROUP BY topics.id ORDER BY MAX((
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
) DESC, topics.bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.report("current aggregate search query with proper ranking") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT subquery.topic_id id, (ARRAY_AGG(subquery.post_number ORDER BY rank DESC, bumped_at DESC))[1] post_number, MAX(subquery.rank) rank, MAX(subquery.bumped_at) bumped_at FROM (SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id", (
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
rank, topics.bumped_at bumped_at FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted))) subquery GROUP BY subquery.topic_id ORDER BY rank DESC, bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.compare!
end
```
```
Warming up --------------------------------------
current aggregate search query
1.000 i/100ms
current aggregate search query with proper ranking
1.000 i/100ms
Calculating -------------------------------------
current aggregate search query
18.040 (± 0.0%) i/s - 181.000 in 10.035241s
current aggregate search query with proper ranking
12.992 (± 0.0%) i/s - 130.000 in 10.007214s
Comparison:
current aggregate search query: 18.0 i/s
current aggregate search query with proper ranking: 13.0 i/s - 1.39x (± 0.00) slower
```
Indexing query strings in URLS produces inconsistent results in PG and
pollutes the search data for really little gain.
The following seems to work as expected...
```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
to_tsvector
------------------------------------------------------
'2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```
However, once a path is present
```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
to_tsvector
----------------------------------------------------------------------------------------------
'/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```
The lexeme contains both the path and the query string.
Previously, we would only take either the `MIN` or `MAX` for
`post_number` during aggregation meaning that the ranking is not
considered.
```
require 'benchmark/ips'
Benchmark.ips do |x|
x.config(time: 10, warmup: 2)
x.report("current aggregate search query") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT topics.id, min(posts.post_number) post_number FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted)) GROUP BY topics.id ORDER BY MAX((
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
) DESC, topics.bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.report("current aggregate search query with proper ranking") do
DB.exec <<~SQL
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id" FROM "posts" JOIN (SELECT *, row_number() over() row_number FROM (SELECT subquery.topic_id id, (ARRAY_AGG(subquery.post_number))[1] post_number, MAX(subquery.rank) rank, MAX(subquery.bumped_at) bumped_at FROM (SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like_count", "posts"."incoming_link_count", "posts"."bookmark_count", "posts"."score", "posts"."reads", "posts"."post_type", "posts"."sort_order", "posts"."last_editor_id", "posts"."hidden", "posts"."hidden_reason_id", "posts"."notify_moderators_count", "posts"."spam_count", "posts"."illegal_count", "posts"."inappropriate_count", "posts"."last_version_at", "posts"."user_deleted", "posts"."reply_to_user_id", "posts"."percent_rank", "posts"."notify_user_count", "posts"."like_score", "posts"."deleted_by_id", "posts"."edit_reason", "posts"."word_count", "posts"."version", "posts"."cook_method", "posts"."wiki", "posts"."baked_at", "posts"."baked_version", "posts"."hidden_at", "posts"."self_edits", "posts"."reply_quoted", "posts"."via_email", "posts"."raw_email", "posts"."public_version", "posts"."action_code", "posts"."locked_by_id", "posts"."image_upload_id", (
TS_RANK_CD(
post_search_data.search_data,
TO_TSQUERY('english', '''postgres'':*ABCD'),
1|32
) *
(
CASE categories.search_priority
WHEN 2
THEN 0.6
WHEN 3
THEN 0.8
WHEN 4
THEN 1.2
WHEN 5
THEN 1.4
ELSE
CASE WHEN topics.closed
THEN 0.9
ELSE 1
END
END
)
)
rank, topics.bumped_at bumped_at FROM "posts" INNER JOIN "post_search_data" ON "post_search_data"."post_id" = "posts"."id" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) LEFT JOIN categories ON categories.id = topics.category_id WHERE ("posts"."deleted_at" IS NULL) AND "posts"."post_type" IN (1, 2, 3, 4) AND (topics.visible) AND (topics.archetype <> 'private_message') AND (post_search_data.search_data @@ TO_TSQUERY('english', '''postgres'':*ABCD')) AND (categories.id NOT IN (
SELECT categories.id WHERE categories.search_priority = 1
)
) AND ((categories.id IS NULL) OR (NOT categories.read_restricted))) subquery GROUP BY subquery.topic_id ORDER BY rank DESC, bumped_at DESC LIMIT 51 OFFSET 0) xxx) x ON x.id = posts.topic_id AND x.post_number = posts.post_number WHERE ("posts"."deleted_at" IS NULL) ORDER BY row_number;
SQL
end
x.compare!
end
```
```
Warming up --------------------------------------
current aggregate search query
1.000 i/100ms
current aggregate search query with proper ranking
1.000 i/100ms
Calculating -------------------------------------
current aggregate search query
17.726 (± 0.0%) i/s - 178.000 in 10.045107s
current aggregate search query with proper ranking
17.802 (± 0.0%) i/s - 178.000 in 10.002230s
Comparison:
current aggregate search query with proper ranking: 17.8 i/s
current aggregate search query: 17.7 i/s - 1.00x (± 0.00) slower
```
In 91c89df6, I fixed the onebox to support local topics with a slug-less URL.
This commit fixes all the other spots (search, topic links and user badges) where we look up for a local topic.
Follow-up-to: 91c89df6
The logic of adding additional search results does not seem to be
needed anymore.
It appears to be a relic of an old implementation.
This saves an entire search query for every search made.