In some cases reverse chronological can be very important.
- Oldest post by sam
- Oldest topic by sam
Prior to these new filters we had no way of searching for them.
Now the 2 new orders `order:oldest` and `order:oldest_topic` can be used
to find oldest topics and posts
* Update spec/lib/search_spec.rb
Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
* Update spec/lib/search_spec.rb
Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
---------
Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
We were giving topics with repeated words extra weight in search index.
This meant that it was trivial to stuff words into title to dominate in search
given we search for exact title matches first.
The following tweak means that:
`invite invited invites`
and
`invite some stuff`
Both rank the same for title searching.
Titles are short and punchy, duplicating words should not give special
weight.
Requires a full reindex to take effect.
This new modifier can be used by plugins to modify search ordering.
Specifically plugins such as discourse_solved can amend search ordering
so solved topics bump to the top.
Also correct edge case where low and high sort priority categories did not
order correctly when it came to closed/archived
- Reduce duplication of terms in post index from unlimited to 6. This will
result in reduced index size and reduced weighting for posts containing
a huge amount of duplicate terms. (Eg: a post containing "sam sam sam sam
sam sam sam sam", will index as "sam sam sam sam sam sam", only including
the word up to 6 times.) This corrects a flaw where title weighting could
be ignored.
- Prioritize exact matches of words in titles. Our search always performs
a prefix match. However we want to give special weight to exact title matches
meaning that a search for "sum" will find topics such as "the sum of us" vs
"summer in spring".
- Pick up fixes to our search algorithm which are missing from old indexes.
Specifically pick up the fix that indexes URLs properly. (`https://happy.com`
was stemmed to `happi` in keywords and then was not searchable)
see also:
https://meta.discourse.org/t/refinements-to-search-being-tested-on-meta/254158
Indexing will take a while and work in batches, in the background.
Previously due to an error archived topics were more prominent in search
than closed topics.
This amends our internal logic to ensure archived topics are bumped down
the list.
If a post contains domain with a word that stems to a non prefix single
words will not match it.
For example: in happy.com, `happy` stems to `happi`. Thus searches for happy
will not find URLs with it included.
This bloats the index a tiny bit, but impact is limited.
Will require a full reindex of search to take effect.
When we are done refining search we can consider a full version bump.
Previously to_tsquery would split terms and join with &
In PG 14 terms are split and use <-> which means followed directly by.
In PG 13:
discourse_test=# SELECT to_tsquery('english', '''hello world''');
to_tsquery
---------------------
'hello' & 'world'
(1 row)
In PG 14:
discourse_test=# SELECT to_tsquery('english', '''hello world''');
to_tsquery
---------------------
'hello' <-> 'world'
(1 row)
Change is very unobtrosive, we simply amend our to_tsquery to behave like
it used to behave and make no use of the `<->` operator
More detail at: https://akorotkov.github.io/blog/2021/05/22/pg-14-query-parsing/
Note that plainto_tsquery used elsewhere in Discourse keeps the exact
same function.
This also corrects a faulty test that was passing by a fluke on older
version of PG
The new `prioritize_exact_search_match` can be used to force the search
algorithm to prioritize exact term matches in title when ranking results.
This is scoped narrowly to titles for cases such as a topic titled:
"organisation chart" and a search of "org chart".
If we scoped this wider, all discussion about "org chart" would float to
the top and leave a very common title de-prioritized.
This is a hidden site setting and it has some performance impact due
to double ranking.
That said, performance impact is somewhat mitigated cause ranking on
title alone is a very cheap operation.
* FEATURE: allow restricting duplication in search index
This introduces the site setting `max_duplicate_search_index_terms`.
Using this number we limit the amount of duplication in our search index.
This allows us to more correctly weight title searches, so bloated posts
don't unfairly bump to the top of search results.
This feature is completely disabled by default and behind a site setting
We will experiment with it first. Note entire search index must be rebuilt
for it to take effect.
---------
Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
Many users seems surprised by prefix matching in search leading to
unexpected results.
Over the years we always would return results starting with a search term
and not expect exact matches.
Meaning a search for `abra` would find `abracadabra`
This introduces the Site Setting `enable_search_prefix_matching` which
defaults to true. (behavior unchanged)
We plan to experiment on select sites with exact matches to see if the
results are less surprising
* DEV: Remove enable_whispers site setting
Whispers are enabled as long as there is at least one group allowed to
whisper, see whispers_allowed_groups site setting.
* DEV: Always enable whispers for admins if at least one group is allowed.
The tsquery used for searching is generated using both functions from
Ruby and Postgresql (for example, unaccent function). Depending on the
term used, it generated an invalid tsquery. For example "can’t"
generated "''can''t''" instead of "''can''''t''".
cf. e62e93f83a
This PR also makes it so `bot` (negative ID) and `system` users are always allowed
to send PMs, since the old conditional was just based on `enable_personal_messages`
Before, whispers were only available for staff members.
Config has been changed to allow to configure privileged groups with access to whispers. Post migration was added to move from the old setting into the new one.
I considered having a boolean column `whisperer` on user model similar to `admin/moderator` for performance reason. Finally, I decided to keep looking for groups as queries are only done for current user and didn't notice any N+1 queries.
When searching for PMs or PMs in a group inbox, results in the header search were not being limited to 5 with a "More" link to the full page search. This PR fixes that.
It also simplifies the logic and updates the search API docs to include recently added `in:messages` and `group_messages:groupname` options.
This commit migrates all bookmarks to be polymorphic (using the
bookmarkable_id and bookmarkable_type) columns. It also deletes
all the old code guarded behind the use_polymorphic_bookmarks setting
and changes that setting to true for all sites and by default for
the sake of plugins.
No data is deleted in the migrations, the old post_id and for_topic
columns for bookmarks will be dropped later on.
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
It's very easy to forget to add `require 'rails_helper'` at the top of every core/plugin spec file, and omissions can cause some very confusing/sporadic errors.
By setting this flag in `.rspec`, we can remove the need for `require 'rails_helper'` entirely.
* Chinese segmenetation will continue to rely on cppjieba
* Japanese segmentation will use our port of TinySegmenter
* Korean currently does not rely on segmentation which was dropped in c677877e4f
* SiteSetting.search_tokenize_chinese_japanese_korean has been split
into SiteSetting.search_tokenize_chinese and
SiteSetting.search_tokenize_japanese respectively
Searching in a category looked only one level down, ignoring the site
setting max_category_nesting. The user interface did not support the
third level of categories and did not display them in the "Categorized"
input of the advanced search options.
Over the years we have found that a few communities never discovered tags.
Instead of having them default off we now have them default on, ensuring
that everyone finds out about them.
Co-authored-by: Dan Ungureanu <dan@ungureanu.me>
When the admin creates a new custom field they can specify if that field should be searchable or not.
That setting is taken into consideration for quick search results.
PG already handles English stop words, the list in cppjieba is
bigger than the list PG uses, which in turn causes confusion cause
words such as "volume" are stripped using cppijieba stop word list
We will follow up with another commit here to apply the Chinese
word stopwords, but for now to eliminate the confusion we are
skipping applying the stopword list when the dictionary in PG is
in English.