Commit Graph

67 Commits

Author SHA1 Message Date
Sam 4570118a63
FIX: search index duplicate parser matching is too restrictive (#20129)
Previous regex did not allow for cases where a lexeme contains a : (colon)

This can happen when parsing URLs. New algorithm allows for this.
Test was amended to more clearly call out index problems
2023-02-02 12:17:19 +11:00
Sam 07679888c8
FEATURE: allow restricting duplication in search index (#20062)
* FEATURE: allow restricting duplication in search index

This introduces the site setting `max_duplicate_search_index_terms`.
Using this number we limit the amount of duplication in our search index.

This allows us to more correctly weight title searches, so bloated posts
don't unfairly bump to the top of search results.

This feature is completely disabled by default and behind a site setting

We will experiment with it first. Note entire search index must be rebuilt
for it to take effect.


---------

Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
2023-01-31 12:41:31 +11:00
Daniel Waterworth 666536cbd1
DEV: Prefer \A and \z over ^ and $ in regexes (#19936) 2023-01-20 12:52:49 -06:00
David Taylor 5a003715d3
DEV: Apply syntax_tree formatting to `app/*` 2023-01-09 14:14:59 +00:00
Penar Musaraj ebdfc536dd
Revert "FEATURE: Include participants in PN search data (#16855)" (#16904)
This reverts commit 71c74a262d.
2022-05-25 15:08:36 +10:00
Penar Musaraj 71c74a262d
FEATURE: Include participants in PN search data (#16855)
This makes it easier to find PMs involving a particular user, for
example by searching for `in:messages thisUser` (previously, that query
would only return results in posts where `thisUser` was in the post body).
2022-05-18 10:34:01 -04:00
Penar Musaraj df10a27067
FIX: Exclude automatic anchors from search index (#16396) 2022-04-06 16:06:45 -04:00
Daniel Waterworth 6e9a068e44
FIX: Limit max word length in search index (#16380)
Long words bloat the index for little benefit.
2022-04-06 12:23:30 -05:00
Bianca Nenciu 34b4b53bac
FEATURE: Use Postgres unaccent to ignore accents (#16100)
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
2022-03-07 23:03:10 +02:00
Dan Ungureanu 820fea835c
FIX: Further reduce the input of to_tsvector (#15716)
Random strings can result into much longer tsvectors. For example
parsing a Base64 string of ~600kb can result in a tsvector of over 1MB,
which is the maximum size of a tsvector.

Follow-up-to: 823c3f09d4
2022-02-07 23:03:01 +02:00
Alan Guo Xiang Tan 77137c5d29 FIX: Single line emojis has emoji metadata indexed twice.
This commit fixes a bug where we our `HTMLScrubber` was only searching
for emoji img tags which contains only the "emoji" class. However, our emoji image tags
may contain more than just the "emoji" class like "only-emoji" when an
emoji exists by itself on a single line.
2022-01-24 14:03:17 +08:00
Dan Ungureanu 823c3f09d4
FIX: Reduce input of to_tsvector to follow limits (#13806)
Long posts may have `cooked` fields that produce tsvectors longer than
the maximum size of 1MiB (1,048,576 bytes). This commit uses just the
first million characters of the scrubbed cooked text for indexing.

Reducing the size to exactly 1MB (1_048_576) is not sufficient because
sometimes the output tsvector may be longer than the input and this
gives us some breathing room.
2021-07-28 18:25:14 +03:00
Josh Soref 59097b207f
DEV: Correct typos and spelling mistakes (#12812)
Over the years we accrued many spelling mistakes in the code base. 

This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change 

- comments
- test descriptions
- other low risk areas
2021-05-21 11:43:47 +10:00
Krzysztof Kotlarek e29605b79f
FEATURE: the ability to search users by custom fields (#12762)
When the admin creates a new custom field they can specify if that field should be searchable or not.

That setting is taken into consideration for quick search results.
2021-04-27 15:52:45 +10:00
Bianca Nenciu c10df4b58d
FIX: Make HTML scrubber work with deep HTML (#12619)
SearchIndexer and ReindexSearch used to explode for posts with very
deep or invalid HTML content.
2021-04-07 17:02:00 +10:00
Guo Xiang Tan 650da7b626 PERF: Update index for category in a background job.
Search indexing can get expensive and there is no need for us to block
the entire request just to wait for index to finish.
2020-11-09 13:51:26 +08:00
Guo Xiang Tan d12d8fb7fd
DEV: Include more information when reporting search indexing failures. 2020-08-21 11:02:00 +08:00
Guo Xiang Tan 337f062f0f
PERF: Defer indexing post for search when saving a post.
Indexing a post for search is slow and there is no reason for us to have
to block saving a post due to search indexing.
2020-08-21 07:52:43 +08:00
Guo Xiang Tan 92b7fe4c62
PERF: Add partial index for non-pm search. 2020-08-18 15:55:08 +08:00
Guo Xiang Tan 8b811533b1
DEV: Improve readability of setting weights in `SearchIndexer`. 2020-08-14 23:11:41 +08:00
Guo Xiang Tan 5819c4cb3b
PERF: Switch to ActiveRecord's upsert in `SearchIndexer`.
On insertion, it uses a single query instead of 2.
2020-08-14 16:15:14 +08:00
Guo Xiang Tan 255b0e9f14
PERF: Replace video and audio links in search blurb while indexing.
In the near future, we will be swtiching to PG headlines to generate the
search blurb. As such, we need to replace audio and video links in the
raw data used for headline generation. This also means that we avoid
replacing links each time we need to generate the blurb.
2020-08-06 12:25:03 +08:00
Guo Xiang Tan 15e9057ec5
FIX: Reduce number of terms injected for host lexeme.
We do prefix matching in search so there is no need to inject the extra
terms.

Before:
```
"'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```

After:
```
"'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```
2020-07-27 15:29:59 +08:00
Guo Xiang Tan 0f53ad58c2
FIX: Improve regexp for matching version lexeme.
Follow up to b70f1084f7
2020-07-27 15:18:27 +08:00
Guo Xiang Tan b70f1084f7
FIX: Don't inject extra terms for version lexeme. 2020-07-27 14:46:44 +08:00
Guo Xiang Tan 181c4eb760 PERF: Avoid parsing `Post#cooked` with Nokogiri for every search. 2020-07-24 10:43:09 +08:00
Guo Xiang Tan 3766122a82
DEV: Allow developmental post search index versions. 2020-07-23 15:19:46 +08:00
Guo Xiang Tan 609ba50fe8
DEV: Add more granularity to `SearchIndexer` versions.
Sometimes, we just want to reindex a specific model and not all the
things.
2020-07-23 14:24:06 +08:00
Guo Xiang Tan ff7678e210
FIX: Reindex posts when `Topic#title` or `Category#name` changes. 2020-07-17 11:12:31 +08:00
Guo Xiang Tan 5c230266d3
FIX: Inject extra lexemes for host lexeme.
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexeme by splitting
the host on `.`. The actual tsvector stored will look something like

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
2020-07-14 15:32:40 +08:00
Sam Saffron 88459e08c9
FEATURE: allow disabling of extra term injection in search
There is a feature in search where we take over from the tokenizer
in postgres and attempt to inject more words into search.

So for example: sam.i.am will inject the words i and am.

This is not ideal cause there are many edge cases and this can
cause extreme index bloat.

This is an opening move commit to make it configurable, over the
next few weeks we will evaluate and decide if we disable this by
default or simply remove.
2020-06-25 13:36:52 +10:00
Krzysztof Kotlarek 9bff0882c3
FEATURE: Nokogumbo (#9577)
* FEATURE: Nokogumbo

Use Nokogumbo HTML parser.
2020-05-05 13:46:57 +10:00
Neil Lalonde 875f0d8fd8
FEATURE: Tag synonyms
This feature adds the ability to define synonyms for tags, and the ability to merge one tag into another while keeping it as a synonym. For example, tags named "js" and "java-script" can be synonyms of "javascript". When searching and creating topics using synonyms, they will be mapped to the base tag.

Along with this change is a new UI found on each tag's page (for example, `/tags/javascript`) where more information about the tag can be shown. It will list the synonyms, which categories it's restricted to (if any), and which tag groups it belongs to (if tag group names are public on the `/tags` page by enabling the "tags listed by group" setting). Staff users will be able to manage tags in this UI, merge tags, and add/remove synonyms.
2019-12-04 13:33:51 -05:00
Krzysztof Kotlarek 427d54b2b0 DEV: Upgrading Discourse to Zeitwerk (#8098)
Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains. 

We no longer need to use Rails "require_dependency" anywhere and instead can just use standard 
Ruby patterns to require files.

This is a far reaching change and we expect some followups here.
2019-10-02 14:01:53 +10:00
Sam Saffron 6428aa5b1f FIX: search indexer had various cases where it could fail
Previous to this fix is a post had the test www.test.com/abc it would fail
to index.

This also simplifies the rules to avoid full url parsing which can be
expensive
2019-06-04 16:21:03 +10:00
Dan Ungureanu 7a08e23b4b FIX: Bump search index version. 2019-05-29 08:20:59 +08:00
Gerhard Schlager 876c4f20b3 FIX: Remove duplicate Emoji names from blurb
The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.
2019-04-29 17:26:39 +02:00
Gerhard Schlager 71d19f6e1f FIX: Reduce mentions in blurbs to @username or @groupname
The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names
2019-04-29 17:26:39 +02:00
Guo Xiang Tan d8704c11ca PERF: Better use of index when queueing a topci for search reindex.
Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the
version is actually being used.
2019-04-02 09:53:37 +08:00
Guo Xiang Tan 3cba10b9ca DEV: Don't warn when trying to reindex a post with a deleted topic. 2019-04-01 17:04:32 +08:00
Guo Xiang Tan 2a69ab4a4c FIX: Keep `alt` and `title` in lightbox when indexing for search.
Follow up to cfd507822f
2019-04-01 16:20:33 +08:00
Guo Xiang Tan cfd507822f
PERF: Improve quality of `PostSearchData#raw_data`. (#7275)
This commit fixes the follow quality issue with `PostSearchData#raw_data`:

1. URLs are being tokenized and links with similar href and characters
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Ligthbox being included in search pollutes the
`PostSearchData#raw_data` unncessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms
to scrub plus the indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
Guo Xiang Tan daeda80ada
FIX: Don't index posts with empty `Post#raw` for search. (#7263)
* DEV: Remove unnecessary join in `Jobs::ReindexSearch`.

* FIX: Don't index posts with empty `Post#raw` for search.
2019-04-01 10:06:27 +08:00
Guo Xiang Tan d808f36fc4 FIX: Reindex post for search when post is moved to a different topic.
* This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.
2019-03-19 17:19:14 +08:00
Daniel Hollas cee51672c9 FIX: Strip accents from search query
4481836 introduced accent stipping in search_indexer,
but we need to strip it from the query itself as well

TODO in search with diacritics:
 - Still need to fix excerpts on search page
 - need to support accent stripping in in_topic search
 - need to make sure that in:title works correctly
 - need to fix "word boldening" in titles
2018-10-23 12:10:33 +11:00
David Taylor 9bf522f227
FEATURE: Mixed case tagging (#6454)
- By default, behaviour is not changed: tags are made lowercase upon creation and edit.

- If force_lowercase_tags is disabled, then mixed case tags are allowed.

- Tags must remain case-insensitively unique. This is enforced by ActiveRecord and Postgres.

- A migration is added to provide a `UNIQUE` index on `lower(name)`. Migration includes a safety to correct any current tags that do not meet the criteria.

- A `where_name` scope is added to `models/tag.rb`, to allow easy case-insensitive lookups. This is used instead of `Tag.where(name: "blah")`.

- URLs remain lowercase. Mixed case URLs are functional, but have the lowercase equivalent as the canonical.
2018-10-05 10:23:52 +01:00
Régis Hanol 4481836de2 FEATURE: new 'search_ignore_accents' site setting 2018-09-17 10:42:30 +02:00
Régis Hanol 30619c244c FIX: don't index urls to local files 2018-09-13 18:53:53 +02:00
Sam 9b7cab589a FIX: revert diacritic stripping
See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam
2018-08-31 11:46:55 +10:00
Sam ac11f8df52 correct regression searching with diacritics 2018-08-24 10:00:51 +10:00