Commit Graph

55 Commits

Author SHA1 Message Date
Sam 4a3c13a37b
FIX: search index failing on certain posts (#20736)
During search indexing we "stuff" the index with additional keywords for
entities that look like domain names.

This allows searches for `cnn` to find URLs for `www.cnn.com`

The search stuffing attempted to keep indexes aligned at the correct positions
by remapping the indexed terms. However under certain edge cases a single
word can stem into 2 different lexemes. If this happened we had an off by
one which caused the entire indexing to fail.

We work around this edge case (and carry incorrect index positions) for cases
like this. It is unlikely to impact search quality at all given index position
makes almost no difference in the search algorithm.
2023-03-20 15:43:08 +11:00
Ted Johansson 39c2f63b35 SECURITY: Add FinalDestination::FastImage that's SSRF safe 2023-03-16 15:27:09 -06:00
Sam 651476e89e
FIX: domain searches not working properly for URLs (#20136)
If a post contains domain with a word that stems to a non prefix single
words will not match it.

For example: in happy.com, `happy` stems to `happi`. Thus searches for happy
will not find URLs with it included.

This bloats the index a tiny bit, but impact is limited.

Will require a full reindex of search to take effect. 

When we are done refining search we can consider a full version bump.
2023-02-03 09:55:28 +11:00
Sam 4570118a63
FIX: search index duplicate parser matching is too restrictive (#20129)
Previous regex did not allow for cases where a lexeme contains a : (colon)

This can happen when parsing URLs. New algorithm allows for this.
Test was amended to more clearly call out index problems
2023-02-02 12:17:19 +11:00
Sam 07679888c8
FEATURE: allow restricting duplication in search index (#20062)
* FEATURE: allow restricting duplication in search index

This introduces the site setting `max_duplicate_search_index_terms`.
Using this number we limit the amount of duplication in our search index.

This allows us to more correctly weight title searches, so bloated posts
don't unfairly bump to the top of search results.

This feature is completely disabled by default and behind a site setting

We will experiment with it first. Note entire search index must be rebuilt
for it to take effect.


---------

Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>
2023-01-31 12:41:31 +11:00
David Taylor cb932d6ee1
DEV: Apply syntax_tree formatting to `spec/*` 2023-01-09 11:49:28 +00:00
Phil Pirozhkov 493d437e79
Add RSpec 4 compatibility (#17652)
* Remove outdated option

04078317ba

* Use the non-globally exposed RSpec syntax

https://github.com/rspec/rspec-core/pull/2803

* Use the non-globally exposed RSpec syntax, cont

https://github.com/rspec/rspec-core/pull/2803

* Comply to strict predicate matchers

See:
 - https://github.com/rspec/rspec-expectations/pull/1195
 - https://github.com/rspec/rspec-expectations/pull/1196
 - https://github.com/rspec/rspec-expectations/pull/1277
2022-07-28 10:27:38 +08:00
Penar Musaraj ebdfc536dd
Revert "FEATURE: Include participants in PN search data (#16855)" (#16904)
This reverts commit 71c74a262d.
2022-05-25 15:08:36 +10:00
Penar Musaraj 71c74a262d
FEATURE: Include participants in PN search data (#16855)
This makes it easier to find PMs involving a particular user, for
example by searching for `in:messages thisUser` (previously, that query
would only return results in posts where `thisUser` was in the post body).
2022-05-18 10:34:01 -04:00
Penar Musaraj df10a27067
FIX: Exclude automatic anchors from search index (#16396) 2022-04-06 16:06:45 -04:00
Daniel Waterworth 6e9a068e44
FIX: Limit max word length in search index (#16380)
Long words bloat the index for little benefit.
2022-04-06 12:23:30 -05:00
Bianca Nenciu 34b4b53bac
FEATURE: Use Postgres unaccent to ignore accents (#16100)
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
2022-03-07 23:03:10 +02:00
David Taylor c9dab6fd08
DEV: Automatically require 'rails_helper' in all specs (#16077)
It's very easy to forget to add `require 'rails_helper'` at the top of every core/plugin spec file, and omissions can cause some very confusing/sporadic errors.

By setting this flag in `.rspec`, we can remove the need for `require 'rails_helper'` entirely.
2022-03-01 17:50:50 +00:00
Ayke Halder 5ff3a9c4bb
DEV: add native lazy loading for emojis (#15830) 2022-02-09 12:18:59 +01:00
Alan Guo Xiang Tan 77137c5d29 FIX: Single line emojis has emoji metadata indexed twice.
This commit fixes a bug where we our `HTMLScrubber` was only searching
for emoji img tags which contains only the "emoji" class. However, our emoji image tags
may contain more than just the "emoji" class like "only-emoji" when an
emoji exists by itself on a single line.
2022-01-24 14:03:17 +08:00
Natalie Tay 4c46c7e334
DEV: Remove xlink hrefs (#15059) 2021-11-25 15:22:43 +11:00
Arpit Jalan d1fc759ac4
FIX: remove 'crawl_images' site setting (#14646) 2021-10-19 17:12:29 +05:30
Krzysztof Kotlarek 354c939656
FIX: remove Nokogumbo references (#13951)
Specs broken after f4720205c0
2021-08-05 11:46:25 +10:00
Josh Soref 59097b207f
DEV: Correct typos and spelling mistakes (#12812)
Over the years we accrued many spelling mistakes in the code base. 

This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change 

- comments
- test descriptions
- other low risk areas
2021-05-21 11:43:47 +10:00
Penar Musaraj 29f3621f45
FIX: Disable lightboxing of animated images (#13099) 2021-05-20 15:19:44 -04:00
Krzysztof Kotlarek e29605b79f
FEATURE: the ability to search users by custom fields (#12762)
When the admin creates a new custom field they can specify if that field should be searchable or not.

That setting is taken into consideration for quick search results.
2021-04-27 15:52:45 +10:00
Bianca Nenciu c10df4b58d
FIX: Make HTML scrubber work with deep HTML (#12619)
SearchIndexer and ReindexSearch used to explode for posts with very
deep or invalid HTML content.
2021-04-07 17:02:00 +10:00
Sam 3c678df942
PERF: avoid lookbehinds when indexing search (#10862)
* PERF: avoid lookbehinds when indexing search

Previously we used a `EmailCook.url_regexp` this regex used lookbehinds

Unfortunately certain strings could lead to pathological behavior causing
CPU to skyrocket and regex replace to take a very very long time.

EmailCook still needs a fix, but it is less urgent cause it already splits
to single lines. That said we will correct that as well in a seperate PR.

New implementation is far more naive and relies on the extra spaces search
indexer inserts.
2020-10-08 11:40:13 +11:00
Guo Xiang Tan 92b7fe4c62
PERF: Add partial index for non-pm search. 2020-08-18 15:55:08 +08:00
Guo Xiang Tan 255b0e9f14
PERF: Replace video and audio links in search blurb while indexing.
In the near future, we will be swtiching to PG headlines to generate the
search blurb. As such, we need to replace audio and video links in the
raw data used for headline generation. This also means that we avoid
replacing links each time we need to generate the blurb.
2020-08-06 12:25:03 +08:00
Guo Xiang Tan 15e9057ec5
FIX: Reduce number of terms injected for host lexeme.
We do prefix matching in search so there is no need to inject the extra
terms.

Before:
```
"'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```

After:
```
"'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```
2020-07-27 15:29:59 +08:00
Guo Xiang Tan 0f53ad58c2
FIX: Improve regexp for matching version lexeme.
Follow up to b70f1084f7
2020-07-27 15:18:27 +08:00
Guo Xiang Tan b70f1084f7
FIX: Don't inject extra terms for version lexeme. 2020-07-27 14:46:44 +08:00
Guo Xiang Tan 181c4eb760 PERF: Avoid parsing `Post#cooked` with Nokogiri for every search. 2020-07-24 10:43:09 +08:00
Guo Xiang Tan 609ba50fe8
DEV: Add more granularity to `SearchIndexer` versions.
Sometimes, we just want to reindex a specific model and not all the
things.
2020-07-23 14:24:06 +08:00
Guo Xiang Tan 2196d0b9ae
FIX: Strip query from URLs when indexing for search.
Indexing query strings in URLS produces inconsistent results in PG and
pollutes the search data for really little gain.

The following seems to work as expected...

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The lexeme contains both the path and the query string.
2020-07-14 15:32:40 +08:00
Guo Xiang Tan 5c230266d3
FIX: Inject extra lexemes for host lexeme.
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexeme by splitting
the host on `.`. The actual tsvector stored will look something like

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
2020-07-14 15:32:40 +08:00
Sam Saffron 6428aa5b1f FIX: search indexer had various cases where it could fail
Previous to this fix is a post had the test www.test.com/abc it would fail
to index.

This also simplifies the rules to avoid full url parsing which can be
expensive
2019-06-04 16:21:03 +10:00
Gerhard Schlager b788948985 FEATURE: English locale with international date formats
Makes en_US the new default locale
2019-05-20 13:47:20 +02:00
Sam Saffron 4ea21fa2d0 DEV: use #frozen_string_literal: true on all spec
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.

Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
2019-04-30 10:27:42 +10:00
Gerhard Schlager 876c4f20b3 FIX: Remove duplicate Emoji names from blurb
The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.
2019-04-29 17:26:39 +02:00
Gerhard Schlager 71d19f6e1f FIX: Reduce mentions in blurbs to @username or @groupname
The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names
2019-04-29 17:26:39 +02:00
Sam Saffron 45285f1477 DEV: remove update_attributes which is deprecated in Rails 6
See: https://github.com/rails/rails/pull/31998

update_attributes is a relic of the past, it should no longer be used.
2019-04-29 17:32:25 +10:00
Guo Xiang Tan d8704c11ca PERF: Better use of index when queueing a topci for search reindex.
Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the
version is actually being used.
2019-04-02 09:53:37 +08:00
Guo Xiang Tan 2a69ab4a4c FIX: Keep `alt` and `title` in lightbox when indexing for search.
Follow up to cfd507822f
2019-04-01 16:20:33 +08:00
Guo Xiang Tan 16215f9d3b DEV: Correct spec added in cfd507822f.
Remove stub.
2019-04-01 10:32:25 +08:00
Guo Xiang Tan cfd507822f
PERF: Improve quality of `PostSearchData#raw_data`. (#7275)
This commit fixes the follow quality issue with `PostSearchData#raw_data`:

1. URLs are being tokenized and links with similar href and characters
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Ligthbox being included in search pollutes the
`PostSearchData#raw_data` unncessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms
to scrub plus the indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
Guo Xiang Tan daeda80ada
FIX: Don't index posts with empty `Post#raw` for search. (#7263)
* DEV: Remove unnecessary join in `Jobs::ReindexSearch`.

* FIX: Don't index posts with empty `Post#raw` for search.
2019-04-01 10:06:27 +08:00
Penar Musaraj 51e08feb7e DEV: Refactor icons used in lightbox HTML
Uses <svg> elements instead of hacky CSS pseudoelements

Adds a migration to mark posts with lightboxes as needing a rebake
2019-03-22 11:52:06 -04:00
Guo Xiang Tan d808f36fc4 FIX: Reindex post for search when post is moved to a different topic.
* This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.
2019-03-19 17:19:14 +08:00
Régis Hanol 4481836de2 FEATURE: new 'search_ignore_accents' site setting 2018-09-17 10:42:30 +02:00
Régis Hanol 30619c244c FIX: don't index urls to local files 2018-09-13 18:53:53 +02:00
Sam 9b7cab589a FIX: revert diacritic stripping
See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam
2018-08-31 11:46:55 +10:00
Régis Hanol bc7b530b0a FIX: remove diacritics instead of transliterating 2018-08-24 00:38:44 +02:00
Régis Hanol 2fcf2b899e FIX: remove diacritics when tokenizing html for search 2018-08-23 17:13:52 +02:00