Commit Graph

41 Commits

Author SHA1 Message Date
Alan Guo Xiang Tan 77137c5d29 FIX: Single line emojis has emoji metadata indexed twice.
This commit fixes a bug where we our `HTMLScrubber` was only searching
for emoji img tags which contains only the "emoji" class. However, our emoji image tags
may contain more than just the "emoji" class like "only-emoji" when an
emoji exists by itself on a single line.
2022-01-24 14:03:17 +08:00
Natalie Tay 4c46c7e334
DEV: Remove xlink hrefs (#15059) 2021-11-25 15:22:43 +11:00
Arpit Jalan d1fc759ac4
FIX: remove 'crawl_images' site setting (#14646) 2021-10-19 17:12:29 +05:30
Krzysztof Kotlarek 354c939656
FIX: remove Nokogumbo references (#13951)
Specs broken after f4720205c0
2021-08-05 11:46:25 +10:00
Josh Soref 59097b207f
DEV: Correct typos and spelling mistakes (#12812)
Over the years we accrued many spelling mistakes in the code base. 

This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change 

- comments
- test descriptions
- other low risk areas
2021-05-21 11:43:47 +10:00
Penar Musaraj 29f3621f45
FIX: Disable lightboxing of animated images (#13099) 2021-05-20 15:19:44 -04:00
Krzysztof Kotlarek e29605b79f
FEATURE: the ability to search users by custom fields (#12762)
When the admin creates a new custom field they can specify if that field should be searchable or not.

That setting is taken into consideration for quick search results.
2021-04-27 15:52:45 +10:00
Bianca Nenciu c10df4b58d
FIX: Make HTML scrubber work with deep HTML (#12619)
SearchIndexer and ReindexSearch used to explode for posts with very
deep or invalid HTML content.
2021-04-07 17:02:00 +10:00
Sam 3c678df942
PERF: avoid lookbehinds when indexing search (#10862)
* PERF: avoid lookbehinds when indexing search

Previously we used a `EmailCook.url_regexp` this regex used lookbehinds

Unfortunately certain strings could lead to pathological behavior causing
CPU to skyrocket and regex replace to take a very very long time.

EmailCook still needs a fix, but it is less urgent cause it already splits
to single lines. That said we will correct that as well in a seperate PR.

New implementation is far more naive and relies on the extra spaces search
indexer inserts.
2020-10-08 11:40:13 +11:00
Guo Xiang Tan 92b7fe4c62
PERF: Add partial index for non-pm search. 2020-08-18 15:55:08 +08:00
Guo Xiang Tan 255b0e9f14
PERF: Replace video and audio links in search blurb while indexing.
In the near future, we will be swtiching to PG headlines to generate the
search blurb. As such, we need to replace audio and video links in the
raw data used for headline generation. This also means that we avoid
replacing links each time we need to generate the blurb.
2020-08-06 12:25:03 +08:00
Guo Xiang Tan 15e9057ec5
FIX: Reduce number of terms injected for host lexeme.
We do prefix matching in search so there is no need to inject the extra
terms.

Before:
```
"'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```

After:
```
"'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B"
```
2020-07-27 15:29:59 +08:00
Guo Xiang Tan 0f53ad58c2
FIX: Improve regexp for matching version lexeme.
Follow up to b70f1084f7
2020-07-27 15:18:27 +08:00
Guo Xiang Tan b70f1084f7
FIX: Don't inject extra terms for version lexeme. 2020-07-27 14:46:44 +08:00
Guo Xiang Tan 181c4eb760 PERF: Avoid parsing `Post#cooked` with Nokogiri for every search. 2020-07-24 10:43:09 +08:00
Guo Xiang Tan 609ba50fe8
DEV: Add more granularity to `SearchIndexer` versions.
Sometimes, we just want to reindex a specific model and not all the
things.
2020-07-23 14:24:06 +08:00
Guo Xiang Tan 2196d0b9ae
FIX: Strip query from URLs when indexing for search.
Indexing query strings in URLS produces inconsistent results in PG and
pollutes the search data for really little gain.

The following seems to work as expected...

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The lexeme contains both the path and the query string.
2020-07-14 15:32:40 +08:00
Guo Xiang Tan 5c230266d3
FIX: Inject extra lexemes for host lexeme.
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexeme by splitting
the host on `.`. The actual tsvector stored will look something like

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
2020-07-14 15:32:40 +08:00
Sam Saffron 6428aa5b1f FIX: search indexer had various cases where it could fail
Previous to this fix is a post had the test www.test.com/abc it would fail
to index.

This also simplifies the rules to avoid full url parsing which can be
expensive
2019-06-04 16:21:03 +10:00
Gerhard Schlager b788948985 FEATURE: English locale with international date formats
Makes en_US the new default locale
2019-05-20 13:47:20 +02:00
Sam Saffron 4ea21fa2d0 DEV: use #frozen_string_literal: true on all spec
This change both speeds up specs (less strings to allocate) and helps catch
cases where methods in Discourse are mutating inputs.

Overall we will be migrating everything to use #frozen_string_literal: true
it will take a while, but this is the first and safest move in this direction
2019-04-30 10:27:42 +10:00
Gerhard Schlager 876c4f20b3 FIX: Remove duplicate Emoji names from blurb
The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.
2019-04-29 17:26:39 +02:00
Gerhard Schlager 71d19f6e1f FIX: Reduce mentions in blurbs to @username or @groupname
The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names
2019-04-29 17:26:39 +02:00
Sam Saffron 45285f1477 DEV: remove update_attributes which is deprecated in Rails 6
See: https://github.com/rails/rails/pull/31998

update_attributes is a relic of the past, it should no longer be used.
2019-04-29 17:32:25 +10:00
Guo Xiang Tan d8704c11ca PERF: Better use of index when queueing a topci for search reindex.
Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the
version is actually being used.
2019-04-02 09:53:37 +08:00
Guo Xiang Tan 2a69ab4a4c FIX: Keep `alt` and `title` in lightbox when indexing for search.
Follow up to cfd507822f
2019-04-01 16:20:33 +08:00
Guo Xiang Tan 16215f9d3b DEV: Correct spec added in cfd507822f.
Remove stub.
2019-04-01 10:32:25 +08:00
Guo Xiang Tan cfd507822f
PERF: Improve quality of `PostSearchData#raw_data`. (#7275)
This commit fixes the follow quality issue with `PostSearchData#raw_data`:

1. URLs are being tokenized and links with similar href and characters
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Ligthbox being included in search pollutes the
`PostSearchData#raw_data` unncessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms
to scrub plus the indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
Guo Xiang Tan daeda80ada
FIX: Don't index posts with empty `Post#raw` for search. (#7263)
* DEV: Remove unnecessary join in `Jobs::ReindexSearch`.

* FIX: Don't index posts with empty `Post#raw` for search.
2019-04-01 10:06:27 +08:00
Penar Musaraj 51e08feb7e DEV: Refactor icons used in lightbox HTML
Uses <svg> elements instead of hacky CSS pseudoelements

Adds a migration to mark posts with lightboxes as needing a rebake
2019-03-22 11:52:06 -04:00
Guo Xiang Tan d808f36fc4 FIX: Reindex post for search when post is moved to a different topic.
* This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.
2019-03-19 17:19:14 +08:00
Régis Hanol 4481836de2 FEATURE: new 'search_ignore_accents' site setting 2018-09-17 10:42:30 +02:00
Régis Hanol 30619c244c FIX: don't index urls to local files 2018-09-13 18:53:53 +02:00
Sam 9b7cab589a FIX: revert diacritic stripping
See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam
2018-08-31 11:46:55 +10:00
Régis Hanol bc7b530b0a FIX: remove diacritics instead of transliterating 2018-08-24 00:38:44 +02:00
Régis Hanol 2fcf2b899e FIX: remove diacritics when tokenizing html for search 2018-08-23 17:13:52 +02:00
Bianca Nenciu 975a72ab7a FEATURE: Make links indexable. (#6285) 2018-08-20 10:39:19 +10:00
Sam 6676bbd38b FEATURE: index YouTube titles in search
Previously we omitted the titles for videos that YouTube provided
2018-04-26 15:46:52 +10:00
Sam 86d12bd44b FEATURE: search within title using in:title
Also

- Significantly improved search ranking, title is treated most strongly
- Adds tag names to the index
- Run search re-indexer more aggressively
- Re-index topic and all posts on category change
2018-02-20 14:41:21 +11:00
Erick Guan 6e59149a77 FIX: rebuild index when engine replaced (#5021) 2017-08-16 07:38:34 -04:00
Sam 0a78ae739d Remove SearchObserver, aim is to remove all observers
rails-observers gem is mostly unmaintained and is a pain to carry forward
new implementation contains significantly less magic as a bonus
2016-12-22 13:13:14 +11:00