Commit Graph

17 Commits

Author SHA1 Message Date
Guo Xiang Tan d8704c11ca PERF: Better use of index when queueing a topci for search reindex.
Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the
version is actually being used.
2019-04-02 09:53:37 +08:00
Guo Xiang Tan 2a69ab4a4c FIX: Keep `alt` and `title` in lightbox when indexing for search.
Follow up to cfd507822f
2019-04-01 16:20:33 +08:00
Guo Xiang Tan 16215f9d3b DEV: Correct spec added in cfd507822f.
Remove stub.
2019-04-01 10:32:25 +08:00
Guo Xiang Tan cfd507822f
PERF: Improve quality of `PostSearchData#raw_data`. (#7275)
This commit fixes the follow quality issue with `PostSearchData#raw_data`:

1. URLs are being tokenized and links with similar href and characters
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Ligthbox being included in search pollutes the
`PostSearchData#raw_data` unncessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms
to scrub plus the indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
Guo Xiang Tan daeda80ada
FIX: Don't index posts with empty `Post#raw` for search. (#7263)
* DEV: Remove unnecessary join in `Jobs::ReindexSearch`.

* FIX: Don't index posts with empty `Post#raw` for search.
2019-04-01 10:06:27 +08:00
Penar Musaraj 51e08feb7e DEV: Refactor icons used in lightbox HTML
Uses <svg> elements instead of hacky CSS pseudoelements

Adds a migration to mark posts with lightboxes as needing a rebake
2019-03-22 11:52:06 -04:00
Guo Xiang Tan d808f36fc4 FIX: Reindex post for search when post is moved to a different topic.
* This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.
2019-03-19 17:19:14 +08:00
Régis Hanol 4481836de2 FEATURE: new 'search_ignore_accents' site setting 2018-09-17 10:42:30 +02:00
Régis Hanol 30619c244c FIX: don't index urls to local files 2018-09-13 18:53:53 +02:00
Sam 9b7cab589a FIX: revert diacritic stripping
See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam
2018-08-31 11:46:55 +10:00
Régis Hanol bc7b530b0a FIX: remove diacritics instead of transliterating 2018-08-24 00:38:44 +02:00
Régis Hanol 2fcf2b899e FIX: remove diacritics when tokenizing html for search 2018-08-23 17:13:52 +02:00
Bianca Nenciu 975a72ab7a FEATURE: Make links indexable. (#6285) 2018-08-20 10:39:19 +10:00
Sam 6676bbd38b FEATURE: index YouTube titles in search
Previously we omitted the titles for videos that YouTube provided
2018-04-26 15:46:52 +10:00
Sam 86d12bd44b FEATURE: search within title using in:title
Also

- Significantly improved search ranking, title is treated most strongly
- Adds tag names to the index
- Run search re-indexer more aggressively
- Re-index topic and all posts on category change
2018-02-20 14:41:21 +11:00
Erick Guan 6e59149a77 FIX: rebuild index when engine replaced (#5021) 2017-08-16 07:38:34 -04:00
Sam 0a78ae739d Remove SearchObserver, aim is to remove all observers
rails-observers gem is mostly unmaintained and is a pain to carry forward
new implementation contains significantly less magic as a bonus
2016-12-22 13:13:14 +11:00