discourse

Commit Graph

Author	SHA1	Message	Date
Sam	bd32912c5e	FIX: do not allow title stuffing to dominate search (#21464) We were giving topics with repeated words extra weight in search index. This meant that it was trivial to stuff words into title to dominate in search given we search for exact title matches first. The following tweak means that: `invite invited invites` and `invite some stuff` Both rank the same for title searching. Titles are short and punchy, duplicating words should not give special weight. Requires a full reindex to take effect.	2023-05-10 11:47:58 +10:00
Jan Cernik	afe3e36363	DEV: Remove lazy-yt and replace with lazy-videos (#20722) - Refactors the old plugin to remove jquery usage - Adds support for Vimeo videos (default on) and Tiktok (experimental and default off)	2023-03-29 11:54:25 -04:00
Sam	4a3c13a37b	FIX: search index failing on certain posts (#20736) During search indexing we "stuff" the index with additional keywords for entities that look like domain names. This allows searches for `cnn` to find URLs for `www.cnn.com` The search stuffing attempted to keep indexes aligned at the correct positions by remapping the indexed terms. However under certain edge cases a single word can stem into 2 different lexemes. If this happened we had an off by one which caused the entire indexing to fail. We work around this edge case (and carry incorrect index positions) for cases like this. It is unlikely to impact search quality at all given index position makes almost no difference in the search algorithm.	2023-03-20 15:43:08 +11:00
Sam	cd247d5322	FEATURE: Roll out new search optimisations (#20364) - Reduce duplication of terms in post index from unlimited to 6. This will result in reduced index size and reduced weighting for posts containing a huge amount of duplicate terms. (Eg: a post containing "sam sam sam sam sam sam sam sam", will index as "sam sam sam sam sam sam", only including the word up to 6 times.) This corrects a flaw where title weighting could be ignored. - Prioritize exact matches of words in titles. Our search always performs a prefix match. However we want to give special weight to exact title matches meaning that a search for "sum" will find topics such as "the sum of us" vs "summer in spring". - Pick up fixes to our search algorithm which are missing from old indexes. Specifically pick up the fix that indexes URLs properly. (`https://happy.com` was stemmed to `happi` in keywords and then was not searchable) see also: https://meta.discourse.org/t/refinements-to-search-being-tested-on-meta/254158 Indexing will take a while and work in batches, in the background.	2023-02-20 11:53:35 +11:00
Sam	651476e89e	FIX: domain searches not working properly for URLs (#20136) If a post contains a domain with a word that stems to a non-prefix, single-word searches will not match it. For example: in happy.com, `happy` stems to `happi`. Thus searches for happy will not find URLs that include it. This bloats the index a tiny bit, but impact is limited. Will require a full reindex of search to take effect. When we are done refining search we can consider a full version bump.	2023-02-03 09:55:28 +11:00
Sam	4570118a63	FIX: search index duplicate parser matching is too restrictive (#20129) Previous regex did not allow for cases where a lexeme contains a : (colon) This can happen when parsing URLs. New algorithm allows for this. Test was amended to more clearly call out index problems	2023-02-02 12:17:19 +11:00
Sam	07679888c8	FEATURE: allow restricting duplication in search index (#20062) * FEATURE: allow restricting duplication in search index This introduces the site setting `max_duplicate_search_index_terms`. Using this number we limit the amount of duplication in our search index. This allows us to more correctly weight title searches, so bloated posts don't unfairly bump to the top of search results. This feature is completely disabled by default and behind a site setting We will experiment with it first. Note entire search index must be rebuilt for it to take effect. --------- Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>	2023-01-31 12:41:31 +11:00
Daniel Waterworth	666536cbd1	DEV: Prefer \A and \z over ^ and $ in regexes (#19936)	2023-01-20 12:52:49 -06:00
David Taylor	5a003715d3	DEV: Apply syntax_tree formatting to `app/*`	2023-01-09 14:14:59 +00:00
Penar Musaraj	ebdfc536dd	Revert "FEATURE: Include participants in PN search data (#16855)" (#16904) This reverts commit `71c74a262d`.	2022-05-25 15:08:36 +10:00
Penar Musaraj	71c74a262d	FEATURE: Include participants in PN search data (#16855) This makes it easier to find PMs involving a particular user, for example by searching for `in:messages thisUser` (previously, that query would only return results in posts where `thisUser` was in the post body).	2022-05-18 10:34:01 -04:00
Penar Musaraj	df10a27067	FIX: Exclude automatic anchors from search index (#16396)	2022-04-06 16:06:45 -04:00
Daniel Waterworth	6e9a068e44	FIX: Limit max word length in search index (#16380) Long words bloat the index for little benefit.	2022-04-06 12:23:30 -05:00
Bianca Nenciu	34b4b53bac	FEATURE: Use Postgres unaccent to ignore accents (#16100) The search_ignore_accents site setting can be used to make the search indexer remove the accents before indexing the content. The unaccent function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).	2022-03-07 23:03:10 +02:00
Dan Ungureanu	820fea835c	FIX: Further reduce the input of to_tsvector (#15716) Random strings can result in much longer tsvectors. For example, parsing a Base64 string of ~600kb can result in a tsvector of over 1MB, which is the maximum size of a tsvector. Follow-up-to: `823c3f09d4`	2022-02-07 23:03:01 +02:00
Alan Guo Xiang Tan	77137c5d29	FIX: Single line emojis have emoji metadata indexed twice. This commit fixes a bug where our `HTMLScrubber` was only searching for emoji img tags which contain only the "emoji" class. However, our emoji image tags may contain more than just the "emoji" class, like "only-emoji" when an emoji exists by itself on a single line.	2022-01-24 14:03:17 +08:00
Dan Ungureanu	823c3f09d4	FIX: Reduce input of to_tsvector to follow limits (#13806) Long posts may have `cooked` fields that produce tsvectors longer than the maximum size of 1MiB (1,048,576 bytes). This commit uses just the first million characters of the scrubbed cooked text for indexing. Reducing the size to exactly 1MB (1_048_576) is not sufficient because sometimes the output tsvector may be longer than the input and this gives us some breathing room.	2021-07-28 18:25:14 +03:00
Josh Soref	59097b207f	DEV: Correct typos and spelling mistakes (#12812) Over the years we accrued many spelling mistakes in the code base. This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change - comments - test descriptions - other low risk areas	2021-05-21 11:43:47 +10:00
Krzysztof Kotlarek	e29605b79f	FEATURE: the ability to search users by custom fields (#12762) When the admin creates a new custom field they can specify if that field should be searchable or not. That setting is taken into consideration for quick search results.	2021-04-27 15:52:45 +10:00
Bianca Nenciu	c10df4b58d	FIX: Make HTML scrubber work with deep HTML (#12619) SearchIndexer and ReindexSearch used to explode for posts with very deep or invalid HTML content.	2021-04-07 17:02:00 +10:00
Guo Xiang Tan	650da7b626	PERF: Update index for category in a background job. Search indexing can get expensive and there is no need for us to block the entire request just to wait for index to finish.	2020-11-09 13:51:26 +08:00
Guo Xiang Tan	d12d8fb7fd	DEV: Include more information when reporting search indexing failures.	2020-08-21 11:02:00 +08:00
Guo Xiang Tan	337f062f0f	PERF: Defer indexing post for search when saving a post. Indexing a post for search is slow and there is no reason for us to have to block saving a post due to search indexing.	2020-08-21 07:52:43 +08:00
Guo Xiang Tan	92b7fe4c62	PERF: Add partial index for non-pm search.	2020-08-18 15:55:08 +08:00
Guo Xiang Tan	8b811533b1	DEV: Improve readability of setting weights in `SearchIndexer`.	2020-08-14 23:11:41 +08:00
Guo Xiang Tan	5819c4cb3b	PERF: Switch to ActiveRecord's upsert in `SearchIndexer`. On insertion, it uses a single query instead of 2.	2020-08-14 16:15:14 +08:00
Guo Xiang Tan	255b0e9f14	PERF: Replace video and audio links in search blurb while indexing. In the near future, we will be switching to PG headlines to generate the search blurb. As such, we need to replace audio and video links in the raw data used for headline generation. This also means that we avoid replacing links each time we need to generate the blurb.	2020-08-06 12:25:03 +08:00
Guo Xiang Tan	15e9057ec5	FIX: Reduce number of terms injected for host lexeme. We do prefix matching in search so there is no need to inject the extra terms. Before: ``` "'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B" ``` After: ``` "'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B" ```	2020-07-27 15:29:59 +08:00
Guo Xiang Tan	0f53ad58c2	FIX: Improve regexp for matching version lexeme. Follow up to `b70f1084f7`	2020-07-27 15:18:27 +08:00
Guo Xiang Tan	b70f1084f7	FIX: Don't inject extra terms for version lexeme.	2020-07-27 14:46:44 +08:00
Guo Xiang Tan	181c4eb760	PERF: Avoid parsing `Post#cooked` with Nokogiri for every search.	2020-07-24 10:43:09 +08:00
Guo Xiang Tan	3766122a82	DEV: Allow developmental post search index versions.	2020-07-23 15:19:46 +08:00
Guo Xiang Tan	609ba50fe8	DEV: Add more granularity to `SearchIndexer` versions. Sometimes, we just want to reindex a specific model and not all the things.	2020-07-23 14:24:06 +08:00
Guo Xiang Tan	ff7678e210	FIX: Reindex posts when `Topic#title` or `Category#name` changes.	2020-07-17 11:12:31 +08:00
Guo Xiang Tan	5c230266d3	FIX: Inject extra lexemes for host lexeme. ``` discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org'); alias \| lexemes -------+--------------------- host \| {www.discourse.org} discourse_development=# SELECT TO_TSVECTOR('www.discourse.org'); to_tsvector ----------------------- 'www.discourse.org':1 ``` Given the above lexeme, we will inject additional lexeme by splitting the host on `.`. The actual tsvector stored will look something like ``` tsvector --------------------------------------- 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1 ```	2020-07-14 15:32:40 +08:00
Sam Saffron	88459e08c9	FEATURE: allow disabling of extra term injection in search There is a feature in search where we take over from the tokenizer in postgres and attempt to inject more words into search. So for example: sam.i.am will inject the words i and am. This is not ideal cause there are many edge cases and this can cause extreme index bloat. This is an opening move commit to make it configurable, over the next few weeks we will evaluate and decide if we disable this by default or simply remove.	2020-06-25 13:36:52 +10:00
Krzysztof Kotlarek	9bff0882c3	FEATURE: Nokogumbo (#9577) * FEATURE: Nokogumbo Use Nokogumbo HTML parser.	2020-05-05 13:46:57 +10:00
Neil Lalonde	875f0d8fd8	FEATURE: Tag synonyms This feature adds the ability to define synonyms for tags, and the ability to merge one tag into another while keeping it as a synonym. For example, tags named "js" and "java-script" can be synonyms of "javascript". When searching and creating topics using synonyms, they will be mapped to the base tag. Along with this change is a new UI found on each tag's page (for example, `/tags/javascript`) where more information about the tag can be shown. It will list the synonyms, which categories it's restricted to (if any), and which tag groups it belongs to (if tag group names are public on the `/tags` page by enabling the "tags listed by group" setting). Staff users will be able to manage tags in this UI, merge tags, and add/remove synonyms.	2019-12-04 13:33:51 -05:00
Krzysztof Kotlarek	427d54b2b0	DEV: Upgrading Discourse to Zeitwerk (#8098) Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains. We no longer need to use Rails "require_dependency" anywhere and instead can just use standard Ruby patterns to require files. This is a far reaching change and we expect some followups here.	2019-10-02 14:01:53 +10:00
Sam Saffron	6428aa5b1f	FIX: search indexer had various cases where it could fail Prior to this fix, if a post had the text www.test.com/abc it would fail to index. This also simplifies the rules to avoid full URL parsing, which can be expensive	2019-06-04 16:21:03 +10:00
Dan Ungureanu	7a08e23b4b	FIX: Bump search index version.	2019-05-29 08:20:59 +08:00
Gerhard Schlager	876c4f20b3	FIX: Remove duplicate Emoji names from blurb The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.	2019-04-29 17:26:39 +02:00
Gerhard Schlager	71d19f6e1f	FIX: Reduce mentions in blurbs to @username or @groupname The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names	2019-04-29 17:26:39 +02:00
Guo Xiang Tan	d8704c11ca	PERF: Better use of index when queueing a topic for search reindex. Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the version is actually being used.	2019-04-02 09:53:37 +08:00
Guo Xiang Tan	3cba10b9ca	DEV: Don't warn when trying to reindex a post with a deleted topic.	2019-04-01 17:04:32 +08:00
Guo Xiang Tan	2a69ab4a4c	FIX: Keep `alt` and `title` in lightbox when indexing for search. Follow up to `cfd507822f`	2019-04-01 16:20:33 +08:00
Guo Xiang Tan	cfd507822f	PERF: Improve quality of `PostSearchData#raw_data`. (#7275) This commit fixes the following quality issues with `PostSearchData#raw_data`: 1. URLs are being tokenized and links with similar href and characters are being duplicated in the raw data. `Post#cooked`: ``` <p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p> ``` `PostSearchData#raw_data` Before: ``` This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png ``` `PostSearchData#raw_data` After: ``` This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org ``` 2. Lightbox content being included in search pollutes the `PostSearchData#raw_data` unnecessarily. From 28 March 2018 to 28 March 2019, searches for the term `image` on `meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected because we were using `Pasted image` as the filename for uploads that were pasted. `Post#cooked` ``` <p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p> ``` `PostSearchData#raw_data` Before: ``` This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000 ``` `PostSearchData#raw_data` After: ``` This is a test topic 0 Uncategorized Let me see how I can fix this image ``` In terms of indexing performance, we now have to parse the given HTML through Nokogiri twice. However, performance is not a huge worry here since a string length of 194170 takes only 30ms to scrub, plus the indexing takes place in a background job.	2019-04-01 10:14:29 +08:00
Guo Xiang Tan	daeda80ada	FIX: Don't index posts with empty `Post#raw` for search. (#7263) * DEV: Remove unnecessary join in `Jobs::ReindexSearch`. * FIX: Don't index posts with empty `Post#raw` for search.	2019-04-01 10:06:27 +08:00
Guo Xiang Tan	d808f36fc4	FIX: Reindex post for search when post is moved to a different topic. * This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.	2019-03-19 17:19:14 +08:00
Daniel Hollas	cee51672c9	FIX: Strip accents from search query `4481836` introduced accent stripping in search_indexer, but we need to strip it from the query itself as well. TODO in search with diacritics: - Still need to fix excerpts on search page - need to support accent stripping in in_topic search - need to make sure that in:title works correctly - need to fix "word boldening" in titles	2018-10-23 12:10:33 +11:00
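
Commits `5c230266d3` and `15e9057ec5` describe injecting extra terms for a host lexeme by splitting the host on `.`. The following is only a minimal Ruby sketch of the reduced term set shown in the "After" output of `15e9057ec5`. The method name `host_suffix_terms` is hypothetical and is not taken from `SearchIndexer`; the real implementation may work differently.

```ruby
# Hypothetical helper (an assumption, not Discourse's actual code):
# returns the full host plus every shorter suffix obtained by
# dropping leading labels one at a time.
def host_suffix_terms(host)
  labels = host.split(".")
  (0...labels.length).map { |i| labels[i..-1].join(".") }
end

p host_suffix_terms("test.discourse.org")
# prints ["test.discourse.org", "discourse.org", "org"]
```

Because search performs prefix matching, a query for `discourse` can still match the `discourse.org` term without `discourse` being stored separately, which is the reasoning `15e9057ec5` gives for dropping the extra terms.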


72 Commits
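
Commits `07679888c8` and `cd247d5322` describe limiting how often a term may repeat in the post index (the `max_duplicate_search_index_terms` setting, rolled out with a limit of 6), and `4570118a63` notes that a lexeme can contain a colon. The sketch below illustrates the idea on tsvector text in the form quoted in the commit messages. The names `TERM` and `cap_duplicates` are hypothetical, and treating each position as one duplicate is an assumption rather than a description of the actual parser.

```ruby
# Hypothetical pattern (an assumption): a single-quoted lexeme, which may
# contain colons or doubled quotes, followed by a comma-separated list of
# positions with optional A-D weight suffixes.
TERM = /'((?:[^']|'')+)':(\d+[A-D]?(?:,\d+[A-D]?)*)/

# Keeps at most `max` positions per lexeme and leaves the rest untouched.
def cap_duplicates(tsvector, max)
  tsvector.gsub(TERM) do
    lexeme = Regexp.last_match(1)
    positions = Regexp.last_match(2).split(",").first(max)
    "'#{lexeme}':#{positions.join(',')}"
  end
end

puts cap_duplicates("'sam':1,2,3,4,5,6,7,8 'titl':4A", 6)
# prints 'sam':1,2,3,4,5,6 'titl':4A
```

Allowing any non-quote character inside the lexeme is what lets URL-like lexemes containing `:` pass through, which is the restriction `4570118a63` says the previous regex got wrong.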