discourse

Commit Graph

Author	SHA1	Message	Date
Sam	4a3c13a37b	FIX: search index failing on certain posts (#20736 ) During search indexing we "stuff" the index with additional keywords for entities that look like domain names. This allows searches for `cnn` to find URLs for `www.cnn.com` The search stuffing attempted to keep indexes aligned at the correct positions by remapping the indexed terms. However under certain edge cases a single word can stem into 2 different lexemes. If this happened we had an off by one which caused the entire indexing to fail. We work around this edge case (and carry incorrect index positions) for cases like this. It is unlikely to impact search quality at all given index position makes almost no difference in the search algorithm.	2023-03-20 15:43:08 +11:00
Ted Johansson	39c2f63b35	SECURITY: Add FinalDestination::FastImage that's SSRF safe	2023-03-16 15:27:09 -06:00
Sam	651476e89e	FIX: domain searches not working properly for URLs (#20136 ) If a post contains a domain with a word whose stem is not a prefix of that word, single-word searches will not match it. For example: in happy.com, `happy` stems to `happi`. Thus searches for happy will not find URLs with it included. This bloats the index a tiny bit, but impact is limited. Will require a full reindex of search to take effect. When we are done refining search we can consider a full version bump.	2023-02-03 09:55:28 +11:00
Sam	4570118a63	FIX: search index duplicate parser matching is too restrictive (#20129 ) Previous regex did not allow for cases where a lexeme contains a : (colon) This can happen when parsing URLs. New algorithm allows for this. Test was amended to more clearly call out index problems	2023-02-02 12:17:19 +11:00
Sam	07679888c8	FEATURE: allow restricting duplication in search index (#20062 ) * FEATURE: allow restricting duplication in search index This introduces the site setting `max_duplicate_search_index_terms`. Using this number we limit the amount of duplication in our search index. This allows us to more correctly weight title searches, so bloated posts don't unfairly bump to the top of search results. This feature is completely disabled by default and behind a site setting We will experiment with it first. Note entire search index must be rebuilt for it to take effect. --------- Co-authored-by: Alan Guo Xiang Tan <gxtan1990@gmail.com>	2023-01-31 12:41:31 +11:00
David Taylor	cb932d6ee1	DEV: Apply syntax_tree formatting to `spec/*`	2023-01-09 11:49:28 +00:00
Phil Pirozhkov	493d437e79	Add RSpec 4 compatibility (#17652 ) * Remove outdated option `04078317ba` * Use the non-globally exposed RSpec syntax https://github.com/rspec/rspec-core/pull/2803 * Use the non-globally exposed RSpec syntax, cont https://github.com/rspec/rspec-core/pull/2803 * Comply to strict predicate matchers See: - https://github.com/rspec/rspec-expectations/pull/1195 - https://github.com/rspec/rspec-expectations/pull/1196 - https://github.com/rspec/rspec-expectations/pull/1277	2022-07-28 10:27:38 +08:00
Penar Musaraj	ebdfc536dd	Revert "FEATURE: Include participants in PN search data (#16855 )" (#16904 ) This reverts commit `71c74a262d`.	2022-05-25 15:08:36 +10:00
Penar Musaraj	71c74a262d	FEATURE: Include participants in PN search data (#16855 ) This makes it easier to find PMs involving a particular user, for example by searching for `in:messages thisUser` (previously, that query would only return results in posts where `thisUser` was in the post body).	2022-05-18 10:34:01 -04:00
Penar Musaraj	df10a27067	FIX: Exclude automatic anchors from search index (#16396 )	2022-04-06 16:06:45 -04:00
Daniel Waterworth	6e9a068e44	FIX: Limit max word length in search index (#16380 ) Long words bloat the index for little benefit.	2022-04-06 12:23:30 -05:00
Bianca Nenciu	34b4b53bac	FEATURE: Use Postgres unaccent to ignore accents (#16100 ) The search_ignore_accents site setting can be used to make the search indexer remove the accents before indexing the content. The unaccent function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).	2022-03-07 23:03:10 +02:00
David Taylor	c9dab6fd08	DEV: Automatically require 'rails_helper' in all specs (#16077 ) It's very easy to forget to add `require 'rails_helper'` at the top of every core/plugin spec file, and omissions can cause some very confusing/sporadic errors. By setting this flag in `.rspec`, we can remove the need for `require 'rails_helper'` entirely.	2022-03-01 17:50:50 +00:00
Ayke Halder	5ff3a9c4bb	DEV: add native lazy loading for emojis (#15830 )	2022-02-09 12:18:59 +01:00
Alan Guo Xiang Tan	77137c5d29	FIX: Single line emojis has emoji metadata indexed twice. This commit fixes a bug where our `HTMLScrubber` was only searching for emoji img tags which contain only the "emoji" class. However, our emoji image tags may contain more than just the "emoji" class, like "only-emoji" when an emoji exists by itself on a single line.	2022-01-24 14:03:17 +08:00
Natalie Tay	4c46c7e334	DEV: Remove xlink hrefs (#15059 )	2021-11-25 15:22:43 +11:00
Arpit Jalan	d1fc759ac4	FIX: remove 'crawl_images' site setting (#14646 )	2021-10-19 17:12:29 +05:30
Krzysztof Kotlarek	354c939656	FIX: remove Nokogumbo references (#13951 ) Specs broken after `f4720205c0`	2021-08-05 11:46:25 +10:00
Josh Soref	59097b207f	DEV: Correct typos and spelling mistakes (#12812 ) Over the years we accrued many spelling mistakes in the code base. This PR attempts to fix spelling mistakes and typos in all areas of the code that are extremely safe to change - comments - test descriptions - other low risk areas	2021-05-21 11:43:47 +10:00
Penar Musaraj	29f3621f45	FIX: Disable lightboxing of animated images (#13099 )	2021-05-20 15:19:44 -04:00
Krzysztof Kotlarek	e29605b79f	FEATURE: the ability to search users by custom fields (#12762 ) When the admin creates a new custom field they can specify if that field should be searchable or not. That setting is taken into consideration for quick search results.	2021-04-27 15:52:45 +10:00
Bianca Nenciu	c10df4b58d	FIX: Make HTML scrubber work with deep HTML (#12619 ) SearchIndexer and ReindexSearch used to explode for posts with very deep or invalid HTML content.	2021-04-07 17:02:00 +10:00
Sam	3c678df942	PERF: avoid lookbehinds when indexing search (#10862 ) * PERF: avoid lookbehinds when indexing search Previously we used `EmailCook.url_regexp`; this regex used lookbehinds. Unfortunately certain strings could lead to pathological behavior, causing CPU to skyrocket and regex replace to take a very long time. EmailCook still needs a fix, but it is less urgent because it already splits to single lines. That said, we will correct that as well in a separate PR. The new implementation is far more naive and relies on the extra spaces the search indexer inserts.	2020-10-08 11:40:13 +11:00
Guo Xiang Tan	92b7fe4c62	PERF: Add partial index for non-pm search.	2020-08-18 15:55:08 +08:00
Guo Xiang Tan	255b0e9f14	PERF: Replace video and audio links in search blurb while indexing. In the near future, we will be switching to PG headlines to generate the search blurb. As such, we need to replace audio and video links in the raw data used for headline generation. This also means that we avoid replacing links each time we need to generate the blurb.	2020-08-06 12:25:03 +08:00
Guo Xiang Tan	15e9057ec5	FIX: Reduce number of terms injected for host lexeme. We do prefix matching in search so there is no need to inject the extra terms. Before: ``` "'discourse':10,11 'discourse.org':10,11 'org':10,11 'test':8A,10,11 'test.discourse.org':10,11 'titl':4A 'uncategor':9B" ``` After: ``` "'discourse.org':10,11 'org':10,11 'test':8A 'test.discourse.org':10,11 'titl':4A 'uncategor':9B" ```	2020-07-27 15:29:59 +08:00
Guo Xiang Tan	0f53ad58c2	FIX: Improve regexp for matching version lexeme. Follow up to `b70f1084f7`	2020-07-27 15:18:27 +08:00
Guo Xiang Tan	b70f1084f7	FIX: Don't inject extra terms for version lexeme.	2020-07-27 14:46:44 +08:00
Guo Xiang Tan	181c4eb760	PERF: Avoid parsing `Post#cooked` with Nokogiri for every search.	2020-07-24 10:43:09 +08:00
Guo Xiang Tan	609ba50fe8	DEV: Add more granularity to `SearchIndexer` versions. Sometimes, we just want to reindex a specific model and not all the things.	2020-07-23 14:24:06 +08:00
Guo Xiang Tan	2196d0b9ae	FIX: Strip query from URLs when indexing for search. Indexing query strings in URLs produces inconsistent results in PG and pollutes the search data for very little gain. The following seems to work as expected... ``` discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3'); to_tsvector ------------------------------------------------------ '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1 ``` However, once a path is present ``` discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3'); to_tsvector ---------------------------------------------------------------------------------------------- '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1 ``` The lexeme contains both the path and the query string.	2020-07-14 15:32:40 +08:00
Guo Xiang Tan	5c230266d3	FIX: Inject extra lexemes for host lexeme. ``` discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org'); alias \| lexemes -------+--------------------- host \| {www.discourse.org} discourse_development=# SELECT TO_TSVECTOR('www.discourse.org'); to_tsvector ----------------------- 'www.discourse.org':1 ``` Given the above lexeme, we will inject additional lexeme by splitting the host on `.`. The actual tsvector stored will look something like ``` tsvector --------------------------------------- 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1 ```	2020-07-14 15:32:40 +08:00
Sam Saffron	6428aa5b1f	FIX: search indexer had various cases where it could fail Prior to this fix, if a post had the text www.test.com/abc it would fail to index. This also simplifies the rules to avoid full URL parsing, which can be expensive	2019-06-04 16:21:03 +10:00
Gerhard Schlager	b788948985	FEATURE: English locale with international date formats Makes en_US the new default locale	2019-05-20 13:47:20 +02:00
Sam Saffron	4ea21fa2d0	DEV: use #frozen_string_literal: true on all spec This change both speeds up specs (less strings to allocate) and helps catch cases where methods in Discourse are mutating inputs. Overall we will be migrating everything to use #frozen_string_literal: true it will take a while, but this is the first and safest move in this direction	2019-04-30 10:27:42 +10:00
Gerhard Schlager	876c4f20b3	FIX: Remove duplicate Emoji names from blurb The blurb contained the value of the alt and title attribute of Emojis. Both values are always the same.	2019-04-29 17:26:39 +02:00
Gerhard Schlager	71d19f6e1f	FIX: Reduce mentions in blurbs to @username or @groupname The link to the user profile or group is useless and the URL encoded username or group name looks awful for Unicode names	2019-04-29 17:26:39 +02:00
Sam Saffron	45285f1477	DEV: remove update_attributes which is deprecated in Rails 6 See: https://github.com/rails/rails/pull/31998 update_attributes is a relic of the past, it should no longer be used.	2019-04-29 17:32:25 +10:00
Guo Xiang Tan	d8704c11ca	PERF: Better use of index when queueing a topic for search reindex. Also move `Search::INDEX_VERSION` to `SearchIndexer` which is where the version is actually being used.	2019-04-02 09:53:37 +08:00
Guo Xiang Tan	2a69ab4a4c	FIX: Keep `alt` and `title` in lightbox when indexing for search. Follow up to `cfd507822f`	2019-04-01 16:20:33 +08:00
Guo Xiang Tan	16215f9d3b	DEV: Correct spec added in `cfd507822f`. Remove stub.	2019-04-01 10:32:25 +08:00
Guo Xiang Tan	cfd507822f	PERF: Improve quality of `PostSearchData#raw_data`. (#7275 ) This commit fixes the following quality issues with `PostSearchData#raw_data`: 1. URLs are being tokenized and links with similar href and characters are being duplicated in the raw data. `Post#cooked`: ``` <p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p> ``` `PostSearchData#raw_data` Before: ``` This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png ``` `PostSearchData#raw_data` After: ``` This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org ``` 2. Lightbox content being included in search pollutes the `PostSearchData#raw_data` unnecessarily. From 28 March 2018 to 28 March 2019, searches for the term `image` on `meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed images are not included in indexing for search yet we were indexing content within a lightbox. Also, search for terms like `image` was affected because we were using `Pasted image` as the filename for uploads that were pasted. 
`Post#cooked` ``` <p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p> ``` `PostSearchData#raw_data` Before: ``` This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000 ``` `PostSearchData#raw_data` After: ``` This is a test topic 0 Uncategorized Let me see how I can fix this image ``` In terms of indexing performance, we now have to parse the given HTML through nokogiri twice. However performance is not a huge worry here since a string length of 194170 takes only 30ms to scrub plus the indexing takes place in a background job.	2019-04-01 10:14:29 +08:00
Guo Xiang Tan	daeda80ada	FIX: Don't index posts with empty `Post#raw` for search. (#7263 ) * DEV: Remove unnecessary join in `Jobs::ReindexSearch`. * FIX: Don't index posts with empty `Post#raw` for search.	2019-04-01 10:06:27 +08:00
Penar Musaraj	51e08feb7e	DEV: Refactor icons used in lightbox HTML Uses <svg> elements instead of hacky CSS pseudoelements Adds a migration to mark posts with lightboxes as needing a rebake	2019-03-22 11:52:06 -04:00
Guo Xiang Tan	d808f36fc4	FIX: Reindex post for search when post is moved to a different topic. * This is causing certain posts to appear in searches incorrectly as `PostSearchData#raw_data` contains the outdated title, category name and tag names.	2019-03-19 17:19:14 +08:00
Régis Hanol	4481836de2	FEATURE: new 'search_ignore_accents' site setting	2018-09-17 10:42:30 +02:00
Régis Hanol	30619c244c	FIX: don't index urls to local files	2018-09-13 18:53:53 +02:00
Sam	9b7cab589a	FIX: revert diacritic stripping See more details in test case and at: https://meta.discourse.org/t/discourse-should-ignore-if-a-character-is-accented-when-doing-a-search/90198/16?u=sam	2018-08-31 11:46:55 +10:00
Régis Hanol	bc7b530b0a	FIX: remove diacritics instead of transliterating	2018-08-24 00:38:44 +02:00
Régis Hanol	2fcf2b899e	FIX: remove diacritics when tokenizing html for search	2018-08-23 17:13:52 +02:00

55 Commits
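
The host-lexeme commits above (`5c230266d3`, later narrowed by `15e9057ec5`) describe splitting a host on `.` and injecting extra terms. A minimal Ruby sketch of that idea follows. The method names `host_suffix_terms` and `all_host_terms` are hypothetical and this is not the actual `SearchIndexer` code; it only reproduces the term sets shown in those commit messages, without positions or weights.

```ruby
# Hypothetical helpers illustrating the commit messages; not Discourse's implementation.

# Terms kept after 15e9057ec5: every dot-joined suffix of the host.
# Prefix matching in search makes the individual leading labels redundant.
def host_suffix_terms(host)
  labels = host.split(".")
  (0...labels.length).map { |i| labels[i..-1].join(".") }
end

# Terms injected by 5c230266d3: each individual label plus every suffix.
def all_host_terms(host)
  (host.split(".") + host_suffix_terms(host)).uniq.sort
end

p all_host_terms("www.discourse.org")
p host_suffix_terms("test.discourse.org")
```

For `www.discourse.org` the first call yields the same five terms as the tsvector in `5c230266d3`; for `test.discourse.org` the second yields the three host-derived terms left at positions 10 and 11 in the "After" example of `15e9057ec5`.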