OpenSearch

Commit Graph

Author	SHA1	Message	Date
James Rodewig	4bc6d2dbec	[DOCS] Correct link for Lucene StopFilter	2020-03-05 14:52:25 -05:00
James Rodewig	0c4bf64095	[DOCS] Fix several Asciidoctor double arrow replacements (#52827 ) Per the [Asciidoctor docs][0], Asciidoctor replaces the following syntax with double arrows in the rendered HTML: * => renders as ⇒ * <= renders as ⇐ This escapes several unintended replacements, such as in the Painless docs. Where appropriate, it also replaces some double arrow instances with single arrows for consistency. [0]: https://asciidoctor.org/docs/user-manual/#replacements	2020-03-04 08:43:19 -05:00
James Rodewig	cf87724ff6	[DOCS] Reformat `stop` token filter (#53059 ) Makes the following changes to the `stop` token filter docs: * Updates description * Adds a link to the related Lucene filter * Adds detailed analyze snippet * Updates custom analyzer and custom filter snippets * Adds a list of predefined stop words by language Co-authored-by: ScottieL <36999642+ScottieL@users.noreply.github.com>	2020-03-03 13:22:52 -05:00
James Rodewig	d336faa0b0	[DOCS] Reformat trim token filter docs (#51649 ) Makes the following changes to the `trim` token filter docs: * Updates description * Adds a link to the related Lucene filter * Adds tip about removing whitespace using tokenizers * Adds detailed analyze snippets * Adds custom analyzer snippet	2020-03-02 07:48:23 -05:00
rhymes	7eb4c07f1f	[DOCS] Fix typo in index and search analysis docs (#52988 )	2020-03-02 07:25:01 -05:00
debadair	291713f284	[DOCS] Fixed typo in jump link. (#52302 )	2020-02-12 17:53:00 -08:00
James Rodewig	36b2663e98	[DOCS] Add attribute for Lucene analysis links (#51687 ) Adds a `lucene-analysis-docs` attribute for the Lucene `/analysis/` javadocs directory. This should prevent typos and keep the docs DRY.	2020-01-30 11:24:01 -05:00
James Rodewig	4fcf5a9de4	[DOCS] Rewrite analysis intro (#51184 ) * [DOCS] Rewrite analysis intro. Move index/search analysis content. * Rewrites 'Text analysis' page intro as high-level definition. Adds guidance on when users should configure text analysis * Rewrites and splits index/search analysis content: * Conceptual content -> 'Index and search analysis' under 'Concepts' * Task-based content -> 'Specify an analyzer' under 'Configure...' * Adds detailed examples for when to use the same index/search analyzer and when not. * Adds new example snippets for specifying search analyzers * clarifications * Add toc. Decrement headings. * Reword 'When to configure' section * Remove sentence from tip	2020-01-30 09:32:16 -05:00
James Rodewig	70e4ae3381	[DOCS] Reformat unique token filter docs (#50748 ) * Updates the description * Adds analyze, custom analyzer, and custom filter snippets * Adds parameter documentation	2020-01-28 10:42:25 -05:00
James Rodewig	23b65390ab	[DOCS] Add response snippets to 'Testing analyzers' page (#51427 ) Adds response snippets to the `POST _analyze` snippets in the 'Testing analyzers' page. Co-authored-by: Emmanuel DEMEY <demey.emmanuel@gmail.com>	2020-01-27 08:41:44 -05:00
James Rodewig	7ef906fde8	[DOCS] Add tutorials section to analysis topic (#50809 ) Adds a 'Configure text analysis' page to house tutorial content for the analysis topic. Also relocates the following pages as children of this new page: * 'Test an analyzer' * 'Configuring built-in analyzers' * 'Create a custom analyzer' I plan to add a tutorial for specifying index-time and search-time analyzers to this section as part of a future PR.	2020-01-16 13:12:06 -05:00
James Rodewig	ef26763ca9	[DOCS] Add concepts section to analysis topic (#50801 ) This helps the topic better match the structure of our machine learning docs, e.g. https://www.elastic.co/guide/en/machine-learning/7.5/ml-concepts.html This PR only includes the 'Anatomy of an analyzer' page as a 'Concepts' child page, but I plan to add other concepts, such as 'Index time vs. search time', with later PRs.	2020-01-16 13:00:39 -05:00
James Rodewig	1edaf2b101	[DOCS] Retitle analysis reference pages (#51071 ) * Changes titles to sentence case. * Appends pages with 'reference' to differentiate their content from conceptual overviews. * Moves the 'Normalizers' page to end of the Analysis topic pages.	2020-01-16 12:30:51 -05:00
PND	1d391f7113	[Docs] Fix example output of edge n-gram token filter. (#51085 )	2020-01-16 11:34:00 +01:00
James Rodewig	78c9eee5ea	[DOCS] Add section ID to analysis overview page	2020-01-08 14:43:41 -06:00
James Rodewig	9d1567b13b	[DOCS] Add overview page to analysis topic (#50515 ) Adds a 'text analysis overview' page to the analysis topic docs. The goals of this page are: * Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples * Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis * Highlight how analysis can be used to improve search results	2020-01-08 12:54:00 -06:00
James Rodewig	20eba1e410	[DOCS] Reformat reverse token filter docs (#50672 ) * Updates the description and adds a Lucene link * Adds analyze and custom analyzer snippets	2020-01-07 11:01:55 -06:00
James Rodewig	8009b07ccb	[DOCS] Reformat truncate token filter docs (#50687 ) * Updates the description and adds a Lucene link * Adds analyze, custom analyzer, and custom filter snippets * Adds parameter documentation	2020-01-07 10:33:57 -06:00
James Rodewig	e6a469cc74	[DOCS] Reformat uppercase token filter docs (#50555 ) * Updates the description and adds a Lucene link * Adds analyze and custom analyzer snippets	2020-01-03 08:39:08 -05:00
James Rodewig	7a14607a25	[DOCS] Abbreviate token filter titles (#50511 )	2019-12-27 11:01:52 -05:00
Nik Everett	01293ebad5	Fix docs typos (#50365 ) (#50464 ) Fixes a few typos in the docs. Co-authored-by: Xiang Dai <764524258@qq.com>	2019-12-23 12:38:17 -05:00
James Rodewig	cd04021961	[DOCS] Reformat token count limit filter docs (#49835 )	2019-12-13 08:44:39 -05:00
James Rodewig	1186a5dc09	[DOCS] Reformat lowercase token filter docs (#49935 )	2019-12-12 09:50:12 -05:00
James Rodewig	87a73b6bdf	[DOCS] Reformat length token filter docs (#49805 ) * Adds a title abbreviation * Updates the description and adds a Lucene link * Reformats the parameters section * Adds analyze, custom analyzer, and custom filter snippets Relates to #44726.	2019-12-04 09:59:08 -05:00
James Rodewig	ade72b97b7	[DOCS] Reformat keep types and keep words token filter docs (#49604 ) * Adds title abbreviations * Updates the descriptions and adds Lucene links * Reformats parameter definitions * Adds analyze and custom analyzer snippets * Adds explanations of token types to keep types token filter and tokenizer docs	2019-12-02 09:40:50 -05:00
James Rodewig	2fd58bb845	[DOCS] Add missing "_type" to delimited payload token filter docs	2019-11-25 16:16:05 -05:00
James Rodewig	c40449ac22	[DOCS] Reformat delimited payload token filter docs (#49380 ) * Adds a title abbreviation * Relocates the older name deprecation warning * Updates the description and adds a Lucene link * Adds a note to explain payloads and how to store them * Adds analyze and custom analyzer snippets * Adds a 'Return stored payloads' example	2019-11-25 15:40:05 -05:00
James Rodewig	d06c71eb82	[DOCS] Fix edge n-gram tokenizer nav Adds a missing float tag to the edge n-gram tokenizer docs. This tag ensures the edge n-gram tokenizer docs display on the same page.	2019-11-22 15:54:07 -05:00
James Rodewig	562607d3f5	[DOCS] Reformat n-gram token filter docs (#49438 ) Reformats the edge n-gram and n-gram token filter docs. Changes include: * Adds title abbreviations * Updates the descriptions and adds Lucene links * Reformats parameter definitions * Adds analyze and custom analyzer snippets * Adds notes explaining differences between the edge n-gram and n-gram filters Additional changes: * Switches titles to use "n-gram" throughout. * Fixes a typo in the edge n-gram tokenizer docs * Adds an explicit anchor for the `index.max_ngram_diff` setting	2019-11-22 10:38:50 -05:00
Christoph Büscher	4ffa050735	Allow custom characters in token_chars of ngram tokenizers (#49250 ) Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows for a list of predefined character classes, which might not fit every use case. For example, including underscore "_" in a token would currently require the `punctuation` class which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to include in a token. Closes #25894	2019-11-20 10:37:12 +01:00
James Rodewig	a26916cc23	[DOCS] Reformat elision token filter docs (#49262 )	2019-11-19 10:55:22 -05:00
James Rodewig	8639ddab5e	[DOCS] Reformat fingerprint token filter docs (#49311 )	2019-11-19 10:55:21 -05:00
gpaimla	7d20b50f45	Implement Lucene EstonianAnalyzer, Stemmer (#49149 ) This PR adds a new analyzer and stemmer for the Estonian language. Closes #48895	2019-11-18 17:24:21 +01:00
James Rodewig	095c34359f	[DOCS] Note limitations of `max_gram` parm in `edge_ngram` tokenizer for index analyzers (#49007 ) The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results. To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length. However, this could return irrelevant results. This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach. Closes #48956.	2019-11-13 14:28:12 -05:00
James Rodewig	838af15d29	[DOCS] Reformat compound word token filters (#49006 ) * Separates the compound token filters doc pages into separate token filter pages: * Dictionary decompounder token filter * Hyphenation decompounder token filter * Adds analyze API examples for each compound token filter * Adds a redirect for the removed compound token filters page Co-Authored-By: debadair <debadair@elastic.co>	2019-11-13 09:36:52 -05:00
James Rodewig	dd92830801	[DOCS] Reformat condition token filter (#48775 )	2019-11-11 08:49:44 -05:00
Julian Simioni	5e4501eb3f	[Docs] Consolidate single example into a single line (#48904 ) The first example of splitting rules for the `word_delimiter` token filter was spread across two bullet points. This makes it look like they are two separate splitting rules.	2019-11-08 15:12:45 -05:00
James Rodewig	700a316bb3	[DOCS] Reformat decimal digit token filter docs (#48722 )	2019-11-01 12:38:14 -04:00
Peter Johnson	3f7aafa421	[DOCS] Fix typo in synonym token filter docs (#48691 )	2019-10-31 09:12:24 -04:00
James Rodewig	3d5b1725a9	[DOCS] Remove unneeded filter from common grams analyze ex (#48748 )	2019-10-31 09:08:14 -04:00
James Rodewig	77acbc4fa9	[DOCS] Reformat common grams token filter (#48426 )	2019-10-30 08:40:56 -04:00
James Rodewig	06dc1fbd96	[DOCS] Reformat ASCII folding token filter docs (#48143 )	2019-10-23 15:06:55 -05:00
James Rodewig	9c75f14a9f	[DOCS] Reformat classic token filter docs (#48314 )	2019-10-23 10:14:25 -05:00
James Rodewig	a66bb2c7ed	[DOCS] Reformat CJK bigram and CJK width token filter docs (#48210 )	2019-10-21 08:44:49 -05:00
James Rodewig	8677653c5b	[DOCS] Reformat apostrophe token filter docs (#48076 )	2019-10-16 08:51:14 -04:00
Wilder Pereira	8c73e215b2	[DOCS] Remove unneeded spaces from custom analyzer snippet (#47332 )	2019-10-15 15:53:16 -04:00
James Rodewig	601a88bede	[DOCS] Sort analyzers, tokenizers, and token filters alphabetically (#48068 )	2019-10-15 15:47:25 -04:00
James Rodewig	af7aba18d4	Fixed sample code for minhash (#46385 ) The sample code is wrong. Field type is required for the sample field. I guess the intention was to give the sample field the name ```fingerprint```, mapping it as ```text``` using the custom analyzer ```my_analyzer```	2019-09-12 13:29:44 -04:00
Abhilash Bolla	20e93bca6b	Fixed grammar in pattern replace char filter docs. (#46546 ) Minor grammar fix in the pattern replace char filter docs.	2019-09-10 11:04:07 -07:00
James Rodewig	b59ecde041	[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353 ) (#46502 )	2019-09-09 13:38:14 -04:00
James Rodewig	f04573f8e8	[DOCS] [5 of 5] Change // TESTRESPONSE comments to [source,console-results] (#46449 ) (#46459 )	2019-09-06 16:09:09 -04:00
James Rodewig	bb7bff5e30	[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295 ) (#46418 )	2019-09-06 09:22:08 -04:00
James Rodewig	3e62cf9d74	[DOCS] Correct custom analyzer callouts (#46030 )	2019-08-29 10:08:18 -04:00
James Rodewig	d46545f729	[DOCS] Update anchors and links for Elasticsearch API relocation (#44500 )	2019-07-19 09:18:23 -04:00
Christoph Büscher	2cc7f5a744	Allow reloading of search time analyzers (#43313 ) Currently changing resources (like dictionaries, synonym files etc...) of search time analyzers is only possible by closing an index, changing the underlying resource (e.g. synonym files) and then re-opening the index for the change to take effect. This PR adds a new API endpoint that allows triggering reloading of certain analysis resources (currently token filters) that will then pick up changes in underlying file resources. To achieve this we introduce a new type of custom analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows swapping out analysis components. Custom analyzers that contain filters that are marked as "updateable" will automatically choose this implementation. This PR also adds this capability to `synonym` token filters for use in search time analyzers. Relates to #29051	2019-06-28 09:55:40 +02:00
Alan Woodward	05a7333eca	Require [articles] setting in elision filter (#43083 ) We should throw an exception at construction time if a list of articles is not provided, otherwise we can get random NPEs during indexing. Relates to #43002	2019-06-27 09:02:36 +01:00
Sachin Frayne	44aedcf97a	Correct the description of generate_word_parts (#43026 )	2019-06-10 11:36:31 +01:00
James Rodewig	5342616a23	[DOCS] Add explicit `articles_case` parameter to Elision Token Filter example (#42987 )	2019-06-07 11:24:43 -04:00
Mayya Sharipova	5a76f46ac6	Fix error with mapping in docs Related to #39630	2019-05-30 10:28:09 -04:00
Peter Dyson	b84b5525e1	[DOCS] path_hierarchy tokenizer examples (#39630 ) Closes #17138	2019-05-30 09:17:55 -04:00
Alan Woodward	3a35427b6d	Improvements to docs around multiplexer and synonyms (#41645 ) This commit fixes a multiplexer doc error concerning synonyms, and adds suggestions on how to combine the two filters.	2019-05-07 09:10:14 +01:00
James Rodewig	d46f55f013	[DOCS] Add attribute to escape minimal pt token link in Asciidoctor (#41613 )	2019-04-30 14:11:48 -04:00
James Rodewig	53702efddd	[DOCS] Add anchors for Asciidoctor migration (#41648 )	2019-04-30 10:20:17 -04:00
Guilherme Ferreira	48a17d5768	[Docs] Correct default stop list constant (#41342 )	2019-04-23 19:13:51 +02:00
Guilherme Ferreira	23e40c040a	[Docs] Correct spelling of "_none_" (#41192 )	2019-04-15 15:12:28 +02:00
Guilherme Ferreira	414debd740	[Docs] Correct spelling the "_none_" stopwords element (#41191 )	2019-04-15 14:12:26 +02:00
Christoph Büscher	dfc70e6ef0	Correct indention in synonym docs (#40711 ) The stopword filter should be on the same level as the synonym filter in the example request. Correcting this for better readability.	2019-04-02 01:44:24 +02:00
Mayya Sharipova	671a209ed9	Correct errors in min_hash filter documentation Related to #39671	2019-03-08 16:21:24 -05:00
Mayya Sharipova	54d41afac1	Add documentation for min_hash filter (#39671 ) Closes #20757	2019-03-07 08:49:48 -05:00
jimczi	ecb6df137c	fix typo in synonym graph filter docs	2019-03-05 18:20:14 +01:00
Christoph Büscher	4b77d0434a	Remove `nGram` and `edgeNGram` token filter names (#39070 ) In #30209 we deprecated the camel case `nGram` filter name in favour of `ngram` and did the same for `edgeNGram` and `edge_ngram` and we are removing those names in 8.0. This change disallows using the deprecated names for new indices created in 7.0 by throwing an error if these filters are used. Relates to #38911	2019-02-21 16:55:40 +01:00
Jim Ferenczi	83402b1320	Remove beta marker from the synonym_graph docs (#38185 )	2019-02-19 10:49:49 +01:00
Mayya Sharipova	0e1b1959fe	Correct rebuilt persian analyzer (#38724 ) (#38744 ) Make substitution of \u200C with a space explicit. The symbol `\u200C` in a test string should be substituted with a space in the rebuilt Persian analyzer, but it is not. Changing the line `"mappings": [ "\\u200C=> "] <1>` to `"mappings": [ "\\u200C=>\\u0020"] <1>` solves the problem. This change explicitly says to substitute ZWNJ with a space. Closes #38188	2019-02-11 14:17:18 -05:00
Christoph Büscher	34f2d2ec91	Remove remaining occurances of "include_type_name=true" in docs (#37646 )	2019-01-22 15:13:52 +01:00
Christoph Büscher	3a96608b3f	Remove more include_type_name and types from docs (#37601 )	2019-01-18 14:11:18 +01:00
Christoph Büscher	25aac4f77f	Remove `include_type_name` in asciidoc where possible (#37568 ) The "include_type_name" parameter was temporarily introduced in #37285 to facilitate moving the default parameter setting to "false" in many places in the documentation code snippets. Most of the places can simply be reverted without causing errors. In this change I looked for asciidoc files that contained the "include_type_name=true" addition when creating new indices but didn't look like they made use of the "_doc" type for mappings. This is mostly the case e.g. in the analysis docs where index creation often only contains settings. I manually corrected the use of types in some places where the docs still used an explicit type name and not the dummy "_doc" type.	2019-01-18 09:34:11 +01:00
Julie Tibshirani	36a3b84fc9	Update the default for include_type_name to false. (#37285 ) * Default include_type_name to false for get and put mappings. * Default include_type_name to false for get field mappings. * Add a constant for the default include_type_name value. * Default include_type_name to false for get and put index templates. * Default include_type_name to false for create index. * Update create index calls in REST documentation to use include_type_name=true. * Some minor clean-ups around the get index API. * In REST tests, use include_type_name=true by default for index creation. * Make sure to use 'expression == false'. * Clarify the different IndexTemplateMetaData toXContent methods. * Fix FullClusterRestartIT#testSnapshotRestore. * Fix the ml_anomalies_default_mappings test. * Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests. We make sure to specify include_type_name=true during xContent parsing, so we continue to test the legacy typed responses. XContent generation for the typeless responses is currently only covered by REST tests, but we will be adding unit test coverage for these as we implement each typeless API in the Java HLRC. This commit also refactors GetMappingsResponse to follow the same approach as the other mappings-related responses, where we read include_type_name out of the xContent params, instead of creating a second toXContent method. This gives better consistency in the response parsing code. * Fix more REST tests. * Improve some wording in the create index documentation. * Add a note about types removal in the create index docs. * Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL. * Make sure to mention include_type_name in the REST docs for affected APIs. * Make sure to use 'expression == false' in FullClusterRestartIT. * Mention include_type_name in the REST templates docs.	2019-01-14 13:08:01 -08:00
Josh Soref	edb48321ba	[DOCS] Various spelling corrections (#37046 )	2019-01-07 14:44:12 +01:00
Christoph Büscher	132ccbec2f	[Docs] Extend common-grams-tokenfilter doctest example (#36807 ) Adding an example output using the "_analyze" API and expected response.	2018-12-19 09:49:23 +01:00
Christoph Büscher	41feaf137c	[Docs] Fix error in Common Grams Token Filter (#36774 ) The first example given is missing the two single-token cases for "is" and "a". The later usage example is slightly wrong in that custom analyzers should go under `settings.analysis.analyzer`.	2018-12-18 16:54:06 +01:00
Alan Woodward	af57575838	Allow word_delimiter_graph_filter to not adjust internal offsets (#36699 ) This commit adds an adjust_offsets parameter to the word_delimiter_graph token filter, defaulting to true. Most of the time you'd want sub-tokens emitted by this filter to have offsets that are adjusted to their real position in the token stream; however, some token filters can change the length or starting position of a token (eg trim) without changing their offset attributes, and this can lead to word_delimiter_graph emitting illegal offsets. Setting adjust_offsets to false in these cases will allow indexing again. Fixes #34741, #33710	2018-12-18 13:20:51 +00:00
Jim Ferenczi	18866c4c0b	Make hits.total an object in the search response (#35849 ) This commit changes the format of the `hits.total` in the search response to be an object with a `value` and a `relation`. The `value` indicates the number of hits that match the query and the `relation` indicates whether the number is accurate (in which case the relation equals `eq`) or a lower bound of the total (in which case it equals `gte`). This change also adds a parameter called `rest_total_hits_as_int` that can be used in the search APIs to opt out from this change (retrieve the total hits as a number in the rest response). Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain `hits.total` (`track_total_hits: false`). We'll add a way to get a lower bound of the total hits in a follow up (to allow numbers to be passed to `track_total_hits`). Relates #33028	2018-12-05 19:49:06 +01:00
Alan Woodward	a646f85a99	Ensure TokenFilters only produce single tokens when parsing synonyms (#34331 ) A number of tokenfilters can produce multiple tokens at the same position. This is a problem when using token chains to parse synonym files, as the SynonymMap requires that there are no stacked tokens in its input. This commit ensures that when used to parse synonyms, these tokenfilters either produce a single version of their input token, or that they throw an error when mappings are generated. In indexes created in elasticsearch 6.x deprecation warnings are emitted in place of the error. * asciifolding and cjk_bigram produce only the folded or bigrammed token * decompounders, synonyms and keyword_repeat are skipped * n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors Fixes #34298	2018-11-29 10:35:38 +00:00
Alan Woodward	26cc8ff8c3	Add pointer to the index-phrases option in shingle filter docs (#35771 ) We should be discouraging the use of shingle filters and instead pointing users to the index-phrases parameter on text fields.	2018-11-21 15:27:11 +00:00
Alan Woodward	f6a43b5939	Add a prebuilt ICU Analyzer (#34958 ) The ICU plugin provides the building blocks of an analysis chain, but doesn't actually have a prebuilt analyzer. It would be better for users if there was a simple analyzer that they could use out of the box, and also something we can point to from the CJK Analyzer docs as a superior alternative. Relates to #34285	2018-11-21 09:00:48 +00:00
Julie Tibshirani	f854330e06	Make sure to use the type _doc in the REST documentation. (#34662 ) * Replace custom type names with _doc in REST examples. * Avoid using two mapping types in the percolator docs. * Rename doc -> _doc in the main repository README. * Also replace some custom type names in the HLRC docs.	2018-10-22 11:54:04 -07:00
Christoph Büscher	e869f9a78c	[Docs] Update synonym-tokenfilter.asciidoc (#34706 ) Remove ugly double-dot.	2018-10-22 17:18:29 +02:00
Jim Ferenczi	a9daa5cb90	[DOCS] Remove beta label from normalizers (#34326 )	2018-10-05 15:42:00 +02:00
Nikolay Vasiliev	16956a1a05	[DOCS] Clarify 'type' parameter meaning for custom analyzer (#34012 ) This pull request improves the docs on the meaning of type parameter on the custom analyzer doc page. Closes #33456	2018-09-25 15:32:27 +02:00
Alan Woodward	5107949402	Allow TokenFilterFactories to rewrite themselves against their preceding chain (#33702 ) We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they appear before them in a chain, because they produce multiple tokens at the same position. This commit adds two methods to the TokenFilterFactory interface. * `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`. * `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym list `Analyzer`. By default it returns `true`. Fixes #33609	2018-09-19 15:52:14 +01:00
Alan Woodward	f598297f55	Add predicate_token_filter (#33431 ) This allows users to filter out tokens from a TokenStream using painless scripts, instead of having to write specialised Java code and packaging it up into a plugin. The commit also refactors the AnalysisPredicateScript.Token class so that it wraps and makes read-only an AttributeSource.	2018-09-11 09:16:39 +01:00
Jim Ferenczi	7ad71f906a	Upgrade to a Lucene 8 snapshot (#33310 ) The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly. Some comments about the change: * Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309 Closes #32899	2018-09-06 14:42:06 +02:00
Alan Woodward	636442700c	Add conditional token filter to elasticsearch (#31958 ) This allows tokenfilters to be applied selectively, depending on the status of the current token in the tokenstream. The filter takes a scripted predicate, and only applies its subfilter when the predicate returns true.	2018-09-05 14:52:43 +01:00
Matthias Sieber	a39f6f09f4	fixed elements in array of produced terms (#32519 )	2018-08-02 11:12:15 -04:00
Christoph Büscher	61486680a2	Add exclusion option to `keep_types` token filter (#32012 ) Currently the `keep_types` token filter includes all token types specified using its `types` parameter. Lucene's TypeTokenFilter also provides a second mode where instead of keeping the specified tokens (include) they are filtered out (exclude). This change exposes this option as a new `mode` parameter that can either take the values `include` (the default, if not specified) or `exclude`. Closes #29277	2018-07-17 09:04:41 +02:00
Sohaib Iftikhar	88c270d844	Added lenient flag for synonym token filter (#31484 ) * Added lenient flag for synonym-tokenfilter. Relates to #30968 * added docs for synonym-graph-tokenfilter -- Also made lenient final -- changed from !lenient to lenient == false * Changes after review (1) -- Renamed to ElasticsearchSynonymParser -- Added explanation for ElasticsearchSynonymParser::add method -- Changed ElasticsearchSynonymParser::logger instance to static * Added lenient option for WordnetSynonymParser -- also added more documentation * Added additional documentation * Improved documentation	2018-07-10 17:11:50 -04:00
Alan Woodward	5683bc60a6	Multiplexing token filter (#31208 ) The `multiplexer` filter emits multiple tokens at the same position, each version of the token having been passed through a different filter chain. Identical tokens at the same position are removed. This allows users to, for example, index lowercase and original-case tokens, or stemmed and unstemmed versions, in the same field, so that they can search for a stemmed term within x positions of an unstemmed term.	2018-06-20 10:16:26 +01:00
Alan Woodward	8c0ec05a12	Expose lucene's RemoveDuplicatesTokenFilter (#31275 )	2018-06-18 09:46:12 +01:00
Itamar Syn-Hershko	5f172b6795	[Feature] Adding a char_group tokenizer (#24186 ) === Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```	2018-05-22 16:26:31 +02:00
Jim Ferenczi	bdb79d021a	Fix docs failure on language analyzers (#30722 ) This commit fixes docs failure on language analyzers when compared to the built in analyzers. The `elision` filters used by the rebuilt language analyzers should be case insensitive to match the definition of the prebuilt analyzers. Closes #30557	2018-05-22 09:58:12 +02:00

324 Commits