OpenSearch

Commit Graph

Author	SHA1	Message	Date
Martijn van Groningen	544822c78b	Moved keyword tokenizer to analysis-common module (#30642 ) Relates to #23658	2018-05-29 19:22:28 +02:00
Itamar Syn-Hershko	5f172b6795	[Feature] Adding a char_group tokenizer (#24186 ) === Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```	2018-05-22 16:26:31 +02:00
Christoph Büscher	b6340658f4	Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209 ) The camel case name `nGram` should be removed in favour of `ngram` and similar for `edgeNGram` and `edge_ngram`. Before removal, we need to deprecate the camel case names first. This change adds deprecation warnings for indices with versions 6.4.0 and higher and logs deprecation warnings.	2018-05-17 12:52:22 +02:00
Martijn van Groningen	7b95470897	Moved tokenizers to analysis common module (#30538 ) The following tokenizers were moved: classic, edge_ngram, letter, lowercase, ngram, path_hierarchy, pattern, thai, uax_url_email and whitespace. Left keyword tokenizer factory in server module, because normalizers directly depend on it.This should be addressed on a follow up change. Relates to #23658	2018-05-14 07:55:01 +02:00
Jim Ferenczi	dbd857341f	Upgrade to 7.4.0-snapshot-1ed95c097b (#30357 ) Upgrade to lucene-7.4.0-snapshot-1ed95c097b This version contains: * An Analyzer for Korean * An IntervalQuery and IntervalsSource that retrieve minimum intervals of positional queries. * A new API to retrieve matches (offsets and positions) of a query for a single document. * Support for soft deletes in the index writer. * A fixed shingle filter that handles index time synonyms. * Support for emoji sequence in ICUTokenizer (with an upgrade to icu 61.1)	2018-05-04 11:44:22 +02:00
Adrien Grand	231a63fdf8	Remove useless version checks in REST tests. (#30165 ) Many tests are added with a version check so that they do not run against a version that doesn't have the feature yet. Master is 7.0, so all tests that do not run against 6.0+ can be removed and the version check can be removed on all tests that always run on 6.0+.	2018-05-02 11:34:15 +02:00
Christoph Büscher	24763d881e	Deprecate use of `htmlStrip` as name for HtmlStripCharFilter (#27429 ) The camel case name `htmlStip` should be removed in favour of `html_strip`, but we need to deprecate it first. This change adds deprecation warnings for indices with version starting with 6.3.0 and logs deprecation warnings in this cases.	2018-04-19 16:48:17 +02:00
Christoph Büscher	231fd4eb18	Remove `delimited_payload_filter` (#27705 ) From 7.0 on, using `delimited_payload_filter` should throw an error. It was deprecated in 6.2 in favour of `delimited_payload` (#26625). Relates to #27704	2018-04-05 18:41:04 +02:00
Adrien Grand	3bdfc8f3fb	Upgrade to lucene-7.3.0-snapshot-98a6b3d. (#29298 ) Most notable changes include: - this release doesn't have the 7.2.1 version constant so I had to create one - spatial4j and jts were upgraded	2018-04-03 09:27:14 +02:00
Alan Woodward	af3f63616b	Allow TrimFilter to be used in custom normalizers (#27758 ) AnalysisFactoryTestCase checks that the ES custom token filter multi-term awareness matches the underlying lucene factory. For the trim filter this won't be the case until LUCENE-8093 is released in 7.3, so we add a temporary exclusion Closes #27310	2017-12-18 14:27:03 +00:00
Christoph Büscher	c4fe7d3f72	[Docs] add deprecation warning for `delimited_payload_filter` renaming	2017-12-04 10:22:05 +01:00
kel	4885acb048	Replace `delimited_payload_filter` by `delimited_payload` (#26625 ) The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is deprecated and should be replaced by `delimited_payload`. Closes #21978	2017-11-24 13:03:19 +01:00
Mayya Sharipova	148376c2c5	Add limits for ngram and shingle settings (#27211 ) * Add limits for ngram and shingle settings (#27211) Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes #25887	2017-11-07 08:14:55 -05:00
David Roberts	749c3ec716	Remove the single argument Environment constructor (#27235 ) Only tests should use the single argument Environment constructor. To enforce this the single arg Environment constructor has been replaced with a test framework factory method. Production code (beyond initial Bootstrap) should always use the same Environment object that Node.getEnvironment() returns. This Environment is also available via dependency injection.	2017-11-04 13:25:09 +00:00
Simon Willnauer	cdd7c1e6c2	Return List instead of an array from settings (#26903 ) Today we return a `String[]` that requires copying values for every access. Yet, we already store the setting as a list so we can also directly return the unmodifiable list directly. This makes list / array access in settings a much cheaper operation especially if lists are large.	2017-10-09 09:52:08 +02:00
Martijn van Groningen	b27e408ed2	Removed void token filter entries and added two tests	2017-10-05 13:25:05 +02:00
Md. Abdulla-Al-Sun	a40c474e10	Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238)	2017-10-05 13:25:05 +02:00
Simon Willnauer	00dfdf50cf	Represent lists as actual lists inside Settings (#26878 ) Today we represent each value of a list setting with it's own dedicated key that ends with the index of the value in the list. Aside of the obvious weirdness this has several issues especially if lists are massive since it causes massive runtime penalties when validating settings. Like a list of 100k words will literally cause a create index call to timeout and in-turn massive slowdown on all subsequent validations runs. With this change we use a simple string list to represent the list. This change also forbids to add a settings that ends with a .0 which was internally used to detect a list setting. Once this has been rolled out for an entire major version all the internal .0 handling can be removed since all settings will be converted. Relates to #26723	2017-10-05 09:27:08 +02:00
Simon Willnauer	aab4655e63	Unify Settings xcontent reading and writing (#26739 ) This change adds a fromXContent method to Settings that allows to read the xcontent that is produced by toXContent. It also replaces the entire settings loader infrastructure and removes the structured map representation. Future PRs will also tackle the `getAsMap` that exposes the internal represenation of settings for better encapsulation.	2017-09-25 13:23:01 +02:00
Michael Basnight	f385e0cf26	Add bad_request to the rest-api-spec catch params (#26539 ) This adds another request to the catch params. It also makes sure that the generic request param does not allow 400 either.	2017-09-14 14:24:03 -05:00
Jim Ferenczi	86d97971a4	Remove the _all metadata field (#26356 ) * Remove the _all metadata field This change removes the `_all` metadata field. This field is deprecated in 6 and cannot be activated for indices created in 6 so it can be safely removed in the next major version (e.g. 7).	2017-08-28 17:43:59 +02:00
Adrien Grand	eb782492be	Remove support for lenient booleans. Closes #22298	2017-08-28 09:56:01 +02:00
Martijn van Groningen	1146a35870	Move more token filters to analysis-common module The following token filters were moved: arabic_stem, brazilian_stem, czech_stem, dutch_stem, french_stem, german_stem and russian_stem. Relates to #23658	2017-08-11 17:39:24 +02:00
Martijn van Groningen	ff3b909a83	Moved HtmlStripCharFilterFactory to analyis.common package like the other factories.	2017-07-31 15:34:54 +02:00
Martijn van Groningen	0b776a1de0	Move more token filters to analysis-common module The following token filters were moved: delimited_payload_filter, keep, keep_types, classic, apostrophe, decimal_digit, fingerprint, min_hash and scandinavian_folding. Relates to #23658	2017-07-31 15:15:04 +02:00
Jim Ferenczi	ab3b5c695a	Pre-configured shingle filter should disable graph analysis (#25853 ) This change disables the graph analysis on default `shingle` filter. The pre-configured shingle filter produces shingles of different size. Graph analysis on such token stream is useless and dangerous as it may create too many paths. Fixes #25555	2017-07-24 18:42:15 +02:00
Martijn van Groningen	8003171a0c	Move more token filters to analysis-common module The following token filters were moved: arabic_normalization, german_normalization, hindi_normalization, indic_normalization, persian_normalization, scandinavian_normalization, serbian_normalization, sorani_normalization, cjk_width and cjk_width Relates to #23658	2017-07-17 08:29:44 +02:00
Jim Ferenczi	13da3eb53e	Refactor QueryStringQuery for 6.0 (#25646 ) This change refactors the query_string query to analyze the query text around logical operators of the query string the same way than a match_query/multi_match_query. It also adds a type parameter that can be used to change the way multi fields query are built the same way than a multi_match query does. Now that these queries share the same behavior regarding text analysis, some parameters are obsolete and have been deprecated: split_on_whitespace: This setting is now ignored with a deprecation notice if it is used explicitely. With this PR The query_string always splits on logical operator. It simplifies the understanding of the other parameters that can have different meanings depending on the value of split_on_whitespace. auto_generate_phrase_queries: This setting is now ignored with a deprecation notice if it is used explicitely. This setting only makes sense when the parser splits on whitespace. use_dismax: This setting is now ignored with a deprecation notice if it is used explicitely. The tie_breaker parameter is sufficient to handle best_fields/most_fields. Fixes #25574	2017-07-13 15:32:17 +02:00
Martijn van Groningen	6db708ef75	Move more token filters to analysis-common module The following token filters were moved: common grams, limit token, pattern capture and pattern raplace. Relates to #23658	2017-07-07 10:02:52 +02:00
Jun Ohtani	6894ef6057	[Analysis] Support normalizer in request param (#24767 ) * [Analysis] Support normalizer in request param Support normalizer param Support custom normalizer with char_filter/filter param Closes #23347	2017-07-04 19:16:56 +09:00
Martijn van Groningen	a34f5fa812	Move more token filters to analysis-common module The following token filters were moved: stemmer, stemmer_override, kstem, dictionary_decompounder, hyphenation_decompounder, reverse, elision and truncate. Relates to #23658	2017-06-26 09:02:16 +02:00
Jun Ohtani	62d1969595	Parse synonyms with the same analysis chain (#8049 ) * [Analysis] Parse synonyms with the same analysis chain Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain. Close #7199	2017-06-20 21:50:33 +09:00
Andy Bristol	4c5bd57619	Rename simple pattern tokenizers (#25300 ) Changed names to be snake case for consistency Related to #25159, original issue #23363	2017-06-19 13:48:43 -07:00
Nik Everett	ecc87f613f	Move pre-configured "keyword" tokenizer to the analysis-common module (#24863 ) Moves the keyword tokenizer to the analysis-common module. The keyword tokenizer is special because it is used by CustomNormalizerProvider so I pulled it out into its own PR. To get the move to work I've reworked the lookup from static to one using the AnalysisRegistry. This seems safe enough. Part of #23658.	2017-06-16 11:48:15 -04:00
Martijn van Groningen	428e70758a	Moved more token filters to analysis-common module. The following token filters were moved: `edge_ngram`, `ngram`, `uppercase`, `lowercase`, `length`, `flatten_graph` and `unique`. Relates to #23658	2017-06-15 18:28:31 +02:00
Andy Bristol	48696ab544	expose simple pattern tokenizers (#25159 ) Expose the experimental simplepattern and simplepatternsplit tokenizers in the common analysis plugin. They provide tokenization based on regular expressions, using Lucene's deterministic regex implementation that is usually faster than Java's and has protections against creating too-deep stacks during matching. Both have a not-very-useful default pattern of the empty string because all tokenizer factories must be able to be instantiated at index creation time. They should always be configured by the user in practice.	2017-06-13 12:46:59 -07:00
Jim Ferenczi	8250aa4267	Remove the postings highlighter and make unified the default highlighter choice (#25028 ) This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves exactly like the `unified` highlighter when index_options is set to `offsets`: https://issues.apache.org/jira/browse/LUCENE-7815 It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided). The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text. Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such. There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available. I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.	2017-06-09 14:09:57 +02:00
Nik Everett	73307a2144	Plugins can register pre-configured char filters (#25000 ) Fixes the plumbing so plugins can register char filters and moves the `html_strip` char filter into analysis-common. Relates to #23658	2017-06-05 09:25:15 -04:00
Martijn van Groningen	258be2b135	Moved `keyword_marker`, `trim`, `snowball` and `porter_stemmer` tokenfilter factories from core to common-analysis module. Relates to #23658	2017-05-31 09:34:08 +02:00
Nik Everett	b9ea579633	Allow plugins to register pre-configured tokenizers (#24751 ) Allows plugins to register pre-configured tokenizers. Much of the decisions are the same as those in #24223, #24572, and #24223. This only migrates the lowercase tokenizer but I figure that is a good start because it proves out the features.	2017-05-19 12:07:04 -04:00
Nik Everett	0189a65e6b	Fail rest tests on yaml files (#24740 ) We've switched to supporting only `yml` files but anyone who didn't notice will commit a `yaml` file which won't be executed which is bad because it is easy not to notice. The test to catch this is simple enough that I think it is worth adding just to warn folks about their mistake.	2017-05-17 10:24:57 -04:00
Nik Everett	5fc6f17121	Fix new analysis-common yml tests These tests are broken because I added them with the `yml` extension and didn't realize that we weren't running tests with that extension until we merged #24659. I used that extension in anticipation of #24659 but didn't verify that the tests were actually running. Ooops! Closes #24734	2017-05-17 07:55:04 -04:00
Ryan Ernst	2a65bed243	Tests: Change rest test extension from .yaml to .yml (#24659 ) This commit renames all rest test files to use the .yml extension instead of .yaml. This way the extension used within all of elasticsearch for yaml is consistent.	2017-05-16 17:24:35 -07:00
Nik Everett	7ef390068a	Move remaining pre-configured token filters into analysis-common (#24716 ) Moves the remaining preconfigured token figured into the analysis-common module. There were a couple of tests in core that depended on the pre-configured token filters so I had to touch them: * `GetTermVectorsCheckDocFreqIT` depended on `type_as_payload` but didn't do anything important with it. I dropped the dependency. Then I moved the test to a single node test case because we're trying to cut down on the number of `ESIntegTestCase` subclasses. * `AbstractTermVectorsTestCase` and its subclasses depended on `type_as_payload`. I dropped their usage of the token filter and added an integration test for the termvectors API that uses `type_as_payload` to the `analysis-common` module. * `AnalysisModuleTests` expected a few pre-configured token filtes be registered by default. They aren't any more so I dropped this assertion. We assert that the `CommonAnalysisPlugin` registers these pre-built token filters in `CommonAnalysisFactoryTests` * `SearchQueryIT` and `SuggestSearchIT` had tests that depended on the specific behavior of the token filters so I moved the tests to integration tests in `analysis-common`.	2017-05-16 13:10:24 -04:00
Nik Everett	65f2717ab7	Make PreConfiguredTokenFilter harder to misuse (#24572 ) There are now three public static method to build instances of PreConfiguredTokenFilter and the ctor is private. I chose static methods instead of constructors because those allow us to change out the implementation returned if we so desire. Relates to #23658	2017-05-10 22:39:43 -04:00
Nik Everett	bb06d8ec4f	Allow plugins to build pre-configured token filters (#24223 ) This changes the way we register pre-configured token filters so that plugins can declare them and starts to move all of the pre-configured token filters out of core. It doesn't finish the job because doing so would make the change unreviewably large. So this PR includes a shim that keeps the "old" way of registering pre-configured token filters around. The Lowercase token filter is special because there is a "special" interaction between it and the lowercase tokenizer. I'm not sure exactly what to do about it so for now I'm leaving it alone with the intent of figuring out what to do with it in a followup. This also renames these pre-configured token filters from "pre-built" to "pre-configured" because that seemed like a more descriptive name. This is a part of #23658	2017-05-09 14:50:49 -04:00
Nik Everett	7c3efb829b	Move char filters into analysis-common (#24261 ) Another step down the road to dropping the lucene-analyzers-common dependency from core. Note that this removes some tests that no longer compile from core. I played around with adding them to the analysis-common module where they would compile but we already test these in the tests generated from the example usage in the documentation. I'm not super happy with the way that `requriesAnalysisSettings` works with regards to plugins. I think it'd be fairly bug-prone for plugin authors to use. But I'm making it visible as is for now and I'll rethink later. A part of #23658	2017-04-26 13:25:34 -04:00
Nik Everett	caf376c8af	Start building analysis-common module (#23614 ) Start moving built in analysis components into the new analysis-common module. The goal of this project is: 1. Remove core's dependency on lucene-analyzers-common.jar which should shrink the dependencies for transport client and high level rest client. 2. Prove that analysis plugins can do all the "built in" things by moving all "built in" behavior to a plugin. 3. Force tests not to depend on any oddball analyzer behavior. If tests need anything more than the standard analyzer they can use the mock analyzer provided by Lucene's test infrastructure.	2017-04-19 18:51:34 -04:00

1 2

98 Commits