OpenSearch

Commit Graph

Author	SHA1	Message	Date
markharwood	e197b6c45b	Analysis enhancement - add preserve_original setting in ngram-token-filter (#55432 ) (#56100 ) Authored-by: Amit Khandelwal <amitmbm87@gmail.com>	2020-05-04 11:31:28 +01:00
Amit Khandelwal	126e4acca8	Expose `preserve_original` in `edge_ngram` token filter (#55766 ) The Lucene `preserve_original` setting is currently not supported in the `edge_ngram` token filter. This change adds it with a default value of `false`. Closes #55767	2020-04-28 10:24:27 +02:00
Rory Hunter	d66af46724	Always use deprecateAndMaybeLog for deprecation warnings (#55319 ) Backport of #55115. Replace calls to deprecate(String,Object...) with deprecateAndMaybeLog(...), with an appropriate key, so that all messages can potentially be deduplicated.	2020-04-23 09:20:54 +01:00
David Turner	7941f4a47e	Add RepositoriesService to createComponents() args (#54814 ) Today we pass the `RepositoriesService` to the searchable snapshots plugin during the initialization of the `RepositoryModule`, forcing the plugin to be a `RepositoryPlugin` even though it does not implement any repositories. After discussion we decided it best for now to pass this in via `Plugin#createComponents` instead, pending some future work in which plugins can depend on services more dynamically.	2020-04-16 16:27:36 +01:00
Jason Tedor	5fcda57b37	Rename MetaData to Metadata in all of the places (#54519 ) This is a simple naming change PR, to fix the fact that "metadata" is a single English word, and for too long we have not followed general naming conventions for it. We are also not consistent about it, for example, METADATA instead of META_DATA if we were trying to be consistent with MetaData (although METADATA is correct when considered in the context of "metadata"). This was a simple find and replace across the code base, only taking a few minutes to fix this naming issue forever.	2020-03-31 17:24:38 -04:00
Jake Landis	db3420d757	[7.x] Optimize which Rest resources are used by the Rest tests… (#53766 ) This should help with Gradle's incremental compile such that projects only depend upon the resources they use. related #52114	2020-03-19 12:28:59 -05:00
Jay Modi	f3f6ff97ee	Single instance of the IndexNameExpressionResolver (#52604 ) This commit modifies the codebase so that our production code uses a single instance of the IndexNameExpressionResolver class. This change is being made in preparation for allowing name expression resolution to be augmented by a plugin. In order to remove some instances of IndexNameExpressionResolver, the single instance is added as a parameter of Plugin#createComponents and PersistentTaskPlugin#getPersistentTasksExecutor. Backport of #52596	2020-02-21 07:50:02 -07:00
Adrien Grand	ad9d2f1922	Move analysis/mappings stats to cluster-stats. (#51875 ) Closes #51138	2020-02-05 11:02:25 +01:00
Marios Trivyzas	fda25ed04a	Fix caching for PreConfiguredTokenFilter (#50912 ) (#51091 ) The PreConfiguredTokenFilter#singletonWithVersion uses the version internally for the token filter factories but it registers only one instance in the cache and not one instance per version. This can lead to exceptions like the one described in #50734 since the singleton is created and cached using the version created of the first index that is processed. Remove the singletonWithVersion() methods and use the elasticsearchVersion() methods instead. Fixes: #50734 (cherry picked from commit 24e1858)	2020-01-16 13:58:02 +01:00
Christoph Büscher	2f13751bad	Deprecate and remove camel-case nGram and edgeNGram tokenizers (#50862 ) (#50991 ) We deprecated and removed the camel-case versions of the nGram and edgeNGram filters a while ago and we should do the same with the nGram and edgeNGram tokenizers. This PR deprecates the use of these names in favour of ngram and edge_ngram in 7. Usage will be disallowed on new indices starting with 8 then.	2020-01-14 21:42:34 +01:00
Alan Woodward	4974f56b25	Fix analysis BWC tests - warnings now emitted on index creation	2020-01-14 14:48:40 +00:00
Alan Woodward	8c16725a0d	Check for deprecations when analyzers are built (#50908 ) Generally speaking, deprecated analysis components in elasticsearch will issue deprecation warnings when they are first used. However, this means that no warnings are emitted when indexes are created with deprecated components, and users have to actually index a document to see warnings. This makes it much harder to see these warnings and act on them at appropriate times. This is worse in the case where components throw exceptions on upgrade. In this case, users will not be aware of a problem until a document is indexed, instead of at index creation time. This commit adds a new check that pushes an empty string through all user-defined analyzers and normalizers when an IndexAnalyzers object is built for each index; deprecation warnings and exceptions are now emitted when indexes are created or opened. Fixes #42349	2020-01-14 13:52:02 +00:00
Christoph Büscher	b1b4282273	Make Multiplexer inherit filter chains analysis mode (#50662 ) Currently, if an updateable synonym filter is included in a multiplexer filter, it is not reloaded via the _reload_search_analyzers because the multiplexer itself doesn't pass on the analysis mode of the filters it contains, so it's not recognized as "updateable" in itself. Instead we can check and merge the AnalysisMode settings of all filters in the multiplexer and use the resulting mode (e.g. search-time only) for the multiplexer itself, thus making any synonym filters contained in it reloadable. This, of course, will also make the analyzers using the multiplexer be usable at search-time only. Closes #50554	2020-01-08 22:12:01 +01:00
Christoph Büscher	6258d25458	Log deprecation for nGram and edgeNGram custom filters (#50376 ) (#50445 ) The camel-case `nGram` and `edgeNGram` filter names were deprecated in 6. We currently throw errors on new indices when they are used. However these errors are currently only thrown for pre-configured filters, adding them as custom filters doesn't trigger the warning and error. This change adds the appropriate deprecation warnings for `nGram` and `edgeNGram` respectively on version 7 indices. Relates #50360	2019-12-20 22:00:08 +01:00
Stuart Tettemer	689df1f28f	Scripting: ScriptFactory not required by compile (#50344 ) (#50392 ) Avoid backwards incompatible changes for 8.x and 7.6 by removing type restriction on compile and Factory. Factories may optionally implement ScriptFactory. If so, then they can indicate determinism and thus cacheability. Backport Relates: #49466	2019-12-19 12:50:25 -07:00
Stuart Tettemer	17cda5b2c0	Scripting: Groundwork for caching script results (#49895 ) (#49944 ) In order to cache script results in the query shard cache, we need to check if scripts are deterministic. This change adds a default method to the script factories, `isResultDeterministic() -> false` which is used by the `QueryShardContext`. Script results were never cached and that does not change here. Future changes will implement this method based on whether the results of the scripts are deterministic or not and therefore cacheable. Refs: #49466 Backport	2019-12-06 15:08:05 -07:00
Christoph Büscher	4ffa050735	Allow custom characters in token_chars of ngram tokenizers (#49250 ) Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows for a list of predefined character classes, which might not fit every use case. For example, including underscore "_" in a token would currently require the `punctuation` class which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to include in a token. Closes #25894	2019-11-20 10:37:12 +01:00
gpaimla	7d20b50f45	Implement Lucene EstonianAnalyzer, Stemmer (#49149 ) This PR adds a new analyzer and stemmer for the Estonian language. Closes #48895	2019-11-18 17:24:21 +01:00
Rory Hunter	c46a0e8708	Apply 2-space indent to all gradle scripts (#49071 ) Backport of #48849. Update `.editorconfig` to make the Java settings the default for all files, and then apply a 2-space indent to all `*.gradle` files. Then reformat all the files.	2019-11-14 11:01:23 +00:00
Rory Hunter	3c77c50f5f	Improve resiliency to auto-formatting in libs, modules (#48619 ) Backport of #48448. Make a number of changes so that code in the libs and modules directories is more resilient to automatic formatting. This covers: * Remove string concatenation where JSON fits on a single line * Move some comments around so they aren't auto-formatted to a strange place	2019-10-29 10:39:34 +00:00
Alan Woodward	697c693ee7	Reset Token position on reuse in scripted analysis (#47424 ) Most of the information in AnalysisPredicateScript.Token is pulled directly from its underlying AttributeSource, but we also keep track of the token position, and this state is held directly on the Token. This information needs to be reset when the containing ScriptFilteringTokenFilter or ScriptedConditionTokenFilter is re-used. Fixes #47197	2019-10-02 11:27:04 +01:00
Christoph Büscher	3366726ad1	Enable reloading of synonym_graph filters (#45135 ) Reloading of synonym_graph filter doesn't work currently because the search time AnalysisMode doesn't get propagated to the TokenFilterFactory emitted by the graph filters getChainAwareTokenFilterFactory() method. This change fixes that. Closes #45127	2019-08-02 15:33:42 +02:00
Alan Woodward	b6a0f098e6	Don't use index_phrases on graph queries (#44340 ) Due to https://issues.apache.org/jira/browse/LUCENE-8916, when you try to use a synonym filter with the index_phrases option on a text field, you can end up with null values in a Phrase query, leading to weird exceptions further down the querying chain. As a workaround, this commit disables the index_phrases optimization for queries that produce token graphs. Fixes #43976	2019-07-17 16:46:00 +01:00
Alan Woodward	4b99255fed	Add name() method to TokenizerFactory (#43909 ) This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory, and removes the need to pass around tokenizer names when building custom analyzers. As this means that TokenizerFactory is no longer a functional interface, the commit also adds a factory method to TokenizerFactory to make construction simpler.	2019-07-04 11:28:55 +01:00
Christoph Büscher	2cc7f5a744	Allow reloading of search time analyzers (#43313 ) Currently changing resources (like dictionaries, synonym files etc...) of search time analyzers is only possible by closing an index, changing the underlying resource (e.g. synonym files) and then re-opening the index for the change to take effect. This PR adds a new API endpoint that allows triggering reloading of certain analysis resources (currently token filters) that will then pick up changes in underlying file resources. To achieve this we introduce a new type of custom analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows swapping out analysis components. Custom analyzers that contain filters that are marked as "updateable" will automatically choose this implementation. This PR also adds this capability to `synonym` token filters for use in search time analyzers. Relates to #29051	2019-06-28 09:55:40 +02:00
Alan Woodward	51b230f6ab	Fix PreConfiguredTokenFilters getSynonymFilter() implementations (#38839 ) (#43678 ) When we added support for TokenFilterFactories to specialise how they were used when parsing synonym files, PreConfiguredTokenFilters were set up to either apply themselves, or be ignored. This behaviour is a leftover from an earlier iteration, and also has an incorrect default. This commit makes preconfigured token filters usable in synonym file parsing by default, and brings those filters that should not be used into line with index-specific filter factories; in indexes created before version 7 we emit a deprecation warning, and we throw an error in indexes created after. Fixes #38793	2019-06-28 08:19:00 +01:00
Alan Woodward	4882b932d8	Issue deprecation warnings when preconfigured delimited_payload_filter is used (#43684 ) #26625 deprecated delimited_payload_filter and added tests to check that warnings would be emitted when both a normal and pre-configured filter were used. Unfortunately, due to a bug in the Analyze API, the pre-configured filter check was never actually triggered, and it turns out that the deprecation warning was not in fact being emitted in this case. #43568 fixed the Analyze API bug, which then surfaced this on backport. This commit ensures that the preconfigured filter also emits the warnings and triggers an error if a new index tries to use a preconfigured delimited_payload_filter	2019-06-27 12:44:29 +01:00
Alan Woodward	8ff5519b11	Use preconfigured filters correctly in Analyze API (#43568 ) When a named token filter or char filter is passed as part of an Analyze API request with no index, we currently try and build the relevant filter using no index settings. However, this can miss cases where there is a pre-configured filter defined in the analysis registry. One example here is the elision filter, which has a pre-configured version built with the french elision set; when used as part of normal analysis, this preconfigured set is used, but when used as part of the Analyze API we end up with NPEs because it tries to instantiate the filter with no index settings. This commit changes the Analyze API to check for pre-configured filters in the case that the request has no index defined, and is using a name rather than a custom definition for a filter. It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram` tokenizer to make their settings consistent with the defaults used when creating them with no settings Closes #43002 Closes #43621 Closes #43582	2019-06-27 09:07:01 +01:00
Alan Woodward	05a7333eca	Require [articles] setting in elision filter (#43083 ) We should throw an exception at construction time if a list of articles is not provided, otherwise we can get random NPEs during indexing. Relates to #43002	2019-06-27 09:02:36 +01:00
Marios Trivyzas	ce30afcd01	Deprecate CommonTermsQuery and cutoff_frequency (#42619 ) (#42691 ) Since the max_score optimization landed in Elasticsearch 7, the CommonTermsQuery is redundant and slower. Moreover the cutoff_frequency parameter for MatchQuery and MultiMatchQuery is redundant. Relates to #27096 (cherry picked from commit 04b74497314eeec076753a33b3b6cc11549646e8)	2019-05-30 18:04:47 +02:00
Jim Ferenczi	4ca5649a0d	Upgrade to lucene 8.1.0-snapshot-e460356abe (#40952 )	2019-05-23 11:45:33 +02:00
Jason Tedor	7f3ab4524f	Bump 7.x branch to version 7.2.0 This commit adds the 7.2.0 version constant to the 7.x branch, and bumps BWC logic accordingly.	2019-05-01 13:38:57 -04:00
Alpar Torok	25944c4317	convert modules to use testclusters (#40804 ) * convert modules to use testclusters * Eliminate PluginPropertiesTask and move logic in plugin where it belongs	2019-04-04 11:45:40 +03:00
Alan Woodward	4296ff2fd1	Test that no-index synonyms can be used with the Analyze API (#40781 ) Relates to #23943	2019-04-04 09:03:51 +01:00
Alan Woodward	83d2870308	Add `use_field` option to intervals query (#40157 ) This is the equivalent of the `field_masking_span` query, allowing users to merge intervals from multiple fields - for example, to search for stemmed tokens near unstemmed tokens.	2019-03-20 16:26:04 +00:00
Christoph Büscher	4b77d0434a	Remove `nGram` and `edgeNGram` token filter names (#39070 ) In #30209 we deprecated the camel case `nGram` filter name in favour of `ngram` and did the same for `edgeNGram` and `edge_ngram` and we are removing those names in 8.0. This change disallows using the deprecated names for new indices created in 7.0 by throwing an error if these filters are used. Relates to #38911	2019-02-21 16:55:40 +01:00
Julie Tibshirani	c2e9d13ebd	Default include_type_name to false in the yml test harness. (#38058 ) This PR removes the temporary change we made to the yml test harness in #37285 to automatically set `include_type_name` to `true` in index creation requests if it's not already specified. This is possible now that the vast majority of index creation requests were updated to be typeless in #37611. A few additional tests also needed updating here. Additionally, this PR updates the test harness to set `include_type_name` to `false` in index creation requests when communicating with 6.x nodes. This mirrors the logic added in #37611 to allow for typeless document write requests in test set-up code. With this update in place, we can remove many references to `include_type_name: false` from the yml tests.	2019-02-01 11:44:13 -08:00
Colin Goodheart-Smithe	21e392e95e	Removes typed calls from YAML REST tests (#37611 ) This PR attempts to remove all typed calls from our YAML REST tests. The PR adds include_type_name: false to create index requests that use a mapping and also to put mapping requests. It also removes _type from index requests where they haven't already been removed. The PR ignores tests named *_with_types.yml since these are specifically testing typed API behaviour. The change also includes changing the test harness to add the type _doc to index, update, get and bulk requests that do not specify the document type when the test is running against a mixed 7.x/6.x cluster.	2019-01-30 16:32:58 +00:00
Christoph Büscher	b4b4cd6ebd	Clean codebase from empty statements (#37822 ) * Remove empty statements There are a couple of instances of undocumented empty statements all across the code base. While they are mostly harmless, they make the code hard to read and are potentially error-prone. Removing most of these instances and marking blocks that look empty by intention as such. * Change test, slightly more verbose but less confusing	2019-01-25 14:23:02 +01:00
Jun Ohtani	38b698d455	[Analysis] Deprecate Standard Html Strip Analyzer in master (#26719 ) * [Analysis] Deprecate Standard Html Strip Analyzer. Deprecate only the Standard Html Strip Analyzer. If a user creates an index with the analyzer in 7.0 or later, ES throws an exception. If an index was created before 7.0, ES issues a deprecation log. We will remove it in 8.0. Related #4704	2019-01-09 12:42:00 +09:00
Alan Woodward	af57575838	Allow word_delimiter_graph_filter to not adjust internal offsets (#36699 ) This commit adds an adjust_offsets parameter to the word_delimiter_graph token filter, defaulting to true. Most of the time you'd want sub-tokens emitted by this filter to have offsets that are adjusted to their real position in the token stream; however, some token filters can change the length or starting position of a token (eg trim) without changing their offset attributes, and this can lead to word_delimiter_graph emitting illegal offsets. Setting adjust_offsets to false in these cases will allow indexing again. Fixes #34741, #33710	2018-12-18 13:20:51 +00:00
Julie Tibshirani	87831051dc	Deprecate types in explain requests. (#35611 ) The following updates were made: - Add a new untyped endpoint `{index}/_explain/{id}`. - Add deprecation warnings to RestAction, plus tests in RestActionTests. - For each REST yml test, make sure there is one version without types, and another legacy version that retains types (called *_with_types.yml). - Deprecate relevant methods on the Java HLRC requests/responses. - Update documentation (for both the REST API and Java HLRC).	2018-12-10 19:45:13 -08:00
Jim Ferenczi	18866c4c0b	Make hits.total an object in the search response (#35849 ) This commit changes the format of the `hits.total` in the search response to be an object with a `value` and a `relation`. The `value` indicates the number of hits that match the query and the `relation` indicates whether the number is accurate (in which case the relation equals `eq`) or a lower bound of the total (in which case it equals `gte`). This change also adds a parameter called `rest_total_hits_as_int` that can be used in the search APIs to opt out from this change (retrieve the total hits as a number in the rest response). Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain `hits.total` (`track_total_hits: false`). We'll add a way to get a lower bound of the total hits in a follow up (to allow numbers to be passed to `track_total_hits`). Relates #33028	2018-12-05 19:49:06 +01:00
Alan Woodward	73ceaad03a	Update to lucene-8.0.0-snapshot-c78429a554 (#36212 ) Includes: * A fix for a bug in Intervals.or() (https://issues.apache.org/jira/browse/LUCENE-8586) * The ability to disable offset mangling in WordDelimiterGraphFilter (https://issues.apache.org/jira/browse/LUCENE-8509) * BM25Similarity no longer multiplies scores by k1 + 1	2018-12-05 12:43:56 +00:00
Alan Woodward	a646f85a99	Ensure TokenFilters only produce single tokens when parsing synonyms (#34331 ) A number of tokenfilters can produce multiple tokens at the same position. This is a problem when using token chains to parse synonym files, as the SynonymMap requires that there are no stacked tokens in its input. This commit ensures that when used to parse synonyms, these tokenfilters either produce a single version of their input token, or that they throw an error when mappings are generated. In indexes created in elasticsearch 6.x deprecation warnings are emitted in place of the error. * asciifolding and cjk_bigram produce only the folded or bigrammed token * decompounders, synonyms and keyword_repeat are skipped * n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors Fixes #34298	2018-11-29 10:35:38 +00:00
Jim Ferenczi	e37a0ef844	Upgrade to lucene-8.0.0-snapshot-67cdd21996 (#35816 )	2018-11-22 15:42:59 +01:00
Alpar Torok	8a85b2eada	Remove build qualifier from server's Version (#35172 ) With this change, `Version` no longer carries information about the qualifier, we still need a way to show the "display version" that does have both qualifier and snapshot. This is now stored by the build and read from `META-INF`.	2018-11-07 14:01:05 +02:00
Nick Knize	a5e1f4d3a2	Upgrade to lucene-8.0.0-snapshot-31d7dfe6b1 (#35224 )	2018-11-06 11:55:23 +01:00
Pratik Sanglikar	f1135ef0ce	Core: Replace deprecated Loggers calls with LogManager. (#34691 ) Replace deprecated Loggers calls with LogManager. Relates to #32174	2018-10-29 15:52:30 -04:00
Tal Levy	e1fdd00420	Lowercase static final DeprecationLogger instance names (#34887 ) After discussing on the team's FixItFriday, we concluded that static final instance variables that are mutable should be lowercased. Historically, DeprecationLogger was uppercased more frequently than lowercased.	2018-10-25 21:12:19 -07:00
Alpar Torok	59536966c2	Add a new "contains" feature (#34738 ) The contains syntax was added in #30874 but the skips were not properly put in place. The java runner has the feature so the tests will run as part of the build, but language clients will be able to support it at their own pace.	2018-10-25 08:50:50 +03:00
Christoph Büscher	c1c447a4cf	Check stemmer language setting early (#34601 ) Currently the StemmerTokenFilterFactory checks the validity of the language setting only when the first TokenStream is processed. Instead we should throw an error earlier at mapping creation time. This change adds a check to the StemmerTokenFilterFactory constructor that checks for a valid `language` setting by trying to create a new TokenStream from an empty input stream. This will throw errors about wrong language settings early on. Closes #34170	2018-10-19 12:59:23 +02:00
Alan Woodward	d25e3ef065	[CI] Use a single shard for synonym REST tests, to ensure score ordering is stable	2018-10-01 09:47:22 +01:00
Alan Woodward	f243d75f59	Remove special-casing of Synonym filters in AnalysisRegistry (#34034 ) The synonym filters no longer need access to the AnalysisRegistry in their constructors, so we can remove the special-case code and move them to the common analysis module. This commit means that synonyms are no longer available for `server` integration tests, so several of these are either rewritten or migrated to the common analysis module as rest-spec-api tests	2018-09-28 09:02:47 +01:00
Christoph Büscher	ba3ceeaccf	Clean up "unused variable" warnings (#31876 ) This change cleans up "unused variable" warnings. There are several cases where we most likely want to suppress the warnings (especially in the client documentation test where the snippets contain many unused variables). In a lot of cases the unused variables can just be deleted though.	2018-09-26 14:09:32 +02:00
Alan Woodward	b33c18d316	Move SoraniNormalizationFilterFactory to the common analysis plugin (#33892 ) Follow up to #25715	2018-09-20 17:31:41 +01:00
Alan Woodward	5107949402	Allow TokenFilterFactories to rewrite themselves against their preceding chain (#33702 ) We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they appear before them in a chain, because they produce multiple tokens at the same position. This commit adds two methods to the TokenFilterFactory interface. * `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`. * `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym list `Analyzer`. By default it returns `true`. Fixes #33609	2018-09-19 15:52:14 +01:00
Alan Woodward	f598297f55	Add predicate_token_filter (#33431 ) This allows users to filter out tokens from a TokenStream using painless scripts, instead of having to write specialised Java code and packaging it up into a plugin. The commit also refactors the AnalysisPredicateScript.Token class so that it wraps and makes read-only an AttributeSource.	2018-09-11 09:16:39 +01:00
Jim Ferenczi	7ad71f906a	Upgrade to a Lucene 8 snapshot (#33310 ) The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly. Some comments about the change: * Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309 Closes #32899	2018-09-06 14:42:06 +02:00
Alan Woodward	e134f9b5f3	Fix generics in ScriptPlugin#getContexts() (#33426 ) Changes the return value from List<ScriptContext> to List<ScriptContext<?>> to remove raw-types warnings.	2018-09-06 09:04:22 +01:00
Alan Woodward	636442700c	Add conditional token filter to elasticsearch (#31958 ) This allows tokenfilters to be applied selectively, depending on the status of the current token in the tokenstream. The filter takes a scripted predicate, and only applies its subfilter when the predicate returns true.	2018-09-05 14:52:43 +01:00
Jim Ferenczi	f4e9729d64	Remove unsupported Version.V_5_* (#32937 ) This change removes the es 5x version constants and their usages.	2018-08-24 09:51:21 +02:00
Alan Woodward	cfb30144c9	Call setReferences() on custom referring tokenfilters in _analyze (#32157 ) When building custom tokenfilters without an index in the _analyze endpoint, we need to ensure that referring filters are correctly built by calling their #setReferences() method Fixes #32154	2018-07-18 14:43:20 +01:00
Armin Braun	ed3b44fb4c	Handle TokenizerFactory TODOs (#32063 ) * Don't replace TokenizerFactory with Supplier, this approach was rejected in #32063 * Remove unused parameter from constructor	2018-07-17 14:14:02 +02:00
Christoph Büscher	61486680a2	Add exclusion option to `keep_types` token filter (#32012 ) Currently the `keep_types` token filter includes all token types specified using its `types` parameter. Lucene's TypeTokenFilter also provides a second mode where instead of keeping the specified tokens (include) they are filtered out (exclude). This change exposes this option as a new `mode` parameter that can either take the values `include` (the default, if not specified) or `exclude`. Closes #29277	2018-07-17 09:04:41 +02:00
Alan Woodward	a01e26a39b	Correct spelling of AnalysisPlugin#requriesAnalysisSettings (#32025 ) Because this is a static method on a public API, and one that we encourage plugin authors to use, the method with the typo is deprecated in 6.x rather than just renamed.	2018-07-13 13:13:21 +01:00
Alpar Torok	08b8d11e30	Add support for switching distribution for all integration tests (#30874 ) * remove left-over comment * make sure of the property for plugins * skip installing modules if these exist in the distribution * Log the distribution being run * Don't allow running with integ-tests-zip passed externally * top level x-pack/qa can't run with oss distro * Add support for matching objects in lists Makes it possible to have a key that points to a list and assert that a certain object is present in the list. All keys have to be present and values have to match. The objects in the source list may have additional fields. example: ``` match: { 'nodes.$master.plugins': { name: ingest-attachment } } ``` * Update plugin and module tests to work with other distributions Some of the tests expected that the integration tests will always be run with the `integ-test-zip` distribution so that there will be no other plugins loaded. With this change, we check for the presence of the plugin without assuming exclusivity. * Allow modules to run on other distros as well To match the behavior of tets.distributions * Add and use a new `contains` assertion Replaces the previous changes that caused `match` to do a partial match. * Implement PR review comments	2018-06-26 06:49:03 -07:00
Alan Woodward	5683bc60a6	Multiplexing token filter (#31208 ) The `multiplexer` filter emits multiple tokens at the same position, each version of the token having been passed through a different filter chain. Identical tokens at the same position are removed. This allows users to, for example, index lowercase and original-case tokens, or stemmed and unstemmed versions, in the same field, so that they can search for a stemmed term within x positions of an unstemmed term.	2018-06-20 10:16:26 +01:00
Martijn van Groningen	47095357bc	Move language analyzers from server to analysis-common module. (#31300 ) The following analyzers were moved from server module to analysis-common module: `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `latvian`, `lithuanian`, `norwegian`, `persian`, `portuguese`, `romanian`, `russian`, `sorani`, `spanish`, `swedish`, `turkish` and `thai`. Relates to #23658	2018-06-18 11:24:43 +02:00
Alan Woodward	8c0ec05a12	Expose lucene's RemoveDuplicatesTokenFilter (#31275 )	2018-06-18 09:46:12 +01:00
Martijn van Groningen	16d593b22f	Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202 ) instead of lambda that creates the analyzer	2018-06-13 07:25:19 +02:00
Tanguy Leroux	bf58660482	Remove all unused imports and fix CRLF (#31207 ) The X-Pack opening and the recent other refactorings left a lot of unused imports in the codebase. This commit removes them all.	2018-06-11 15:12:12 +02:00
Martijn van Groningen	07a57cc131	Move number of language analyzers to analysis-common module (#31143 ) The following analyzers were moved from server module to analysis-common module: `snowball`, `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `chinese`, `cjk`, `czech`, `danish`, `dutch`, `english`, `finnish`, `french`, `galician` and `german`. Relates to #23658	2018-06-08 08:58:46 +02:00
Martijn van Groningen	735d0e671a	Make PreBuiltAnalyzerProviderFactory pluggable via AnalysisPlugin and move `finger_print`, `pattern` and `standard_html_strip` analyzers to analysis-common module. (both AnalysisProvider and PreBuiltAnalyzerProvider) Changed PreBuiltAnalyzerProviderFactory to extend from PreConfiguredAnalysisComponent and changed to make sure that predefined analyzers are always instantiated with the current ES version and if an instance is requested for a different version then delegate to PreBuiltCache. This is similar to the behaviour that exists today in AnalysisRegistry.PreBuiltAnalysis and PreBuiltAnalyzerProviderFactory. (#31095) Relates to #23658	2018-06-06 07:40:21 +02:00
Martijn van Groningen	544822c78b	Moved keyword tokenizer to analysis-common module (#30642 ) Relates to #23658	2018-05-29 19:22:28 +02:00
Itamar Syn-Hershko	5f172b6795	[Feature] Adding a char_group tokenizer (#24186 ) === Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```	2018-05-22 16:26:31 +02:00
Christoph Büscher	b6340658f4	Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209 ) The camel case name `nGram` should be removed in favour of `ngram` and similar for `edgeNGram` and `edge_ngram`. Before removal, we need to deprecate the camel case names first. This change adds deprecation warnings for indices with versions 6.4.0 and higher and logs deprecation warnings.	2018-05-17 12:52:22 +02:00
Martijn van Groningen	7b95470897	Moved tokenizers to analysis common module (#30538 ) The following tokenizers were moved: classic, edge_ngram, letter, lowercase, ngram, path_hierarchy, pattern, thai, uax_url_email and whitespace. Left keyword tokenizer factory in server module, because normalizers directly depend on it. This should be addressed in a follow-up change. Relates to #23658	2018-05-14 07:55:01 +02:00
Jim Ferenczi	dbd857341f	Upgrade to 7.4.0-snapshot-1ed95c097b (#30357 ) Upgrade to lucene-7.4.0-snapshot-1ed95c097b This version contains: * An Analyzer for Korean * An IntervalQuery and IntervalsSource that retrieve minimum intervals of positional queries. * A new API to retrieve matches (offsets and positions) of a query for a single document. * Support for soft deletes in the index writer. * A fixed shingle filter that handles index time synonyms. * Support for emoji sequence in ICUTokenizer (with an upgrade to icu 61.1)	2018-05-04 11:44:22 +02:00
Adrien Grand	231a63fdf8	Remove useless version checks in REST tests. (#30165 ) Many tests are added with a version check so that they do not run against a version that doesn't have the feature yet. Master is 7.0, so all tests that do not run against 6.0+ can be removed and the version check can be removed on all tests that always run on 6.0+.	2018-05-02 11:34:15 +02:00
Christoph Büscher	24763d881e	Deprecate use of `htmlStrip` as name for HtmlStripCharFilter (#27429 ) The camel case name `htmlStrip` should be removed in favour of `html_strip`, but we need to deprecate it first. This change adds deprecation warnings for indices with version starting with 6.3.0 and logs deprecation warnings in these cases.	2018-04-19 16:48:17 +02:00
Christoph Büscher	231fd4eb18	Remove `delimited_payload_filter` (#27705 ) From 7.0 on, using `delimited_payload_filter` should throw an error. It was deprecated in 6.2 in favour of `delimited_payload` (#26625). Relates to #27704	2018-04-05 18:41:04 +02:00
Adrien Grand	3bdfc8f3fb	Upgrade to lucene-7.3.0-snapshot-98a6b3d. (#29298 ) Most notable changes include: - this release doesn't have the 7.2.1 version constant so I had to create one - spatial4j and jts were upgraded	2018-04-03 09:27:14 +02:00
Alan Woodward	af3f63616b	Allow TrimFilter to be used in custom normalizers (#27758 ) AnalysisFactoryTestCase checks that the ES custom token filter multi-term awareness matches the underlying lucene factory. For the trim filter this won't be the case until LUCENE-8093 is released in 7.3, so we add a temporary exclusion Closes #27310	2017-12-18 14:27:03 +00:00
Christoph Büscher	c4fe7d3f72	[Docs] add deprecation warning for `delimited_payload_filter` renaming	2017-12-04 10:22:05 +01:00
kel	4885acb048	Replace `delimited_payload_filter` by `delimited_payload` (#26625 ) The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is deprecated and should be replaced by `delimited_payload`. Closes #21978	2017-11-24 13:03:19 +01:00
Mayya Sharipova	148376c2c5	Add limits for ngram and shingle settings (#27211 ) * Add limits for ngram and shingle settings (#27211) Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes #25887	2017-11-07 08:14:55 -05:00
David Roberts	749c3ec716	Remove the single argument Environment constructor (#27235 ) Only tests should use the single argument Environment constructor. To enforce this the single arg Environment constructor has been replaced with a test framework factory method. Production code (beyond initial Bootstrap) should always use the same Environment object that Node.getEnvironment() returns. This Environment is also available via dependency injection.	2017-11-04 13:25:09 +00:00
Simon Willnauer	cdd7c1e6c2	Return List instead of an array from settings (#26903 ) Today we return a `String[]` that requires copying values for every access. Yet, we already store the setting as a list so we can also directly return the unmodifiable list directly. This makes list / array access in settings a much cheaper operation especially if lists are large.	2017-10-09 09:52:08 +02:00
Martijn van Groningen	b27e408ed2	Removed void token filter entries and added two tests	2017-10-05 13:25:05 +02:00
Md. Abdulla-Al-Sun	a40c474e10	Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238)	2017-10-05 13:25:05 +02:00
Simon Willnauer	00dfdf50cf	Represent lists as actual lists inside Settings (#26878 ) Today we represent each value of a list setting with its own dedicated key that ends with the index of the value in the list. Aside from the obvious weirdness this has several issues especially if lists are massive since it causes massive runtime penalties when validating settings. Like a list of 100k words will literally cause a create index call to timeout and in turn massive slowdown on all subsequent validation runs. With this change we use a simple string list to represent the list. This change also forbids adding a setting that ends with a .0 which was internally used to detect a list setting. Once this has been rolled out for an entire major version all the internal .0 handling can be removed since all settings will be converted. Relates to #26723	2017-10-05 09:27:08 +02:00
Simon Willnauer	aab4655e63	Unify Settings xcontent reading and writing (#26739 ) This change adds a fromXContent method to Settings that allows to read the xcontent that is produced by toXContent. It also replaces the entire settings loader infrastructure and removes the structured map representation. Future PRs will also tackle the `getAsMap` that exposes the internal representation of settings for better encapsulation.	2017-09-25 13:23:01 +02:00
Michael Basnight	f385e0cf26	Add bad_request to the rest-api-spec catch params (#26539 ) This adds another request to the catch params. It also makes sure that the generic request param does not allow 400 either.	2017-09-14 14:24:03 -05:00
Jim Ferenczi	86d97971a4	Remove the _all metadata field (#26356 ) * Remove the _all metadata field This change removes the `_all` metadata field. This field is deprecated in 6 and cannot be activated for indices created in 6 so it can be safely removed in the next major version (e.g. 7).	2017-08-28 17:43:59 +02:00
Adrien Grand	eb782492be	Remove support for lenient booleans. Closes #22298	2017-08-28 09:56:01 +02:00
Martijn van Groningen	1146a35870	Move more token filters to analysis-common module The following token filters were moved: arabic_stem, brazilian_stem, czech_stem, dutch_stem, french_stem, german_stem and russian_stem. Relates to #23658	2017-08-11 17:39:24 +02:00
Martijn van Groningen	ff3b909a83	Moved HtmlStripCharFilterFactory to analysis.common package like the other factories.	2017-07-31 15:34:54 +02:00
Martijn van Groningen	0b776a1de0	Move more token filters to analysis-common module The following token filters were moved: delimited_payload_filter, keep, keep_types, classic, apostrophe, decimal_digit, fingerprint, min_hash and scandinavian_folding. Relates to #23658	2017-07-31 15:15:04 +02:00
Jim Ferenczi	ab3b5c695a	Pre-configured shingle filter should disable graph analysis (#25853 ) This change disables the graph analysis on default `shingle` filter. The pre-configured shingle filter produces shingles of different size. Graph analysis on such token stream is useless and dangerous as it may create too many paths. Fixes #25555	2017-07-24 18:42:15 +02:00

172 Commits