Commit Graph

172 Commits

Author SHA1 Message Date
markharwood e197b6c45b
Analysis enhancement - add preserve_original setting in ngram-token-filter (#55432) (#56100)
Authored-by: Amit Khandelwal <amitmbm87@gmail.com>
2020-05-04 11:31:28 +01:00
Amit Khandelwal 126e4acca8 Expose `preserve_original` in `edge_ngram` token filter (#55766)
The Lucene `preserve_original` setting is currently not supported in the `edge_ngram`
token filter. This change adds it with a default value of `false`.

Closes #55767
2020-04-28 10:24:27 +02:00
Rory Hunter d66af46724
Always use deprecateAndMaybeLog for deprecation warnings (#55319)
Backport of #55115.

Replace calls to deprecate(String,Object...) with deprecateAndMaybeLog(...),
with an appropriate key, so that all messages can potentially be deduplicated.
2020-04-23 09:20:54 +01:00
David Turner 7941f4a47e Add RepositoriesService to createComponents() args (#54814)
Today we pass the `RepositoriesService` to the searchable snapshots plugin
during the initialization of the `RepositoryModule`, forcing the plugin to be a
`RepositoryPlugin` even though it does not implement any repositories.

After discussion we decided it best for now to pass this in via
`Plugin#createComponents` instead, pending some future work in which plugins
can depend on services more dynamically.
2020-04-16 16:27:36 +01:00
Jason Tedor 5fcda57b37
Rename MetaData to Metadata in all of the places (#54519)
This is a simple naming change PR, to fix the fact that "metadata" is a
single English word, and for too long we have not followed general
naming conventions for it. We are also not consistent about it, for
example, METADATA instead of META_DATA if we were trying to be
consistent with MetaData (although METADATA is correct when considered
in the context of "metadata"). This was a simple find and replace across
the code base, only taking a few minutes to fix this naming issue
forever.
2020-03-31 17:24:38 -04:00
Jake Landis db3420d757
[7.x] Optimize which Rest resources are used by the Rest tests… (#53766)
This should help with Gradle's incremental compile such that projects
only depend upon the resources they use.

related #52114
2020-03-19 12:28:59 -05:00
Jay Modi f3f6ff97ee
Single instance of the IndexNameExpressionResolver (#52604)
This commit modifies the codebase so that our production code uses a
single instance of the IndexNameExpressionResolver class. This change
is being made in preparation for allowing name expression resolution
to be augmented by a plugin.

In order to remove some instances of IndexNameExpressionResolver, the
single instance is added as a parameter of Plugin#createComponents and
PersistentTaskPlugin#getPersistentTasksExecutor.

Backport of #52596
2020-02-21 07:50:02 -07:00
Adrien Grand ad9d2f1922
Move analysis/mappings stats to cluster-stats. (#51875)
Closes #51138
2020-02-05 11:02:25 +01:00
Marios Trivyzas fda25ed04a
Fix caching for PreConfiguredTokenFilter (#50912) (#51091)
The PreConfiguredTokenFilter#singletonWithVersion uses the version
internally for the token filter factories but it registers only one
instance in the cache and not one instance per version. This can lead
to exceptions like the one described in #50734 since the singleton is
created and cached using the version created of the first index
that is processed.

Remove the singletonWithVersion() methods and use the
elasticsearchVersion() methods instead.

Fixes: #50734
(cherry picked from commit 24e1858)
2020-01-16 13:58:02 +01:00
Christoph Büscher 2f13751bad
Deprecate and remove camel-case nGram and edgeNGram tokenizers (#50862) (#50991)
We deprecated and removed the camel-case versions of the nGram and edgeNGram
filters a while ago and we should do the same with the nGram and edgeNGram tokenizers.
This PR deprecates the use of these names in favour of ngram and edge_ngram in
7. Usage will be disallowed on new indices starting with 8 then.
2020-01-14 21:42:34 +01:00
Alan Woodward 4974f56b25 Fix analysis BWC tests - warnings now emitted on index creation 2020-01-14 14:48:40 +00:00
Alan Woodward 8c16725a0d Check for deprecations when analyzers are built (#50908)
Generally speaking, deprecated analysis components in elasticsearch will issue deprecation
warnings when they are first used. However, this means that no warnings are emitted when
indexes are created with deprecated components, and users have to actually index a document
to see warnings. This makes it much harder to see these warnings and act on them at
appropriate times.

This is worse in the case where components throw exceptions on upgrade. In this case, users
will not be aware of a problem until a document is indexed, instead of at index creation time.

This commit adds a new check that pushes an empty string through all user-defined analyzers
and normalizers when an IndexAnalyzers object is built for each index; deprecation warnings
and exceptions are now emitted when indexes are created or opened.

Fixes #42349
2020-01-14 13:52:02 +00:00
Christoph Büscher b1b4282273 Make Multiplexer inherit filter chains analysis mode (#50662)
Currently, if an updateable synonym filter is included in a multiplexer filter,
it is not reloaded via the _reload_search_analyzers because the multiplexer
itself doesn't pass on the analysis mode of the filters it contains, so its not
recognized as "updateable" in itself. Instead we can check and merge the
AnalysisMode settings of all filters in the multiplexer and use the resulting
mode (e.g. search-time only) for the multiplexer itself, thus making any synonym
filters contained in it reloadable.  This, of course, will also make the
analyzers using the multiplexer be usable at search-time only.

Closes #50554
2020-01-08 22:12:01 +01:00
Christoph Büscher 6258d25458
Log deprecation for nGram and edgeNGram custom filters (#50376) (#50445)
The camel-case `nGram` and `edgeNGram` filter names were deprecated in 6. We
currently throw errors on new indices when they are used. However these errors
are currently only thrown for pre-configured filters, adding them as custom
filters doesn't trigger the warning and error. This change adds the appropriate
deprecation warnings for `nGram` and `edgeNGram` respectively on version 7
indices.

Relates #50360
2019-12-20 22:00:08 +01:00
Stuart Tettemer 689df1f28f
Scripting: ScriptFactory not required by compile (#50344) (#50392)
Avoid backwards incompatible changes for 8.x and 7.6 by removing type
restriction on compile and Factory.  Factories may optionally implement
ScriptFactory.  If so, then they can indicate determinism and thus
cacheability.

**Backport**

Relates: #49466
2019-12-19 12:50:25 -07:00
Stuart Tettemer 17cda5b2c0
Scripting: Groundwork for caching script results (#49895) (#49944)
In order to cache script results in the query shard cache, we need to
check if scripts are deterministic.  This change adds a default method
to the script factories, `isResultDeterministic() -> false` which is
used by the `QueryShardContext`.

Script results were never cached and that does not change here.  Future
changes will implement this method based on whether the results of the
scripts are deterministic or not and therefore cacheable.

Refs: #49466

**Backport**
2019-12-06 15:08:05 -07:00
Christoph Büscher 4ffa050735 Allow custom characters in token_chars of ngram tokenizers (#49250)
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to inlcude into a token.

Closes #25894
2019-11-20 10:37:12 +01:00
gpaimla 7d20b50f45 Implement Lucene EstonianAnalyzer, Stemmer (#49149)
This PR adds a new analyzer and stemmer for the Estonian language.

Closes #48895
2019-11-18 17:24:21 +01:00
Rory Hunter c46a0e8708
Apply 2-space indent to all gradle scripts (#49071)
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
2019-11-14 11:01:23 +00:00
Rory Hunter 3c77c50f5f
Improve resiliency to auto-formatting in libs, modules (#48619)
Backport of #48448. Make a number of changes so that code in the libs and
modules directories are more resilient to automatic formatting. This covers:

* Remove string concatenation where JSON fits on a single line
* Move some comments around to they aren't auto-formatted to a strange
  place
2019-10-29 10:39:34 +00:00
Alan Woodward 697c693ee7 Reset Token position on reuse in scripted analysis (#47424)
Most of the information in AnalysisPredicateScript.Token is pulled directly
from its underlying AttributeSource, but we also keep track of the token position,
and this state is held directly on the Token. This information needs to be reset when
the containing ScriptFilteringTokenFilter or ScriptedConditionTokenFilter is re-used.

Fixes #47197
2019-10-02 11:27:04 +01:00
Christoph Büscher 3366726ad1 Enable reloading of synonym_graph filters (#45135)
Reloading of synonym_graph filter doesn't work currently because the search time
AnalysisMode doesn't get propagated to the TokenFilterFactory emitted by the
graph filters getChainAwareTokenFilterFactory() method. This change fixes that.

Closes #45127
2019-08-02 15:33:42 +02:00
Alan Woodward b6a0f098e6 Don't use index_phrases on graph queries (#44340)
Due to https://issues.apache.org/jira/browse/LUCENE-8916, when you
try to use a synonym filter with the index_phrases option on a text field,
you can end up with null values in a Phrase query, leading to weird
exceptions further down the querying chain. As a workaround, this commit
disables the index_phrases optimization for queries that produce token
graphs.

Fixes #43976
2019-07-17 16:46:00 +01:00
Alan Woodward 4b99255fed Add name() method to TokenizerFactory (#43909)
This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory,
and removes the need to pass around tokenizer names when building custom analyzers.

As this means that TokenizerFactory is no longer a functional interface, the commit also
adds a factory method to TokenizerFactory to make construction simpler.
2019-07-04 11:28:55 +01:00
Christoph Büscher 2cc7f5a744
Allow reloading of search time analyzers (#43313)
Currently changing resources (like dictionaries, synonym files etc...) of search
time analyzers is only possible by closing an index, changing the underlying
resource (e.g. synonym files) and then re-opening the index for the change to
take effect.

This PR adds a new API endpoint that allows triggering reloading of certain
analysis resources (currently token filters) that will then pick up changes in
underlying file resources. To achieve this we introduce a new type of custom
analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows
swapping out analysis components. Custom analyzers that contain filters that are
markes as "updateable" will automatically choose this implementation. This PR
also adds this capability to `synonym` token filters for use in search time
analyzers.

Relates to #29051
2019-06-28 09:55:40 +02:00
Alan Woodward 51b230f6ab
Fix PreConfiguredTokenFilters getSynonymFilter() implementations (#38839) (#43678)
When we added support for TokenFilterFactories to specialise how they were used when parsing
synonym files, PreConfiguredTokenFilters were set up to either apply themselves, or be ignored.
This behaviour is a leftover from an earlier iteration, and also has an incorrect default.

This commit makes preconfigured token filters usable in synonym file parsing by default, and brings
those filters that should not be used into line with index-specific filter factories; in indexes created
before version 7 we emit a deprecation warning, and we throw an error in indexes created after.

Fixes #38793
2019-06-28 08:19:00 +01:00
Alan Woodward 4882b932d8
Issue deprecation warnings when preconfigured delimited_payload_filter is used (#43684)
#26625 deprecated delimited_payload_filter and added tests to check
that warnings would be emitted when both a normal and pre-configured
filter were used. Unfortunately, due to a bug in the Analyze API, the pre-
configured filter check was never actually triggered, and it turns out that
the deprecation warning was not in fact being emitted in this case.

#43568 fixed the Analyze API bug, which then surfaced this on backport.

This commit ensures that the preconfigured filter also emits the warnings
and triggers an error if a new index tries to use a preconfigured
delimited_payload_filter
2019-06-27 12:44:29 +01:00
Alan Woodward 8ff5519b11 Use preconfigured filters correctly in Analyze API (#43568)
When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try and build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example here is the elision filter, which
has a pre-configured version built with the french elision set; when used as part
of normal analysis, this preconfigured set is used, but when used as part of the
Analyze API we end up with NPEs because it tries to instantiate the filter with
no index settings.

This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.

It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram`
tokenizer to make their settings consistent with the defaults used when creating
them with no settings

Closes #43002
Closes #43621
Closes #43582
2019-06-27 09:07:01 +01:00
Alan Woodward 05a7333eca Require [articles] setting in elision filter (#43083)
We should throw an exception at construction time if a list of
articles is not provided, otherwise we can get random NPEs during
indexing.

Relates to #43002
2019-06-27 09:02:36 +01:00
Marios Trivyzas ce30afcd01
Deprecate CommonTermsQuery and cutoff_frequency (#42619) (#42691)
Since the max_score optimization landed in Elasticsearch 7,
the CommonTermsQuery is redundant and slower. Moreover the
cutoff_frequency parameter for MatchQuery and MultiMatchQuery
is redundant.

Relates to #27096

(cherry picked from commit 04b74497314eeec076753a33b3b6cc11549646e8)
2019-05-30 18:04:47 +02:00
Jim Ferenczi 4ca5649a0d Upgrade to lucene 8.1.0-snapshot-e460356abe (#40952) 2019-05-23 11:45:33 +02:00
Jason Tedor 7f3ab4524f
Bump 7.x branch to version 7.2.0
This commit adds the 7.2.0 version constant to the 7.x branch, and bumps
BWC logic accordingly.
2019-05-01 13:38:57 -04:00
Alpar Torok 25944c4317 convert modules to use testclusters (#40804)
* convert modules to use testclusters
* Eliminate PluginPropertiesTask and move logic in plugin where it belongs
2019-04-04 11:45:40 +03:00
Alan Woodward 4296ff2fd1 Test that no-index synonyms can be used with the Analyze API (#40781)
Relates to #23943
2019-04-04 09:03:51 +01:00
Alan Woodward 83d2870308 Add `use_field` option to intervals query (#40157)
This is the equivalent of the `field_masking_span` query, allowing users to
merge intervals from multiple fields - for example, to search for stemmed tokens
near unstemmed tokens.
2019-03-20 16:26:04 +00:00
Christoph Büscher 4b77d0434a Remove `nGram` and `edgeNGram` token filter names (#39070)
In #30209 we deprecated the camel case `nGram` filter name in favour of `ngram` and
did the same for `edgeNGram` and `edge_ngram` and we are removing those names in
8.0. This change disallows using the deprecated names for new indices created in 7.0 by
throwing an error if these filters are used.

Relates to #38911
2019-02-21 16:55:40 +01:00
Julie Tibshirani c2e9d13ebd
Default include_type_name to false in the yml test harness. (#38058)
This PR removes the temporary change we made to the yml test harness in #37285
to automatically set `include_type_name` to `true` in index creation requests
if it's not already specified. This is possible now that the vast majority of
index creation requests were updated to be typeless in #37611. A few additional
tests also needed updating here.

Additionally, this PR updates the test harness to set `include_type_name` to
`false` in index creation requests when communicating with 6.x nodes. This
mirrors the logic added in #37611 to allow for typeless document write requests
in test set-up code. With this update in place, we can remove many references
to `include_type_name: false` from the yml tests.
2019-02-01 11:44:13 -08:00
Colin Goodheart-Smithe 21e392e95e
Removes typed calls from YAML REST tests (#37611)
This PR attempts to remove all typed calls from our YAML REST tests. The PR adds include_type_name: false to create index requests that use a mapping and also to put mapping requests. It also removes _type from index requests where they haven't already been removed. The PR ignores tests named *_with_types.yml since this are specifically testing typed API behaviour.

The change also includes changing the test harness to add the type _doc to index, update, get and bulk requests that do not specify the document type when the test is running against a mixed 7.x/6.x cluster.
2019-01-30 16:32:58 +00:00
Christoph Büscher b4b4cd6ebd
Clean codebase from empty statements (#37822)
* Remove empty statements

There are a couple of instances of undocumented empty statements all across the
code base. While they are mostly harmless, they make the code hard to read and
are potentially error-prone. Removing most of these instances and marking blocks
that look empty by intention as such.

* Change test, slightly more verbose but less confusing
2019-01-25 14:23:02 +01:00
Jun Ohtani 38b698d455
[Analysis] Deprecate Standard Html Strip Analyzer in master (#26719)
* [Analysis] Deprecate Standard Html Strip Analyzer

Deprecate only Standard Html Strip Analyzer
If user create index with the analyzer since 7.0, es throws an exception.
If an index was created before 7.0, es issue deprecation log
We will remove it in 8.0

Related #4704
2019-01-09 12:42:00 +09:00
Alan Woodward af57575838
Allow word_delimiter_graph_filter to not adjust internal offsets (#36699)
This commit adds an adjust_offsets parameter to the word_delimiter_graph token filter, defaulting
to true. Most of the time you'd want sub-tokens emitted by this filter to have offsets that are
adjusted to their real position in the token stream; however, some token filters can change the 
length or starting position of a token (eg trim) without changing their offset attributes, and this 
can lead to word_delimiter_graph emitting illegal offsets. Setting adjust_offsets to false in these 
cases will allow indexing again.

Fixes #34741, #33710
2018-12-18 13:20:51 +00:00
Julie Tibshirani 87831051dc
Deprecate types in explain requests. (#35611)
The following updates were made:
- Add a new untyped endpoint `{index}/_explain/{id}`.
- Add deprecation warnings to Rest*Action, plus tests in Rest*ActionTests.
- For each REST yml test, make sure there is one version without types, and another legacy version that retains types (called *_with_types.yml).
- Deprecate relevant methods on the Java HLRC requests/ responses.
- Update documentation (for both the REST API and Java HLRC).
2018-12-10 19:45:13 -08:00
Jim Ferenczi 18866c4c0b
Make hits.total an object in the search response (#35849)
This commit changes the format of the `hits.total` in the search response to be an object with
a `value` and a `relation`. The `value` indicates the number of hits that match the query and the
`relation` indicates whether the number is accurate (in which case the relation is equals to `eq`)
or a lower bound of the total (in which case it is equals to `gte`).
This change also adds a parameter called `rest_total_hits_as_int` that can be used in the
search APIs to opt out from this change (retrieve the total hits as a number in the rest response).
Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain
`hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a
follow up (to allow numbers to be passed to `track_total_hits`).

Relates #33028
2018-12-05 19:49:06 +01:00
Alan Woodward 73ceaad03a
Update to lucene-8.0.0-snapshot-c78429a554 (#36212)
Includes:

* A fix for a bug in Intervals.or() (https://issues.apache.org/jira/browse/LUCENE-8586)
* The ability to disable offset mangling in WordDelimiterGraphFilter
        (https://issues.apache.org/jira/browse/LUCENE-8509)
* BM25Similarity no longer multiplies scores by k1 + 1
2018-12-05 12:43:56 +00:00
Alan Woodward a646f85a99
Ensure TokenFilters only produce single tokens when parsing synonyms (#34331)
A number of tokenfilters can produce multiple tokens at the same position.  This
is a problem when using token chains to parse synonym files, as the SynonymMap
requires that there are no stacked tokens in its input.

This commit ensures that when used to parse synonyms, these tokenfilters either produce
a single version of their input token, or that they throw an error when mappings are 
generated.  In indexes created in elasticsearch 6.x deprecation warnings are emitted in place 
of the error. 

* asciifolding and cjk_bigram produce only the folded or bigrammed token
* decompounders, synonyms and keyword_repeat are skipped
* n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors

Fixes #34298
2018-11-29 10:35:38 +00:00
Jim Ferenczi e37a0ef844
Upgrade to lucene-8.0.0-snapshot-67cdd21996 (#35816) 2018-11-22 15:42:59 +01:00
Alpar Torok 8a85b2eada
Remove build qualifier from server's Version (#35172)
With this change, `Version` no longer carries information about the qualifier,
we still need a way to show the "display version" that does have both
qualifier and snapshot. This is now stored  by the build and red from `META-INF`.
2018-11-07 14:01:05 +02:00
Nick Knize a5e1f4d3a2 Upgrade to lucene-8.0.0-snapshot-31d7dfe6b1 (#35224) 2018-11-06 11:55:23 +01:00
Pratik Sanglikar f1135ef0ce Core: Replace deprecated Loggers calls with LogManager. (#34691)
Replace deprecated Loggers calls with LogManager.

Relates to #32174
2018-10-29 15:52:30 -04:00
Tal Levy e1fdd00420
Lowercase static final DeprecationLogger instance names (#34887)
After discussing on the team's FixItFriday, we concluded that
static final instance variables that are mutable should be lowercased.

Historically, DeprecationLogger was uppercased more frequently than lowercased.
2018-10-25 21:12:19 -07:00
Alpar Torok 59536966c2
Add a new "contains" feature (#34738)
The contains syntax was added in #30874 but the skips were not properly
put in place.
The java runner has the feature so the tests will run as part of the
build, but language clients will be able to support it at their own
pace.
2018-10-25 08:50:50 +03:00
Christoph Büscher c1c447a4cf
Check stemmer language setting early (#34601)
Currently the StemmerTokenFilterFactory checks the validity of the language
setting only when the first TokenStream is processed. Instead we should throw an
error earlier at mapping creation time. This change adds a check to the
StemmerTokenFilterFactory constructor that checks for a valid `language` setting
by trying to create a new TokenStream from an empty input stream. This will
throw errors about wrong language settings early on.

Closes #34170
2018-10-19 12:59:23 +02:00
Alan Woodward d25e3ef065 [CI] Use a single shard for synonym REST tests, to ensure score ordering
is stable
2018-10-01 09:47:22 +01:00
Alan Woodward f243d75f59
Remove special-casing of Synonym filters in AnalysisRegistry (#34034)
The synonym filters no longer need access to the AnalysisRegistry in their
constructors, so we can remove the special-case code and move them to the
common analysis module.

This commit means that synonyms are no longer available for `server` integration tests,
so several of these are either rewritten or migrated to the common analysis module
as rest-spec-api tests
2018-09-28 09:02:47 +01:00
Christoph Büscher ba3ceeaccf
Clean up "unused variable" warnings (#31876)
This change cleans up "unused variable" warnings. There are several cases were we 
most likely want to suppress the warnings (especially in the client documentation test
where the snippets contain many unused variables). In a lot of cases the unused
variables can just be deleted though.
2018-09-26 14:09:32 +02:00
Alan Woodward b33c18d316
Move SoraniNormalizationFilterFactory to the common analysis plugin (#33892)
Follow up to #25715
2018-09-20 17:31:41 +01:00
Alan Woodward 5107949402
Allow TokenFilterFactories to rewrite themselves against their preceding chain (#33702)
We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to 
know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This
special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional
filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they
appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

* `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding
  filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and
  `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`.
* `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym
  list `Analyzer`. By default it returns `true`.

Fixes #33609
2018-09-19 15:52:14 +01:00
Alan Woodward f598297f55
Add predicate_token_filter (#33431)
This allows users to filter out tokens from a TokenStream using painless scripts, 
instead of having to write specialised Java code and packaging it up into a plugin.

The commit also refactors the AnalysisPredicateScript.Token class so that it wraps
and makes read-only an AttributeSource.
2018-09-11 09:16:39 +01:00
Jim Ferenczi 7ad71f906a
Upgrade to a Lucene 8 snapshot (#33310)
The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly.
Some comments about the change:
* Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309

Closes #32899
2018-09-06 14:42:06 +02:00
Alan Woodward e134f9b5f3
Fix generics in ScriptPlugin#getContexts() (#33426)
Changes the return value from List<ScriptContext> to List<ScriptContext<?>> to remove raw-types warnings.
2018-09-06 09:04:22 +01:00
Alan Woodward 636442700c
Add conditional token filter to elasticsearch (#31958)
This allows tokenfilters to be applied selectively, depending on the status of the current token in the tokenstream.  The filter takes a scripted predicate, and only applies its subfilter when the predicate returns true.
2018-09-05 14:52:43 +01:00
Jim Ferenczi f4e9729d64
Remove unsupported Version.V_5_* (#32937)
This change removes the es 5x version constants and their usages.
2018-08-24 09:51:21 +02:00
Alan Woodward cfb30144c9
Call setReferences() on custom referring tokenfilters in _analyze (#32157)
When building custom tokenfilters without an index in the _analyze endpoint,
we need to ensure that referring filters are correctly built by calling
their #setReferences() method

Fixes #32154
2018-07-18 14:43:20 +01:00
Armin Braun ed3b44fb4c
Handle TokenizerFactory TODOs (#32063)
* Don't replace Replace TokenizerFactory with Supplier, this approach was rejected in #32063 
* Remove unused parameter from constructor
2018-07-17 14:14:02 +02:00
Christoph Büscher 61486680a2
Add exclusion option to `keep_types` token filter (#32012)
Currently the `keep_types` token filter includes all token types specified using
its `types` parameter. Lucenes TypeTokenFilter also provides a second mode where
instead of keeping the specified tokens (include) they are filtered out
(exclude). This change exposes this option as a new `mode` parameter that can
either take the values `include` (the default, if not specified) or `exclude`.

Closes #29277
2018-07-17 09:04:41 +02:00
Alan Woodward a01e26a39b
Correct spelling of AnalysisPlugin#requriesAnalysisSettings (#32025)
Because this is a static method on a public API, and one that we encourage
plugin authors to use, the method with the typo is deprecated in 6.x
rather than just renamed.
2018-07-13 13:13:21 +01:00
Alpar Torok 08b8d11e30
Add support for switching distribution for all integration tests (#30874)
* remove left-over comment

* make sure of the property for plugins

* skip installing modules if these exist in the distribution

* Log the distrbution being ran

* Don't allow running with integ-tests-zip passed externally

* top level x-pack/qa can't run with oss distro

* Add support for matching objects in lists

Makes it possible to have a key that points to a list and assert that a
certain object is present in the list. All keys have to be present and
values have to match. The objects in the source list may have additional
fields.

example:
```
  match:  { 'nodes.$master.plugins': { name: ingest-attachment }  }
```

* Update plugin and module tests to work with other distributions

Some of the tests expected that the integration tests will always be ran
with  the `integ-test-zip` distribution so that there will be no other
plugins loaded.

With this change, we check for the presence of the plugin without
assuming exclusivity.

* Allow modules to run on other distros as well

To match the behavior of tets.distributions

* Add and use a new `contains` assertion

Replaces the  previus changes that caused `match` to do a partial match.

* Implement PR review comments
2018-06-26 06:49:03 -07:00
Alan Woodward 5683bc60a6
Multiplexing token filter (#31208)
The `multiplexer` filter emits multiple tokens at the same position, each 
version of the token haivng been passed through a different filter chain.
Identical tokens at the same position are removed.

This allows users to, for example, index lowercase and original-case tokens,
or stemmed and unstemmed versions, in the same field, so that they can search
for a stemmed term within x positions of an unstemmed term.
2018-06-20 10:16:26 +01:00
Martijn van Groningen 47095357bc
Move language analyzers from server to analysis-common module. (#31300)
The following analyzers were moved from server module to analysis-common module:
`greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `latvian`,
`lithuanian`, `norwegian`, `persian`, `portuguese`, `romanian`, `russian`,
`sorani`, `spanish`, `swedish`, `turkish` and `thai`.

Relates to #23658
2018-06-18 11:24:43 +02:00
Alan Woodward 8c0ec05a12
Expose lucene's RemoveDuplicatesTokenFilter (#31275) 2018-06-18 09:46:12 +01:00
Martijn van Groningen 16d593b22f
Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202)
instead of lamda that creates the analyzer
2018-06-13 07:25:19 +02:00
Tanguy Leroux bf58660482
Remove all unused imports and fix CRLF (#31207)
The X-Pack opening and the recent other refactorings left a lot of 
unused imports in the codebase. This commit removes them all.
2018-06-11 15:12:12 +02:00
Martijn van Groningen 07a57cc131
Move number of language analyzers to analysis-common module (#31143)
The following analyzers were moved from server module to analysis-common module:
`snowball`, `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`,
`catalan`, `chinese`, `cjk`, `czech`, `danish`, `dutch`, `english`, `finnish`,
`french`, `galician` and `german`.

Relates to #23658
2018-06-08 08:58:46 +02:00
Martijn van Groningen 735d0e671a
Make PreBuiltAnalyzerProviderFactory plugable via AnalysisPlugin and
move `finger_print`, `pattern` and `standard_html_strip` analyzers
to analysis-common module. (both AnalysisProvider and PreBuiltAnalyzerProvider)

Changed PreBuiltAnalyzerProviderFactory to extend from PreConfiguredAnalysisComponent and
changed to make sure that predefined analyzers are always instantiated with the current
ES version and if an instance is requested for a different version then delegate to PreBuiltCache.
This is similar to the behaviour that exists today in AnalysisRegistry.PreBuiltAnalysis and
PreBuiltAnalyzerProviderFactory. (#31095)

Relates to #23658
2018-06-06 07:40:21 +02:00
Martijn van Groningen 544822c78b
Moved keyword tokenizer to analysis-common module (#30642)
Relates to #23658
2018-05-29 19:22:28 +02:00
Itamar Syn-Hershko 5f172b6795 [Feature] Adding a char_group tokenizer (#24186)
=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters
a
character which is in a defined set. It is mostly useful for cases where
a simple
custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>>
is not acceptable.

=== Configuration

The `char_group` tokenizer accepts one parameter:

`tokenize_on_chars`::
    A string containing a list of characters to tokenize the string on.
Whenever a character
    from this list is encountered, a new token is started. Also supports
escaped values like `\\n` and `\\f`,
    and in addition `\\s` to represent whitespace, `\\d` to represent
digits and `\\w` to represent letters.
    Defaults to an empty list.

=== Example output

```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```

When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the
above sentence would produce the following terms:

```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone,
for, $2 ]```
2018-05-22 16:26:31 +02:00
Christoph Büscher b6340658f4
Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
The camel case name `nGram` should be removed in favour of `ngram` and
similar for `edgeNGram` and `edge_ngram`. Before removal, we need to
deprecate the camel case names first. This change adds deprecation
warnings for indices with versions 6.4.0 and higher and logs deprecation
warnings.
2018-05-17 12:52:22 +02:00
Martijn van Groningen 7b95470897
Moved tokenizers to analysis common module (#30538)
The following tokenizers were moved: classic, edge_ngram,
letter, lowercase, ngram, path_hierarchy, pattern, thai, uax_url_email and
whitespace.

Left keyword tokenizer factory in server module, because
normalizers directly depend on it.This should be addressed on a
follow up change.

Relates to #23658
2018-05-14 07:55:01 +02:00
Jim Ferenczi dbd857341f
Upgrade to 7.4.0-snapshot-1ed95c097b (#30357)
Upgrade to lucene-7.4.0-snapshot-1ed95c097b

This version contains:
* An Analyzer for Korean
* An IntervalQuery and IntervalsSource that retrieve minimum intervals of positional queries.
* A new API to retrieve matches (offsets and positions) of a query for a single document.
* Support for soft deletes in the index writer.
* A fixed shingle filter that handles index time synonyms.
* Support for emoji sequence in ICUTokenizer (with an upgrade to icu 61.1)
2018-05-04 11:44:22 +02:00
Adrien Grand 231a63fdf8
Remove useless version checks in REST tests. (#30165)
Many tests are added with a version check so that they do not run against a
version that doesn't have the feature yet. Master is 7.0, so all tests that
do not run against 6.0+ can be removed and the version check can be removed
on all tests that always run on 6.0+.
2018-05-02 11:34:15 +02:00
Christoph Büscher 24763d881e
Deprecate use of `htmlStrip` as name for HtmlStripCharFilter (#27429)
The camel case name `htmlStip` should be removed in favour of `html_strip`, but
we need to deprecate it first. This change adds deprecation warnings for indices 
with version starting with 6.3.0 and logs deprecation warnings in this cases.
2018-04-19 16:48:17 +02:00
Christoph Büscher 231fd4eb18
Remove `delimited_payload_filter` (#27705)
From 7.0 on, using `delimited_payload_filter` should throw an error. 
It was deprecated in 6.2 in favour of `delimited_payload` (#26625).

Relates to #27704
2018-04-05 18:41:04 +02:00
Adrien Grand 3bdfc8f3fb
Upgrade to lucene-7.3.0-snapshot-98a6b3d. (#29298)
Most notable changes include:
 - this release doesn't have the 7.2.1 version constant so I had to create one
 - spatial4j and jts were upgraded
2018-04-03 09:27:14 +02:00
Alan Woodward af3f63616b
Allow TrimFilter to be used in custom normalizers (#27758)
AnalysisFactoryTestCase checks that the ES custom token filter multi-term
awareness matches the underlying lucene factory.  For the trim filter this
won't be the case until LUCENE-8093 is released in 7.3, so we add a
temporary exclusion

Closes #27310
2017-12-18 14:27:03 +00:00
Christoph Büscher c4fe7d3f72 [Docs] add deprecation warning for `delimited_payload_filter` renaming 2017-12-04 10:22:05 +01:00
kel 4885acb048 Replace `delimited_payload_filter` by `delimited_payload` (#26625)
The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is 
deprecated and should be replaced by `delimited_payload`.

Closes #21978
2017-11-24 13:03:19 +01:00
Mayya Sharipova 148376c2c5
Add limits for ngram and shingle settings (#27211)
* Add limits for ngram and shingle settings (#27211)

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes #25887
2017-11-07 08:14:55 -05:00
David Roberts 749c3ec716
Remove the single argument Environment constructor (#27235)
Only tests should use the single argument Environment constructor.  To
enforce this the single arg Environment constructor has been replaced with
a test framework factory method.

Production code (beyond initial Bootstrap) should always use the same
Environment object that Node.getEnvironment() returns.  This Environment
is also available via dependency injection.
2017-11-04 13:25:09 +00:00
Simon Willnauer cdd7c1e6c2 Return List instead of an array from settings (#26903)
Today we return a `String[]` that requires copying values for every
access. Yet, we already store the setting as a list so we can also directly
return the unmodifiable list directly. This makes list / array access in settings
a much cheaper operation especially if lists are large.
2017-10-09 09:52:08 +02:00
Martijn van Groningen b27e408ed2
Removed void token filter entries and added two tests 2017-10-05 13:25:05 +02:00
Md. Abdulla-Al-Sun a40c474e10
Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238) 2017-10-05 13:25:05 +02:00
Simon Willnauer 00dfdf50cf Represent lists as actual lists inside Settings (#26878)
Today we represent each value of a list setting with it's own dedicated key
that ends with the index of the value in the list. Aside of the obvious
weirdness this has several issues especially if lists are massive since it
causes massive runtime penalties when validating settings. Like a list of 100k
words will literally cause a create index call to timeout and in-turn massive
slowdown on all subsequent validations runs.

With this change we use a simple string list to represent the list. This change
also forbids to add a settings that ends with a .0 which was internally used to
detect a list setting.  Once this has been rolled out for an entire major
version all the internal .0 handling can be removed since all settings will be
converted.

Relates to #26723
2017-10-05 09:27:08 +02:00
Simon Willnauer aab4655e63 Unify Settings xcontent reading and writing (#26739)
This change adds a fromXContent method to Settings that allows to read
the xcontent that is produced by toXContent. It also replaces the entire settings
loader infrastructure and removes the structured map representation. Future PRs will
also tackle the `getAsMap` that exposes the internal represenation of settings for
better encapsulation.
2017-09-25 13:23:01 +02:00
Michael Basnight f385e0cf26 Add bad_request to the rest-api-spec catch params (#26539)
This adds another request to the catch params. It also makes sure that
the generic request param does not allow 400 either.
2017-09-14 14:24:03 -05:00
Jim Ferenczi 86d97971a4 Remove the _all metadata field (#26356)
* Remove the _all metadata field

This change removes the `_all` metadata field. This field is deprecated in 6
and cannot be activated for indices created in 6 so it can be safely removed in
the next major version (e.g. 7).
2017-08-28 17:43:59 +02:00
Adrien Grand eb782492be Remove support for lenient booleans.
Closes #22298
2017-08-28 09:56:01 +02:00
Martijn van Groningen 1146a35870
Move more token filters to analysis-common module
The following token filters were moved: arabic_stem, brazilian_stem, czech_stem, dutch_stem, french_stem, german_stem and russian_stem.

Relates to #23658
2017-08-11 17:39:24 +02:00
Martijn van Groningen ff3b909a83
Moved HtmlStripCharFilterFactory to analyis.common package like the other factories. 2017-07-31 15:34:54 +02:00
Martijn van Groningen 0b776a1de0
Move more token filters to analysis-common module
The following token filters were moved: delimited_payload_filter, keep, keep_types, classic, apostrophe, decimal_digit, fingerprint, min_hash and scandinavian_folding.

Relates to #23658
2017-07-31 15:15:04 +02:00
Jim Ferenczi ab3b5c695a Pre-configured shingle filter should disable graph analysis (#25853)
This change disables the graph analysis on default `shingle` filter.
The pre-configured shingle filter produces shingles of different size.
Graph analysis on such token stream is useless and dangerous as it may create too many paths.

Fixes #25555
2017-07-24 18:42:15 +02:00