Commit Graph

103 Commits

Author SHA1 Message Date
Alan Woodward 8c0ec05a12
Expose lucene's RemoveDuplicatesTokenFilter (#31275) 2018-06-18 09:46:12 +01:00
Martijn van Groningen 16d593b22f
Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202)
instead of lamda that creates the analyzer
2018-06-13 07:25:19 +02:00
Tanguy Leroux bf58660482
Remove all unused imports and fix CRLF (#31207)
The X-Pack opening and the recent other refactorings left a lot of 
unused imports in the codebase. This commit removes them all.
2018-06-11 15:12:12 +02:00
Martijn van Groningen 07a57cc131
Move number of language analyzers to analysis-common module (#31143)
The following analyzers were moved from server module to analysis-common module:
`snowball`, `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`,
`catalan`, `chinese`, `cjk`, `czech`, `danish`, `dutch`, `english`, `finnish`,
`french`, `galician` and `german`.

Relates to #23658
2018-06-08 08:58:46 +02:00
Martijn van Groningen 735d0e671a
Make PreBuiltAnalyzerProviderFactory plugable via AnalysisPlugin and
move `finger_print`, `pattern` and `standard_html_strip` analyzers
to analysis-common module. (both AnalysisProvider and PreBuiltAnalyzerProvider)

Changed PreBuiltAnalyzerProviderFactory to extend from PreConfiguredAnalysisComponent and
changed to make sure that predefined analyzers are always instantiated with the current
ES version and if an instance is requested for a different version then delegate to PreBuiltCache.
This is similar to the behaviour that exists today in AnalysisRegistry.PreBuiltAnalysis and
PreBuiltAnalyzerProviderFactory. (#31095)

Relates to #23658
2018-06-06 07:40:21 +02:00
Martijn van Groningen 544822c78b
Moved keyword tokenizer to analysis-common module (#30642)
Relates to #23658
2018-05-29 19:22:28 +02:00
Itamar Syn-Hershko 5f172b6795 [Feature] Adding a char_group tokenizer (#24186)
=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters
a
character which is in a defined set. It is mostly useful for cases where
a simple
custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>>
is not acceptable.

=== Configuration

The `char_group` tokenizer accepts one parameter:

`tokenize_on_chars`::
    A string containing a list of characters to tokenize the string on.
Whenever a character
    from this list is encountered, a new token is started. Also supports
escaped values like `\\n` and `\\f`,
    and in addition `\\s` to represent whitespace, `\\d` to represent
digits and `\\w` to represent letters.
    Defaults to an empty list.

=== Example output

```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```

When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the
above sentence would produce the following terms:

```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone,
for, $2 ]```
2018-05-22 16:26:31 +02:00
Christoph Büscher b6340658f4
Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
The camel case name `nGram` should be removed in favour of `ngram` and
similar for `edgeNGram` and `edge_ngram`. Before removal, we need to
deprecate the camel case names first. This change adds deprecation
warnings for indices with versions 6.4.0 and higher and logs deprecation
warnings.
2018-05-17 12:52:22 +02:00
Martijn van Groningen 7b95470897
Moved tokenizers to analysis common module (#30538)
The following tokenizers were moved: classic, edge_ngram,
letter, lowercase, ngram, path_hierarchy, pattern, thai, uax_url_email and
whitespace.

Left keyword tokenizer factory in server module, because
normalizers directly depend on it.This should be addressed on a
follow up change.

Relates to #23658
2018-05-14 07:55:01 +02:00
Jim Ferenczi dbd857341f
Upgrade to 7.4.0-snapshot-1ed95c097b (#30357)
Upgrade to lucene-7.4.0-snapshot-1ed95c097b

This version contains:
* An Analyzer for Korean
* An IntervalQuery and IntervalsSource that retrieve minimum intervals of positional queries.
* A new API to retrieve matches (offsets and positions) of a query for a single document.
* Support for soft deletes in the index writer.
* A fixed shingle filter that handles index time synonyms.
* Support for emoji sequence in ICUTokenizer (with an upgrade to icu 61.1)
2018-05-04 11:44:22 +02:00
Adrien Grand 231a63fdf8
Remove useless version checks in REST tests. (#30165)
Many tests are added with a version check so that they do not run against a
version that doesn't have the feature yet. Master is 7.0, so all tests that
do not run against 6.0+ can be removed and the version check can be removed
on all tests that always run on 6.0+.
2018-05-02 11:34:15 +02:00
Christoph Büscher 24763d881e
Deprecate use of `htmlStrip` as name for HtmlStripCharFilter (#27429)
The camel case name `htmlStip` should be removed in favour of `html_strip`, but
we need to deprecate it first. This change adds deprecation warnings for indices 
with version starting with 6.3.0 and logs deprecation warnings in this cases.
2018-04-19 16:48:17 +02:00
Christoph Büscher 231fd4eb18
Remove `delimited_payload_filter` (#27705)
From 7.0 on, using `delimited_payload_filter` should throw an error. 
It was deprecated in 6.2 in favour of `delimited_payload` (#26625).

Relates to #27704
2018-04-05 18:41:04 +02:00
Adrien Grand 3bdfc8f3fb
Upgrade to lucene-7.3.0-snapshot-98a6b3d. (#29298)
Most notable changes include:
 - this release doesn't have the 7.2.1 version constant so I had to create one
 - spatial4j and jts were upgraded
2018-04-03 09:27:14 +02:00
Alan Woodward af3f63616b
Allow TrimFilter to be used in custom normalizers (#27758)
AnalysisFactoryTestCase checks that the ES custom token filter multi-term
awareness matches the underlying lucene factory.  For the trim filter this
won't be the case until LUCENE-8093 is released in 7.3, so we add a
temporary exclusion

Closes #27310
2017-12-18 14:27:03 +00:00
Christoph Büscher c4fe7d3f72 [Docs] add deprecation warning for `delimited_payload_filter` renaming 2017-12-04 10:22:05 +01:00
kel 4885acb048 Replace `delimited_payload_filter` by `delimited_payload` (#26625)
The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is 
deprecated and should be replaced by `delimited_payload`.

Closes #21978
2017-11-24 13:03:19 +01:00
Mayya Sharipova 148376c2c5
Add limits for ngram and shingle settings (#27211)
* Add limits for ngram and shingle settings (#27211)

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes #25887
2017-11-07 08:14:55 -05:00
David Roberts 749c3ec716
Remove the single argument Environment constructor (#27235)
Only tests should use the single argument Environment constructor.  To
enforce this the single arg Environment constructor has been replaced with
a test framework factory method.

Production code (beyond initial Bootstrap) should always use the same
Environment object that Node.getEnvironment() returns.  This Environment
is also available via dependency injection.
2017-11-04 13:25:09 +00:00
Simon Willnauer cdd7c1e6c2 Return List instead of an array from settings (#26903)
Today we return a `String[]` that requires copying values for every
access. Yet, we already store the setting as a list so we can also directly
return the unmodifiable list directly. This makes list / array access in settings
a much cheaper operation especially if lists are large.
2017-10-09 09:52:08 +02:00
Martijn van Groningen b27e408ed2
Removed void token filter entries and added two tests 2017-10-05 13:25:05 +02:00
Md. Abdulla-Al-Sun a40c474e10
Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238) 2017-10-05 13:25:05 +02:00
Simon Willnauer 00dfdf50cf Represent lists as actual lists inside Settings (#26878)
Today we represent each value of a list setting with it's own dedicated key
that ends with the index of the value in the list. Aside of the obvious
weirdness this has several issues especially if lists are massive since it
causes massive runtime penalties when validating settings. Like a list of 100k
words will literally cause a create index call to timeout and in-turn massive
slowdown on all subsequent validations runs.

With this change we use a simple string list to represent the list. This change
also forbids to add a settings that ends with a .0 which was internally used to
detect a list setting.  Once this has been rolled out for an entire major
version all the internal .0 handling can be removed since all settings will be
converted.

Relates to #26723
2017-10-05 09:27:08 +02:00
Simon Willnauer aab4655e63 Unify Settings xcontent reading and writing (#26739)
This change adds a fromXContent method to Settings that allows to read
the xcontent that is produced by toXContent. It also replaces the entire settings
loader infrastructure and removes the structured map representation. Future PRs will
also tackle the `getAsMap` that exposes the internal represenation of settings for
better encapsulation.
2017-09-25 13:23:01 +02:00
Michael Basnight f385e0cf26 Add bad_request to the rest-api-spec catch params (#26539)
This adds another request to the catch params. It also makes sure that
the generic request param does not allow 400 either.
2017-09-14 14:24:03 -05:00
Jim Ferenczi 86d97971a4 Remove the _all metadata field (#26356)
* Remove the _all metadata field

This change removes the `_all` metadata field. This field is deprecated in 6
and cannot be activated for indices created in 6 so it can be safely removed in
the next major version (e.g. 7).
2017-08-28 17:43:59 +02:00
Adrien Grand eb782492be Remove support for lenient booleans.
Closes #22298
2017-08-28 09:56:01 +02:00
Martijn van Groningen 1146a35870
Move more token filters to analysis-common module
The following token filters were moved: arabic_stem, brazilian_stem, czech_stem, dutch_stem, french_stem, german_stem and russian_stem.

Relates to #23658
2017-08-11 17:39:24 +02:00
Martijn van Groningen ff3b909a83
Moved HtmlStripCharFilterFactory to analyis.common package like the other factories. 2017-07-31 15:34:54 +02:00
Martijn van Groningen 0b776a1de0
Move more token filters to analysis-common module
The following token filters were moved: delimited_payload_filter, keep, keep_types, classic, apostrophe, decimal_digit, fingerprint, min_hash and scandinavian_folding.

Relates to #23658
2017-07-31 15:15:04 +02:00
Jim Ferenczi ab3b5c695a Pre-configured shingle filter should disable graph analysis (#25853)
This change disables the graph analysis on default `shingle` filter.
The pre-configured shingle filter produces shingles of different size.
Graph analysis on such token stream is useless and dangerous as it may create too many paths.

Fixes #25555
2017-07-24 18:42:15 +02:00
Martijn van Groningen 8003171a0c
Move more token filters to analysis-common module
The following token filters were moved: arabic_normalization, german_normalization, hindi_normalization, indic_normalization, persian_normalization, scandinavian_normalization, serbian_normalization, sorani_normalization, cjk_width and cjk_width

Relates to #23658
2017-07-17 08:29:44 +02:00
Jim Ferenczi 13da3eb53e Refactor QueryStringQuery for 6.0 (#25646)
This change refactors the query_string query to analyze the query text around logical operators of the query string the same way than a match_query/multi_match_query.
It also adds a type parameter that can be used to change the way multi fields query are built the same way than a multi_match query does.

Now that these queries share the same behavior regarding text analysis, some parameters are obsolete and have been deprecated:

split_on_whitespace: This setting is now ignored with a deprecation notice
if it is used explicitely. With this PR The query_string always splits on logical operator.
It simplifies the understanding of the other parameters that can have different meanings
depending on the value of split_on_whitespace.

auto_generate_phrase_queries: This setting is now ignored with a deprecation notice
if it is used explicitely. This setting only makes sense when the parser splits on whitespace.

use_dismax: This setting is now ignored with a deprecation notice
if it is used explicitely. The tie_breaker parameter is sufficient to handle best_fields/most_fields.

Fixes #25574
2017-07-13 15:32:17 +02:00
Martijn van Groningen 6db708ef75
Move more token filters to analysis-common module
The following token filters were moved: common grams, limit token, pattern capture and pattern raplace.

Relates to #23658
2017-07-07 10:02:52 +02:00
Jun Ohtani 6894ef6057 [Analysis] Support normalizer in request param (#24767)
* [Analysis] Support normalizer in request param

Support normalizer param
Support custom normalizer with char_filter/filter param

Closes #23347
2017-07-04 19:16:56 +09:00
Martijn van Groningen a34f5fa812
Move more token filters to analysis-common module
The following token filters were moved: stemmer, stemmer_override, kstem, dictionary_decompounder, hyphenation_decompounder, reverse, elision and truncate.

Relates to #23658
2017-06-26 09:02:16 +02:00
Jun Ohtani 62d1969595 Parse synonyms with the same analysis chain (#8049)
* [Analysis] Parse synonyms with the same analysis chain

Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain.

Close #7199
2017-06-20 21:50:33 +09:00
Andy Bristol 4c5bd57619 Rename simple pattern tokenizers (#25300)
Changed names to be snake case for consistency

Related to #25159, original issue #23363
2017-06-19 13:48:43 -07:00
Nik Everett ecc87f613f Move pre-configured "keyword" tokenizer to the analysis-common module (#24863)
Moves the keyword tokenizer to the analysis-common module. The keyword tokenizer is special because it is used by CustomNormalizerProvider so I pulled it out into its own PR. To get the move to work I've reworked the lookup from static to one using the AnalysisRegistry. This seems safe enough.

Part of #23658.
2017-06-16 11:48:15 -04:00
Martijn van Groningen 428e70758a
Moved more token filters to analysis-common module.
The following token filters were moved: `edge_ngram`, `ngram`, `uppercase`, `lowercase`, `length`, `flatten_graph` and `unique`.

Relates to #23658
2017-06-15 18:28:31 +02:00
Andy Bristol 48696ab544 expose simple pattern tokenizers (#25159)
Expose the experimental simplepattern and 
simplepatternsplit tokenizers in the common 
analysis plugin. They provide tokenization based 
on regular expressions, using Lucene's 
deterministic regex implementation that is usually 
faster than Java's and has protections against 
creating too-deep stacks during matching.

Both have a not-very-useful default pattern of the 
empty string because all tokenizer factories must 
be able to be instantiated at index creation time. 
They should always be configured by the user 
in practice.
2017-06-13 12:46:59 -07:00
Jim Ferenczi 8250aa4267 Remove the postings highlighter and make unified the default highlighter choice (#25028)
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
2017-06-09 14:09:57 +02:00
Nik Everett 73307a2144 Plugins can register pre-configured char filters (#25000)
Fixes the plumbing so plugins can register char filters and moves
the `html_strip` char filter into analysis-common.

Relates to #23658
2017-06-05 09:25:15 -04:00
Martijn van Groningen 258be2b135
Moved `keyword_marker`, `trim`, `snowball` and `porter_stemmer` tokenfilter factories from core to common-analysis module.
Relates to #23658
2017-05-31 09:34:08 +02:00
Nik Everett b9ea579633 Allow plugins to register pre-configured tokenizers (#24751)
Allows plugins to register pre-configured tokenizers. Much
of the decisions are the same as those in #24223, #24572,
and #24223. This only migrates the lowercase tokenizer but
I figure that is a good start because it proves out the features.
2017-05-19 12:07:04 -04:00
Nik Everett 0189a65e6b Fail rest tests on yaml files (#24740)
We've switched to supporting only `yml` files but anyone who didn't
notice will commit a `yaml` file which won't be executed
which is bad because it is easy not to notice. The test to catch this is
simple enough that I think it is worth adding just to warn folks about
their mistake.
2017-05-17 10:24:57 -04:00
Nik Everett 5fc6f17121 Fix new analysis-common yml tests
These tests are broken because I added them with the `yml` extension
and didn't realize that we weren't running tests with that extension
until we merged #24659. I used that extension in anticipation of #24659
but didn't verify that the tests were actually running. Ooops!

Closes #24734
2017-05-17 07:55:04 -04:00
Ryan Ernst 2a65bed243 Tests: Change rest test extension from .yaml to .yml (#24659)
This commit renames all rest test files to use the .yml extension
instead of .yaml. This way the extension used within all of
elasticsearch for yaml is consistent.
2017-05-16 17:24:35 -07:00
Nik Everett 7ef390068a Move remaining pre-configured token filters into analysis-common (#24716)
Moves the remaining preconfigured token figured into the analysis-common module. There were a couple of tests in core that depended on the pre-configured token filters so I had to touch them:

* `GetTermVectorsCheckDocFreqIT` depended on `type_as_payload` but didn't do anything important with it. I dropped the dependency. Then I moved the test to a single node test case because we're trying to cut down on the number of `ESIntegTestCase` subclasses.
* `AbstractTermVectorsTestCase` and its subclasses depended on `type_as_payload`. I dropped their usage of the token filter and added an integration test for the termvectors API that uses `type_as_payload` to the `analysis-common` module.
* `AnalysisModuleTests` expected a few pre-configured token filtes be registered by default. They aren't any more so I dropped this assertion. We assert that the `CommonAnalysisPlugin` registers these pre-built token filters in `CommonAnalysisFactoryTests`
* `SearchQueryIT` and `SuggestSearchIT` had tests that depended on the specific behavior of the token filters so I moved the tests to integration tests in `analysis-common`.
2017-05-16 13:10:24 -04:00
Nik Everett 65f2717ab7 Make PreConfiguredTokenFilter harder to misuse (#24572)
There are now three public static method to build instances of
PreConfiguredTokenFilter and the ctor is private. I chose static
methods instead of constructors because those allow us to change
out the implementation returned if we so desire.

Relates to #23658
2017-05-10 22:39:43 -04:00
Nik Everett bb06d8ec4f Allow plugins to build pre-configured token filters (#24223)
This changes the way we register pre-configured token filters so that
plugins can declare them and starts to move all of the pre-configured
token filters out of core. It doesn't finish the job because doing
so would make the change unreviewably large. So this PR includes
a shim that keeps the "old" way of registering pre-configured token
filters around.

The Lowercase token filter is special because there is a "special"
interaction between it and the lowercase tokenizer. I'm not sure
exactly what to do about it so for now I'm leaving it alone with
the intent of figuring out what to do with it in a followup.

This also renames these pre-configured token filters from
"pre-built" to "pre-configured" because that seemed like a more
descriptive name.

This is a part of #23658
2017-05-09 14:50:49 -04:00
Nik Everett 7c3efb829b Move char filters into analysis-common (#24261)
Another step down the road to dropping the
lucene-analyzers-common dependency from core.

Note that this removes some tests that no longer compile from
core. I played around with adding them to the analysis-common
module where they would compile but we already test these in
the tests generated from the example usage in the documentation.

I'm not super happy with the way that `requriesAnalysisSettings`
works with regards to plugins. I think it'd be fairly bug-prone
for plugin authors to use. But I'm making it visible as is for
now and I'll rethink later.

A part of #23658
2017-04-26 13:25:34 -04:00
Nik Everett caf376c8af Start building analysis-common module (#23614)
Start moving built in analysis components into the new analysis-common
module. The goal of this project is:
1. Remove core's dependency on lucene-analyzers-common.jar which should
shrink the dependencies for transport client and high level rest client.
2. Prove that analysis plugins can do all the "built in" things by moving all
"built in" behavior to a plugin.
3. Force tests not to depend on any oddball analyzer behavior. If tests
need anything more than the standard analyzer they can use the mock
analyzer provided by Lucene's test infrastructure.
2017-04-19 18:51:34 -04:00