Commit Graph

74 Commits

Author SHA1 Message Date
Martijn van Groningen 0b776a1de0
Move more token filters to analysis-common module
The following token filters were moved: delimited_payload_filter, keep, keep_types, classic, apostrophe, decimal_digit, fingerprint, min_hash and scandinavian_folding.

Relates to #23658
2017-07-31 15:15:04 +02:00
Jim Ferenczi ab3b5c695a Pre-configured shingle filter should disable graph analysis (#25853)
This change disables the graph analysis on default `shingle` filter.
The pre-configured shingle filter produces shingles of different size.
Graph analysis on such token stream is useless and dangerous as it may create too many paths.

Fixes #25555
2017-07-24 18:42:15 +02:00
Martijn van Groningen 8003171a0c
Move more token filters to analysis-common module
The following token filters were moved: arabic_normalization, german_normalization, hindi_normalization, indic_normalization, persian_normalization, scandinavian_normalization, serbian_normalization, sorani_normalization, cjk_width and cjk_width

Relates to #23658
2017-07-17 08:29:44 +02:00
Jim Ferenczi 13da3eb53e Refactor QueryStringQuery for 6.0 (#25646)
This change refactors the query_string query to analyze the query text around logical operators of the query string the same way than a match_query/multi_match_query.
It also adds a type parameter that can be used to change the way multi fields query are built the same way than a multi_match query does.

Now that these queries share the same behavior regarding text analysis, some parameters are obsolete and have been deprecated:

split_on_whitespace: This setting is now ignored with a deprecation notice
if it is used explicitely. With this PR The query_string always splits on logical operator.
It simplifies the understanding of the other parameters that can have different meanings
depending on the value of split_on_whitespace.

auto_generate_phrase_queries: This setting is now ignored with a deprecation notice
if it is used explicitely. This setting only makes sense when the parser splits on whitespace.

use_dismax: This setting is now ignored with a deprecation notice
if it is used explicitely. The tie_breaker parameter is sufficient to handle best_fields/most_fields.

Fixes #25574
2017-07-13 15:32:17 +02:00
Martijn van Groningen 6db708ef75
Move more token filters to analysis-common module
The following token filters were moved: common grams, limit token, pattern capture and pattern raplace.

Relates to #23658
2017-07-07 10:02:52 +02:00
Jun Ohtani 6894ef6057 [Analysis] Support normalizer in request param (#24767)
* [Analysis] Support normalizer in request param

Support normalizer param
Support custom normalizer with char_filter/filter param

Closes #23347
2017-07-04 19:16:56 +09:00
Martijn van Groningen a34f5fa812
Move more token filters to analysis-common module
The following token filters were moved: stemmer, stemmer_override, kstem, dictionary_decompounder, hyphenation_decompounder, reverse, elision and truncate.

Relates to #23658
2017-06-26 09:02:16 +02:00
Jun Ohtani 62d1969595 Parse synonyms with the same analysis chain (#8049)
* [Analysis] Parse synonyms with the same analysis chain

Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain.

Close #7199
2017-06-20 21:50:33 +09:00
Andy Bristol 4c5bd57619 Rename simple pattern tokenizers (#25300)
Changed names to be snake case for consistency

Related to #25159, original issue #23363
2017-06-19 13:48:43 -07:00
Nik Everett ecc87f613f Move pre-configured "keyword" tokenizer to the analysis-common module (#24863)
Moves the keyword tokenizer to the analysis-common module. The keyword tokenizer is special because it is used by CustomNormalizerProvider so I pulled it out into its own PR. To get the move to work I've reworked the lookup from static to one using the AnalysisRegistry. This seems safe enough.

Part of #23658.
2017-06-16 11:48:15 -04:00
Martijn van Groningen 428e70758a
Moved more token filters to analysis-common module.
The following token filters were moved: `edge_ngram`, `ngram`, `uppercase`, `lowercase`, `length`, `flatten_graph` and `unique`.

Relates to #23658
2017-06-15 18:28:31 +02:00
Andy Bristol 48696ab544 expose simple pattern tokenizers (#25159)
Expose the experimental simplepattern and 
simplepatternsplit tokenizers in the common 
analysis plugin. They provide tokenization based 
on regular expressions, using Lucene's 
deterministic regex implementation that is usually 
faster than Java's and has protections against 
creating too-deep stacks during matching.

Both have a not-very-useful default pattern of the 
empty string because all tokenizer factories must 
be able to be instantiated at index creation time. 
They should always be configured by the user 
in practice.
2017-06-13 12:46:59 -07:00
Jim Ferenczi 8250aa4267 Remove the postings highlighter and make unified the default highlighter choice (#25028)
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
2017-06-09 14:09:57 +02:00
Nik Everett 73307a2144 Plugins can register pre-configured char filters (#25000)
Fixes the plumbing so plugins can register char filters and moves
the `html_strip` char filter into analysis-common.

Relates to #23658
2017-06-05 09:25:15 -04:00
Martijn van Groningen 258be2b135
Moved `keyword_marker`, `trim`, `snowball` and `porter_stemmer` tokenfilter factories from core to common-analysis module.
Relates to #23658
2017-05-31 09:34:08 +02:00
Nik Everett b9ea579633 Allow plugins to register pre-configured tokenizers (#24751)
Allows plugins to register pre-configured tokenizers. Much
of the decisions are the same as those in #24223, #24572,
and #24223. This only migrates the lowercase tokenizer but
I figure that is a good start because it proves out the features.
2017-05-19 12:07:04 -04:00
Nik Everett 0189a65e6b Fail rest tests on yaml files (#24740)
We've switched to supporting only `yml` files but anyone who didn't
notice will commit a `yaml` file which won't be executed
which is bad because it is easy not to notice. The test to catch this is
simple enough that I think it is worth adding just to warn folks about
their mistake.
2017-05-17 10:24:57 -04:00
Nik Everett 5fc6f17121 Fix new analysis-common yml tests
These tests are broken because I added them with the `yml` extension
and didn't realize that we weren't running tests with that extension
until we merged #24659. I used that extension in anticipation of #24659
but didn't verify that the tests were actually running. Ooops!

Closes #24734
2017-05-17 07:55:04 -04:00
Ryan Ernst 2a65bed243 Tests: Change rest test extension from .yaml to .yml (#24659)
This commit renames all rest test files to use the .yml extension
instead of .yaml. This way the extension used within all of
elasticsearch for yaml is consistent.
2017-05-16 17:24:35 -07:00
Nik Everett 7ef390068a Move remaining pre-configured token filters into analysis-common (#24716)
Moves the remaining preconfigured token figured into the analysis-common module. There were a couple of tests in core that depended on the pre-configured token filters so I had to touch them:

* `GetTermVectorsCheckDocFreqIT` depended on `type_as_payload` but didn't do anything important with it. I dropped the dependency. Then I moved the test to a single node test case because we're trying to cut down on the number of `ESIntegTestCase` subclasses.
* `AbstractTermVectorsTestCase` and its subclasses depended on `type_as_payload`. I dropped their usage of the token filter and added an integration test for the termvectors API that uses `type_as_payload` to the `analysis-common` module.
* `AnalysisModuleTests` expected a few pre-configured token filtes be registered by default. They aren't any more so I dropped this assertion. We assert that the `CommonAnalysisPlugin` registers these pre-built token filters in `CommonAnalysisFactoryTests`
* `SearchQueryIT` and `SuggestSearchIT` had tests that depended on the specific behavior of the token filters so I moved the tests to integration tests in `analysis-common`.
2017-05-16 13:10:24 -04:00
Nik Everett 65f2717ab7 Make PreConfiguredTokenFilter harder to misuse (#24572)
There are now three public static method to build instances of
PreConfiguredTokenFilter and the ctor is private. I chose static
methods instead of constructors because those allow us to change
out the implementation returned if we so desire.

Relates to #23658
2017-05-10 22:39:43 -04:00
Nik Everett bb06d8ec4f Allow plugins to build pre-configured token filters (#24223)
This changes the way we register pre-configured token filters so that
plugins can declare them and starts to move all of the pre-configured
token filters out of core. It doesn't finish the job because doing
so would make the change unreviewably large. So this PR includes
a shim that keeps the "old" way of registering pre-configured token
filters around.

The Lowercase token filter is special because there is a "special"
interaction between it and the lowercase tokenizer. I'm not sure
exactly what to do about it so for now I'm leaving it alone with
the intent of figuring out what to do with it in a followup.

This also renames these pre-configured token filters from
"pre-built" to "pre-configured" because that seemed like a more
descriptive name.

This is a part of #23658
2017-05-09 14:50:49 -04:00
Nik Everett 7c3efb829b Move char filters into analysis-common (#24261)
Another step down the road to dropping the
lucene-analyzers-common dependency from core.

Note that this removes some tests that no longer compile from
core. I played around with adding them to the analysis-common
module where they would compile but we already test these in
the tests generated from the example usage in the documentation.

I'm not super happy with the way that `requriesAnalysisSettings`
works with regards to plugins. I think it'd be fairly bug-prone
for plugin authors to use. But I'm making it visible as is for
now and I'll rethink later.

A part of #23658
2017-04-26 13:25:34 -04:00
Nik Everett caf376c8af Start building analysis-common module (#23614)
Start moving built in analysis components into the new analysis-common
module. The goal of this project is:
1. Remove core's dependency on lucene-analyzers-common.jar which should
shrink the dependencies for transport client and high level rest client.
2. Prove that analysis plugins can do all the "built in" things by moving all
"built in" behavior to a plugin.
3. Force tests not to depend on any oddball analyzer behavior. If tests
need anything more than the standard analyzer they can use the mock
analyzer provided by Lucene's test infrastructure.
2017-04-19 18:51:34 -04:00