171 Commits

Author SHA1 Message Date
Mayya Sharipova
34e95e5d50 [DOCS] Add supported token filters
Update normalizers.asciidoc with the list of supported token filters

Closes #28605
2018-02-13 14:10:25 -08:00
Jim Ferenczi
7c2bcf3953
Mark synonym_graph as beta in the docs (#28496)
We do want to keep this functionality in the future and we provide support for it.
This change is a first step towards replacing the `synonym` token filter with `synonym_graph`.
2018-02-02 16:33:48 +01:00
deepybee
48c8098e15 Fixed several typos in analyzers section (#28247) 2018-01-18 08:51:53 +00:00
Adrien Grand
1b660821a2
Allow _doc as a type. (#27816)
Allowing `_doc` as a type will enable users to make the transition to 7.0
smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`.
This also moves most of the documentation to `_doc` as a type name.

Closes #27750
Closes #27751
2017-12-14 17:47:53 +01:00
Martijn van Groningen
442c3b8bcf
docs: fix link 2017-12-13 16:51:21 +01:00
Christoph Büscher
c4fe7d3f72 [Docs] add deprecation warning for delimited_payload_filter renaming 2017-12-04 10:22:05 +01:00
kel
4885acb048 Replace delimited_payload_filter by delimited_payload (#26625)
The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is 
deprecated and should be replaced by `delimited_payload`.

Closes #21978
2017-11-24 13:03:19 +01:00
Mayya Sharipova
148376c2c5
Add limits for ngram and shingle settings (#27211)
* Add limits for ngram and shingle settings (#27211)

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes #25887
2017-11-07 08:14:55 -05:00
Md. Abdulla-Al-Sun
a40c474e10
Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238) 2017-10-05 13:25:05 +02:00
markwalkom
dbea83a1d0 [Docs] Update length-tokenfilter.asciidoc (#26849)
Made it clear what the numeric value of `Integer.MAX_VALUE`  is,
2017-10-02 11:01:43 +02:00
olcbean
6952f7b560 Validate top-level keys for create index request (#23755) (#23869)
This commit ensures create index requests do not ignore unknown keys passed to the request.

closes #23755
2017-09-26 09:49:20 -07:00
Christoph Büscher
3827918417 Add configurable maxTokenLength parameter to whitespace tokenizer (#26749)
Other tokenizers like the standard tokenizer allow overriding the default
maximum token length of 255 using the `"max_token_length` parameter. This change
enables using this parameter also with the whitespace tokenizer. The range that
is currently allowed is from 0 to StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT,
which is 1024 * 1024 = 1048576 characters.

Closes #26643
2017-09-25 17:21:19 +02:00
Tahmim Ahmed Shibli
34662c9e6d [Docs] Fix name of character filter in example. (#26724) 2017-09-20 17:08:43 +02:00
Christoph Büscher
254c1b28e9 [Docs] Clarify behaviour of Pattern Capture Token Filter during search (#26278)
There was some confusion about the fact that tokens emitted from a Pattern
Capture Token Filter are treated as synonyms when used to analyze a search
query. This commit adds an explanation to the note in the docs to emphasize this
behaviour.

Closes #25746
2017-08-21 14:56:52 +02:00
Clinton Gormley
ff4a2519f2 Update experimental labels in the docs (#25727)
Relates https://github.com/elastic/elasticsearch/issues/19798

Removed experimental label from:
* Painless
* Diversified Sampler Agg
* Sampler Agg
* Significant Terms Agg
* Terms Agg document count error and execution_hint
* Cardinality Agg precision_threshold
* Pipeline Aggregations
* index.shard.check_on_startup
* index.store.type (added warning)
* Preloading data into the file system cache
* foreach ingest processor
* Field caps API
* Profile API

Added experimental label to:
* Moving Average Agg Prediction


Changed experimental to beta for:
* Adjacency matrix agg
* Normalizers
* Tasks API
* Index sorting

Labelled experimental in Lucene:
* ICU plugin custom rules file
* Flatten graph token filter
* Synonym graph token filter
* Word delimiter graph token filter
* Simple pattern tokenizer
* Simple pattern split tokenizer

Replaced experimental label with warning that details may change in the future:
* Analysis explain output format
* Segments verbose output format
* Percentile Agg compression and HDR Histogram
* Percentile Rank Agg HDR Histogram
2017-07-18 14:06:22 +02:00
Neil Rickards
5189bd14f1 [Docs] Fix typo in pattern-tokenizer.asciidoc (#25626) 2017-07-13 18:43:48 +02:00
Simon Willnauer
e81804cfa4 Add a shard filter search phase to pre-filter shards based on query rewriting (#25658)
Today if we search across a large amount of shards we hit every shard. Yet, it's quite
common to search across an index pattern for time based indices but filtering will exclude
all results outside a certain time range ie. `now-3d`. While the search can potentially hit
hundreds of shards the majority of the shards might yield 0 results since there is not document
that is within this date range. Kibana for instance does this regularly but used `_field_stats`
to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results.

This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards
and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
2017-07-12 22:19:20 +02:00
Jun Ohtani
62d1969595 Parse synonyms with the same analysis chain (#8049)
* [Analysis] Parse synonyms with the same analysis chain

Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain.

Close #7199
2017-06-20 21:50:33 +09:00
Andy Bristol
4c5bd57619 Rename simple pattern tokenizers (#25300)
Changed names to be snake case for consistency

Related to #25159, original issue #23363
2017-06-19 13:48:43 -07:00
debadair
c161d90524 [DOCS] Defined es-test-dir and plugins-examples-dir in index.asciidoc. (#25232)
Use these attributes when specifying the location of included tests.
2017-06-15 08:54:10 -07:00
Adrien Grand
0c117145f6 Upgrade to lucene-7.0.0-snapshot-92b1783. (#25222)
This snapshot has faster range queries on range fields (LUCENE-7828), more
accurate norms (LUCENE-7730) and the ability to use fake term frequencies
(LUCENE-7854).
2017-06-15 09:52:07 +02:00
Andy Bristol
48696ab544 expose simple pattern tokenizers (#25159)
Expose the experimental simplepattern and 
simplepatternsplit tokenizers in the common 
analysis plugin. They provide tokenization based 
on regular expressions, using Lucene's 
deterministic regex implementation that is usually 
faster than Java's and has protections against 
creating too-deep stacks during matching.

Both have a not-very-useful default pattern of the 
empty string because all tokenizer factories must 
be able to be instantiated at index creation time. 
They should always be configured by the user 
in practice.
2017-06-13 12:46:59 -07:00
Jim Ferenczi
2508df6cc8 Add missing link for the WordDelimiterGraphFilter 2017-04-28 17:12:38 +02:00
Adrien Grand
1be2800120 Only allow one type on 7.0 indices (#24317)
This adds the `index.mapping.single_type` setting, which enforces that indices
have at most one type when it is true. The default value is true for 6.0+ indices
and false for old indices.

Relates #15613
2017-04-27 08:43:20 +02:00
Nik Everett
ad69503dce CONSOLEify analysis docs
Converts the analysis docs to that were marked as json into `CONSOLE`
format. A few of them were in yaml but marked as json for historical
reasons. I added more complete examples for a few of the less obvious
sounding ones.

Relates to #18160
2017-04-02 11:17:14 -04:00
Nik Everett
514187be8e Fix language in some docs
The pattern-analyzer docs contained a snippet that was an expanded
regex that was marked as `[source,js]`. This changes it to
`[source,regex]`.

The htmlstrip-charfilter and pattern-replace-charfilter docs had
examples that were actually a list of tokens but marked `[source,js]`.
This marks them as `[source,text]` so they don't count as unconverted
CONSOLE snippets.

The pattern-replace-charfilter also had a doc who's test was
skipped because of funny interaction with the test framework. This
fixes the test.

Three more down, eighty-two to go.

Relates to #18160
2017-04-01 14:45:44 -04:00
Nik Everett
9baa48a928 CONSOLEify lang-analyzer docs
CONSOLEifies the lang-analyzer docs and replaces the (invalid)
empty `keyword_marker` setups that were on the page with one
that contains the word "example" translated into the appropriate
language.

Relates to #18160
2017-04-01 14:21:58 -04:00
Abdon Pijpelink
ef1329727d Update compound-word-tokenfilter.asciidoc (#23817)
Updated URL to OFFO Sourceforge project
2017-03-30 12:27:32 +02:00
Ali Beyad
2120086d82 Adds pattern keyword marker filter support (#23600)
This commit adds support for the pattern keyword marker filter in
Lucene.  Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the `keywords_pattern` setting is excluded from being
stemmed by any stemming filters.

Closes #4877
2017-03-28 11:13:34 -04:00
Nik Everett
a783c6c85c CONSOLEify some more docs
And expand on the `stemmer_override` examples, including the
file on disk and an example of specifying the rules inline.

Relates to #18160
2017-03-22 17:58:06 -04:00
Nik Everett
e860fe7363 CONSOLEify some more docs
Relates to #18160
2017-03-22 17:15:14 -04:00
Nik Everett
1dee2f32a4 Docs: CONSOLEify synonym tokenfiler docs
Relates to #18160
2017-03-22 16:30:52 -04:00
Nik Everett
1c1b29400b Docs: Fix language on a few snippets
They aren't `js`, they are their own thing.

Relates to #18160
2017-03-22 15:57:28 -04:00
Jim Ferenczi
63bdd01eb7 Expose WordDelimiterGraphTokenFilter (#23327)
This change exposes the new Lucene graph based word delimiter token filter in the analysis filters.
Unlike the `word_delimiter` this token filter named `word_delimiter_graph` correctly handles multi terms expansion at query time.

Closes #23104
2017-02-24 00:53:38 +01:00
markwalkom
ced99dde50 Update stop-analyzer.asciidoc (#23195)
Clarified where the stopwords file needs to live
2017-02-16 13:36:15 +01:00
Adrien Grand
f3509b8003 Consolify docs/reference/analysis/tokenfilters/pattern-capture-tokenfilter.asciidoc. (#23050) 2017-02-13 11:00:12 +01:00
Clinton Gormley
f5e7c25e24 Update normalizers.asciidoc
analyzers -> normalizers
2017-02-07 12:09:39 +01:00
Shubham Aggarwal
e07e4cc4dd Fix incorrect heading for Whitespace Tokenizer (#22883) 2017-01-31 12:51:37 +01:00
Daniel Mitterdorfer
aece89d6a1 Make boolean conversion strict (#22200)
This PR removes all leniency in the conversion of Strings to booleans: "true"
is converted to the boolean value `true`, "false" is converted to the boolean
value `false`. Everything else raises an error.
2017-01-19 07:59:18 +01:00
Michael McCandless
1d1bdd476c Finish exposing FlattenGraphTokenFilter (#22667) 2017-01-18 11:05:34 -05:00
Clinton Gormley
519a9c469d Update truncate token filter to not mention the keyword tokenizer
The advice predates the existence of the keyword field

Closes #22650
2017-01-17 12:15:22 +01:00
Matt Weber
609d2aab15 QueryString and SimpleQueryString Graph Support (#22541)
Add support for graph token streams to "query_String" and
"simple_query_string" queries.
2017-01-11 18:59:43 +01:00
Achraf
5dc85c25d9 Hindu-Arabico-Latino Numerals (#22476)
Hi, same edit as for : https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
2017-01-10 15:24:56 +01:00
Adrien Grand
3f805d68cb Add the ability to set an analyzer on keyword fields. (#21919)
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.

Closes #18064
2016-12-30 09:36:10 +01:00
Francesc Gil
dec6fc2d40 Repeated language analyzers (#22240)
* Repeated language analyzers

The `catalan` analyzer was repeated on the supported list :)

* Reordered the languages to have alphabetic order

* Added space for format

* Reordered the languages and removed repeated
2016-12-21 17:32:02 +01:00
Thibault Pierre
e494d6a94e Fix wrong link (#22019) 2016-12-07 17:58:46 +01:00
Allen Torres
887fbb6387 Update lowercase-tokenizer.asciidoc (#21896)
Fixed typo
2016-12-02 10:49:51 -05:00
Matt Weber
04e07bcdb6 Synonym Graph Support (LUCENE-6664) (#21517)
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.

This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.
2016-11-28 09:25:49 -08:00
Achraf
d81a928b1f Correction of the names of numirals (#21531)
What was called Arabic numerals is actually Hindu - Eastern Arabic notation. And the Latin numerals you refer to is the Arabic numbers.
2016-11-25 14:30:49 +01:00
Pascal Borreli
fcb01deb34 Fixed typos (#20843) 2016-10-10 14:51:47 -06:00