OpenSearch

Commit Graph

Author	SHA1	Message	Date
Itamar Syn-Hershko	5f172b6795	[Feature] Adding a char_group tokenizer (#24186 ) === Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```	2018-05-22 16:26:31 +02:00
Jim Ferenczi	bdb79d021a	Fix docs failure on language analyzers (#30722 ) This commit fixes docs failure on language analyzers when compared to the built in analyzers. The `elision` filters used by the rebuilt language analyzers should be case insensitive to match the definition of the prebuilt analyzers. Closes #30557	2018-05-22 09:58:12 +02:00
Nik Everett	9881bfaea5	Docs: Document how to rebuild analyzers (#30498 ) Adds documentation for how to rebuild all the built in analyzers and tests for that documentation using the mechanism added in #29535. Closes #29499	2018-05-14 18:40:54 -04:00
Jason Tedor	4a4e3d70d5	Default to one shard (#30539 ) This commit changes the default out-of-the-box configuration for the number of shards from five to one. We think this will help address a common problem of oversharding. For users with time-based indices that need a different default, this can be managed with index templates. For users with non-time-based indices that find they need to re-shard with the split API in place they no longer need to resort only to reindexing. Since this has the impact of changing the default number of shards used in REST tests, we want to ensure that we still have coverage for issues that could arise from multiple shards. As such, we randomize (rarely) the default number of shards in REST tests to two. This is managed via a global index template. However, some tests check the templates that are in the cluster state during the test. Since this template is randomly there, we need a way for tests to skip adding the template used to set the number of shards to two. For this we add the default_shards feature skip. To avoid having to write our docs in a complicated way because sometimes they might be behind one shard, and sometimes they might be behind two shards we apply the default_shards feature skip to all docs tests. That is, these tests will always run with the default number of shards (one).	2018-05-14 12:22:35 -04:00
Nik Everett	f9dc86836d	Docs: Test examples that recreate lang analyzers (#29535 ) We have a pile of documentation describing how to rebuild the built in language analyzers and, previously, our documentation testing framework made sure that the examples successfully built an analyzer but they didn't assert that the analyzer built by the documentation matches the built in anlayzer. Unsuprisingly, some of the examples aren't quite right. This adds a mechanism that tests that the analyzers built by the docs. The mechanism is fairly simple and brutal but it seems to be working: build a hundred random unicode sequences and send them through the `_analyze` API with the rebuilt analyzer and then again through the built in analyzer. Then make sure both APIs return the same results. Each of these calls to `_anlayze` takes about 20ms on my laptop which seems fine.	2018-05-09 09:23:10 -04:00
Mayya Sharipova	34e95e5d50	[DOCS] Add supported token filters Update normalizers.asciidoc with the list of supported token filters Closes #28605	2018-02-13 14:10:25 -08:00
Jim Ferenczi	7c2bcf3953	Mark synonym_graph as beta in the docs (#28496 ) We do want to keep this functionality in the future and we provide support for it. This change is a first step towards replacing the `synonym` token filter with `synonym_graph`.	2018-02-02 16:33:48 +01:00
deepybee	48c8098e15	Fixed several typos in analyzers section (#28247 )	2018-01-18 08:51:53 +00:00
Adrien Grand	1b660821a2	Allow `_doc` as a type. (#27816 ) Allowing `_doc` as a type will enable users to make the transition to 7.0 smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`. This also moves most of the documentation to `_doc` as a type name. Closes #27750 Closes #27751	2017-12-14 17:47:53 +01:00
Martijn van Groningen	442c3b8bcf	docs: fix link	2017-12-13 16:51:21 +01:00
Christoph Büscher	c4fe7d3f72	[Docs] add deprecation warning for `delimited_payload_filter` renaming	2017-12-04 10:22:05 +01:00
kel	4885acb048	Replace `delimited_payload_filter` by `delimited_payload` (#26625 ) The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is deprecated and should be replaced by `delimited_payload`. Closes #21978	2017-11-24 13:03:19 +01:00
Mayya Sharipova	148376c2c5	Add limits for ngram and shingle settings (#27211 ) * Add limits for ngram and shingle settings (#27211) Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes #25887	2017-11-07 08:14:55 -05:00
Md. Abdulla-Al-Sun	a40c474e10	Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238)	2017-10-05 13:25:05 +02:00
markwalkom	dbea83a1d0	[Docs] Update length-tokenfilter.asciidoc (#26849 ) Made it clear what the numeric value of `Integer.MAX_VALUE` is,	2017-10-02 11:01:43 +02:00
olcbean	6952f7b560	Validate top-level keys for create index request (#23755 ) (#23869 ) This commit ensures create index requests do not ignore unknown keys passed to the request. closes #23755	2017-09-26 09:49:20 -07:00
Christoph Büscher	3827918417	Add configurable `maxTokenLength` parameter to whitespace tokenizer (#26749 ) Other tokenizers like the standard tokenizer allow overriding the default maximum token length of 255 using the `"max_token_length` parameter. This change enables using this parameter also with the whitespace tokenizer. The range that is currently allowed is from 0 to StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT, which is 1024 * 1024 = 1048576 characters. Closes #26643	2017-09-25 17:21:19 +02:00
Tahmim Ahmed Shibli	34662c9e6d	[Docs] Fix name of character filter in example. (#26724 )	2017-09-20 17:08:43 +02:00
Christoph Büscher	254c1b28e9	[Docs] Clarify behaviour of Pattern Capture Token Filter during search (#26278 ) There was some confusion about the fact that tokens emitted from a Pattern Capture Token Filter are treated as synonyms when used to analyze a search query. This commit adds an explanation to the note in the docs to emphasize this behaviour. Closes #25746	2017-08-21 14:56:52 +02:00
Clinton Gormley	ff4a2519f2	Update experimental labels in the docs (#25727 ) Relates https://github.com/elastic/elasticsearch/issues/19798 Removed experimental label from: * Painless * Diversified Sampler Agg * Sampler Agg * Significant Terms Agg * Terms Agg document count error and execution_hint * Cardinality Agg precision_threshold * Pipeline Aggregations * index.shard.check_on_startup * index.store.type (added warning) * Preloading data into the file system cache * foreach ingest processor * Field caps API * Profile API Added experimental label to: * Moving Average Agg Prediction Changed experimental to beta for: * Adjacency matrix agg * Normalizers * Tasks API * Index sorting Labelled experimental in Lucene: * ICU plugin custom rules file * Flatten graph token filter * Synonym graph token filter * Word delimiter graph token filter * Simple pattern tokenizer * Simple pattern split tokenizer Replaced experimental label with warning that details may change in the future: * Analysis explain output format * Segments verbose output format * Percentile Agg compression and HDR Histogram * Percentile Rank Agg HDR Histogram	2017-07-18 14:06:22 +02:00
Neil Rickards	5189bd14f1	[Docs] Fix typo in pattern-tokenizer.asciidoc (#25626 )	2017-07-13 18:43:48 +02:00
Simon Willnauer	e81804cfa4	Add a shard filter search phase to pre-filter shards based on query rewriting (#25658 ) Today if we search across a large amount of shards we hit every shard. Yet, it's quite common to search across an index pattern for time based indices but filtering will exclude all results outside a certain time range ie. `now-3d`. While the search can potentially hit hundreds of shards the majority of the shards might yield 0 results since there is not document that is within this date range. Kibana for instance does this regularly but used `_field_stats` to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results. This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.	2017-07-12 22:19:20 +02:00
Jun Ohtani	62d1969595	Parse synonyms with the same analysis chain (#8049 ) * [Analysis] Parse synonyms with the same analysis chain Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain. Close #7199	2017-06-20 21:50:33 +09:00
Andy Bristol	4c5bd57619	Rename simple pattern tokenizers (#25300 ) Changed names to be snake case for consistency Related to #25159, original issue #23363	2017-06-19 13:48:43 -07:00
debadair	c161d90524	[DOCS] Defined es-test-dir and plugins-examples-dir in index.asciidoc. (#25232 ) Use these attributes when specifying the location of included tests.	2017-06-15 08:54:10 -07:00
Adrien Grand	0c117145f6	Upgrade to lucene-7.0.0-snapshot-92b1783. (#25222 ) This snapshot has faster range queries on range fields (LUCENE-7828), more accurate norms (LUCENE-7730) and the ability to use fake term frequencies (LUCENE-7854).	2017-06-15 09:52:07 +02:00
Andy Bristol	48696ab544	expose simple pattern tokenizers (#25159 ) Expose the experimental simplepattern and simplepatternsplit tokenizers in the common analysis plugin. They provide tokenization based on regular expressions, using Lucene's deterministic regex implementation that is usually faster than Java's and has protections against creating too-deep stacks during matching. Both have a not-very-useful default pattern of the empty string because all tokenizer factories must be able to be instantiated at index creation time. They should always be configured by the user in practice.	2017-06-13 12:46:59 -07:00
Jim Ferenczi	2508df6cc8	Add missing link for the WordDelimiterGraphFilter	2017-04-28 17:12:38 +02:00
Adrien Grand	1be2800120	Only allow one type on 7.0 indices (#24317 ) This adds the `index.mapping.single_type` setting, which enforces that indices have at most one type when it is true. The default value is true for 6.0+ indices and false for old indices. Relates #15613	2017-04-27 08:43:20 +02:00
Nik Everett	ad69503dce	CONSOLEify analysis docs Converts the analysis docs to that were marked as json into `CONSOLE` format. A few of them were in yaml but marked as json for historical reasons. I added more complete examples for a few of the less obvious sounding ones. Relates to #18160	2017-04-02 11:17:14 -04:00
Nik Everett	514187be8e	Fix language in some docs The pattern-analyzer docs contained a snippet that was an expanded regex that was marked as `[source,js]`. This changes it to `[source,regex]`. The htmlstrip-charfilter and pattern-replace-charfilter docs had examples that were actually a list of tokens but marked `[source,js]`. This marks them as `[source,text]` so they don't count as unconverted CONSOLE snippets. The pattern-replace-charfilter also had a doc who's test was skipped because of funny interaction with the test framework. This fixes the test. Three more down, eighty-two to go. Relates to #18160	2017-04-01 14:45:44 -04:00
Nik Everett	9baa48a928	CONSOLEify lang-analyzer docs CONSOLEifies the lang-analyzer docs and replaces the (invalid) empty `keyword_marker` setups that were on the page with one that contains the word "example" translated into the appropriate language. Relates to #18160	2017-04-01 14:21:58 -04:00
Abdon Pijpelink	ef1329727d	Update compound-word-tokenfilter.asciidoc (#23817 ) Updated URL to OFFO Sourceforge project	2017-03-30 12:27:32 +02:00
Ali Beyad	2120086d82	Adds pattern keyword marker filter support (#23600 ) This commit adds support for the pattern keyword marker filter in Lucene. Previously, the keyword marker filter in Elasticsearch supported specifying a keywords set or a path to a set of keywords. This commit exposes the regular expression pattern based keyword marker filter also available in Lucene, so that any token matching the pattern specified by the `keywords_pattern` setting is excluded from being stemmed by any stemming filters. Closes #4877	2017-03-28 11:13:34 -04:00
Nik Everett	a783c6c85c	CONSOLEify some more docs And expand on the `stemmer_override` examples, including the file on disk and an example of specifying the rules inline. Relates to #18160	2017-03-22 17:58:06 -04:00
Nik Everett	e860fe7363	CONSOLEify some more docs Relates to #18160	2017-03-22 17:15:14 -04:00
Nik Everett	1dee2f32a4	Docs: CONSOLEify synonym tokenfiler docs Relates to #18160	2017-03-22 16:30:52 -04:00
Nik Everett	1c1b29400b	Docs: Fix language on a few snippets They aren't `js`, they are their own thing. Relates to #18160	2017-03-22 15:57:28 -04:00
Jim Ferenczi	63bdd01eb7	Expose WordDelimiterGraphTokenFilter (#23327 ) This change exposes the new Lucene graph based word delimiter token filter in the analysis filters. Unlike the `word_delimiter` this token filter named `word_delimiter_graph` correctly handles multi terms expansion at query time. Closes #23104	2017-02-24 00:53:38 +01:00
markwalkom	ced99dde50	Update stop-analyzer.asciidoc (#23195 ) Clarified where the stopwords file needs to live	2017-02-16 13:36:15 +01:00
Adrien Grand	f3509b8003	Consolify docs/reference/analysis/tokenfilters/pattern-capture-tokenfilter.asciidoc. (#23050 )	2017-02-13 11:00:12 +01:00
Clinton Gormley	f5e7c25e24	Update normalizers.asciidoc analyzers -> normalizers	2017-02-07 12:09:39 +01:00
Shubham Aggarwal	e07e4cc4dd	Fix incorrect heading for Whitespace Tokenizer (#22883 )	2017-01-31 12:51:37 +01:00
Daniel Mitterdorfer	aece89d6a1	Make boolean conversion strict (#22200 ) This PR removes all leniency in the conversion of Strings to booleans: "true" is converted to the boolean value `true`, "false" is converted to the boolean value `false`. Everything else raises an error.	2017-01-19 07:59:18 +01:00
Michael McCandless	1d1bdd476c	Finish exposing FlattenGraphTokenFilter (#22667 )	2017-01-18 11:05:34 -05:00
Clinton Gormley	519a9c469d	Update truncate token filter to not mention the keyword tokenizer The advice predates the existence of the keyword field Closes #22650	2017-01-17 12:15:22 +01:00
Matt Weber	609d2aab15	QueryString and SimpleQueryString Graph Support (#22541 ) Add support for graph token streams to "query_String" and "simple_query_string" queries.	2017-01-11 18:59:43 +01:00
Achraf	5dc85c25d9	Hindu-Arabico-Latino Numerals (#22476 ) Hi, same edit as for : https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html	2017-01-10 15:24:56 +01:00
Adrien Grand	3f805d68cb	Add the ability to set an analyzer on keyword fields. (#21919 ) This adds a new `normalizer` property to `keyword` fields that pre-processes the field value prior to indexing, but without altering the `_source`. Note that only the normalization components that work on a per-character basis are applied, so for instance stemming filters will be ignored while lowercasing or ascii folding will be applied. Closes #18064	2016-12-30 09:36:10 +01:00
Francesc Gil	dec6fc2d40	Repeated language analyzers (#22240 ) * Repeated language analyzers The `catalan` analyzer was repeated on the supported list :) * Reordered the languages to have alphabetic order * Added space for format * Reordered the languages and removed repeated	2016-12-21 17:32:02 +01:00

1 2 3 4

176 Commits