OpenSearch

Commit Graph

Author	SHA1	Message	Date
James Rodewig	9c75f14a9f	[DOCS] Reformat classic token filter docs (#48314 )	2019-10-23 10:14:25 -05:00
James Rodewig	a66bb2c7ed	[DOCS] Reformat CJK bigram and CJK width token filter docs (#48210 )	2019-10-21 08:44:49 -05:00
James Rodewig	8677653c5b	[DOCS] Reformat apostrophe token filter docs (#48076 )	2019-10-16 08:51:14 -04:00
Wilder Pereira	8c73e215b2	[DOCS] Remove unneeded spaces from custom analyzer snippet (#47332 )	2019-10-15 15:53:16 -04:00
James Rodewig	601a88bede	[DOCS] Sort analyzers, tokenizers, and token filters alphabetically (#48068 )	2019-10-15 15:47:25 -04:00
James Rodewig	af7aba18d4	Fixed sample code for minhash (#46385 ) The sample code is wrong. Field type is required for the sample field. I guess the intention was to give the sample field the name ```fingerprint```, mapping it as ```text``` using the custom analyzer ```my_analyzer```	2019-09-12 13:29:44 -04:00
Abhilash Bolla	20e93bca6b	Fixed grammar in pattern replace char filter docs. (#46546 ) Minor grammar fix in the pattern replace char filter docs.	2019-09-10 11:04:07 -07:00
James Rodewig	b59ecde041	[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353 ) (#46502 )	2019-09-09 13:38:14 -04:00
James Rodewig	f04573f8e8	[DOCS] [5 of 5] Change // TESTRESPONSE comments to [source,console-results] (#46449 ) (#46459 )	2019-09-06 16:09:09 -04:00
James Rodewig	bb7bff5e30	[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result]" (#46295 ) (#46418 )	2019-09-06 09:22:08 -04:00
James Rodewig	3e62cf9d74	[DOCS] Correct custom analyzer callouts (#46030 )	2019-08-29 10:08:18 -04:00
James Rodewig	d46545f729	[DOCS] Update anchors and links for Elasticsearch API relocation (#44500 )	2019-07-19 09:18:23 -04:00
Christoph Büscher	2cc7f5a744	Allow reloading of search time analyzers (#43313 ) Currently changing resources (like dictionaries, synonym files etc...) of search time analyzers is only possible by closing an index, changing the underlying resource (e.g. synonym files) and then re-opening the index for the change to take effect. This PR adds a new API endpoint that allows triggering reloading of certain analysis resources (currently token filters) that will then pick up changes in underlying file resources. To achieve this we introduce a new type of custom analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows swapping out analysis components. Custom analyzers that contain filters that are marked as "updateable" will automatically choose this implementation. This PR also adds this capability to `synonym` token filters for use in search time analyzers. Relates to #29051	2019-06-28 09:55:40 +02:00
Alan Woodward	05a7333eca	Require [articles] setting in elision filter (#43083 ) We should throw an exception at construction time if a list of articles is not provided, otherwise we can get random NPEs during indexing. Relates to #43002	2019-06-27 09:02:36 +01:00
Sachin Frayne	44aedcf97a	Correct the description of generate_word_parts (#43026 )	2019-06-10 11:36:31 +01:00
James Rodewig	5342616a23	[DOCS] Add explicit `articles_case` parameter to Elision Token Filter example (#42987 )	2019-06-07 11:24:43 -04:00
Mayya Sharipova	5a76f46ac6	Fix error with mapping in docs Related to #39630	2019-05-30 10:28:09 -04:00
Peter Dyson	b84b5525e1	[DOCS] path_hierarchy tokenizer examples (#39630 ) Closes #17138	2019-05-30 09:17:55 -04:00
Alan Woodward	3a35427b6d	Improvements to docs around multiplexer and synonyms (#41645 ) This commit fixes a multiplexer doc error concerning synonyms, and adds suggestions on how to combine the two filters.	2019-05-07 09:10:14 +01:00
James Rodewig	d46f55f013	[DOCS] Add attribute to escape minimal pt token link in Asciidoctor (#41613 )	2019-04-30 14:11:48 -04:00
James Rodewig	53702efddd	[DOCS] Add anchors for Asciidoctor migration (#41648 )	2019-04-30 10:20:17 -04:00
Guilherme Ferreira	48a17d5768	[Docs] Correct default stop list constant (#41342 )	2019-04-23 19:13:51 +02:00
Guilherme Ferreira	23e40c040a	[Docs] Correct spelling of "_none_" (#41192 )	2019-04-15 15:12:28 +02:00
Guilherme Ferreira	414debd740	[Docs] Correct spelling the "_none_" stopwords element (#41191 )	2019-04-15 14:12:26 +02:00
Christoph Büscher	dfc70e6ef0	Correct indention in synonym docs (#40711 ) The stopword filter should be on the same level as the synonym filter in the example request. Correcting this for better readability.	2019-04-02 01:44:24 +02:00
Mayya Sharipova	671a209ed9	Correct errors in min_hash filter documentation Related to #39671	2019-03-08 16:21:24 -05:00
Mayya Sharipova	54d41afac1	Add documentation for min_hash filter (#39671 ) Closes #20757	2019-03-07 08:49:48 -05:00
jimczi	ecb6df137c	fix typo in synonym graph filter docs	2019-03-05 18:20:14 +01:00
Christoph Büscher	4b77d0434a	Remove `nGram` and `edgeNGram` token filter names (#39070 ) In #30209 we deprecated the camel case `nGram` filter name in favour of `ngram` and did the same for `edgeNGram` and `edge_ngram` and we are removing those names in 8.0. This change disallows using the deprecated names for new indices created in 7.0 by throwing an error if these filters are used. Relates to #38911	2019-02-21 16:55:40 +01:00
Jim Ferenczi	83402b1320	Remove beta marker from the synonym_graph docs (#38185 )	2019-02-19 10:49:49 +01:00
Mayya Sharipova	0e1b1959fe	Correct rebuilt persian analyzer (#38724 ) (#38744 ) Make substitution of \u200C with a space explicit. The symbol `\u200C` in a test string should be substituted with a space in the rebuilt Persian analyzer, but it is not. Correcting the line `"mappings": [ "\\u200C=> "] <1>` to `"mappings": [ "\\u200C=>\\u0020"] <1>` solves the problem. This change explicitly says to substitute ZWNJ with a space. Closes #38188	2019-02-11 14:17:18 -05:00
Christoph Büscher	34f2d2ec91	Remove remaining occurrences of "include_type_name=true" in docs (#37646 )	2019-01-22 15:13:52 +01:00
Christoph Büscher	3a96608b3f	Remove more include_type_name and types from docs (#37601 )	2019-01-18 14:11:18 +01:00
Christoph Büscher	25aac4f77f	Remove `include_type_name` in asciidoc where possible (#37568 ) The "include_type_name" parameter was temporarily introduced in #37285 to facilitate moving the default parameter setting to "false" in many places in the documentation code snippets. Most of the places can simply be reverted without causing errors. In this change I looked for asciidoc files that contained the "include_type_name=true" addition when creating new indices but didn't look likely they made use of the "_doc" type for mappings. This is mostly the case e.g. in the analysis docs where index creating often only contains settings. I manually corrected the use of types in some places where the docs still used an explicit type name and not the dummy "_doc" type.	2019-01-18 09:34:11 +01:00
Julie Tibshirani	36a3b84fc9	Update the default for include_type_name to false. (#37285 ) * Default include_type_name to false for get and put mappings. * Default include_type_name to false for get field mappings. * Add a constant for the default include_type_name value. * Default include_type_name to false for get and put index templates. * Default include_type_name to false for create index. * Update create index calls in REST documentation to use include_type_name=true. * Some minor clean-ups around the get index API. * In REST tests, use include_type_name=true by default for index creation. * Make sure to use 'expression == false'. * Clarify the different IndexTemplateMetaData toXContent methods. * Fix FullClusterRestartIT#testSnapshotRestore. * Fix the ml_anomalies_default_mappings test. * Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests. We make sure to specify include_type_name=true during xContent parsing, so we continue to test the legacy typed responses. XContent generation for the typeless responses is currently only covered by REST tests, but we will be adding unit test coverage for these as we implement each typeless API in the Java HLRC. This commit also refactors GetMappingsResponse to follow the same approach as the other mappings-related responses, where we read include_type_name out of the xContent params, instead of creating a second toXContent method. This gives better consistency in the response parsing code. * Fix more REST tests. * Improve some wording in the create index documentation. * Add a note about types removal in the create index docs. * Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL. * Make sure to mention include_type_name in the REST docs for affected APIs. * Make sure to use 'expression == false' in FullClusterRestartIT. * Mention include_type_name in the REST templates docs.	2019-01-14 13:08:01 -08:00
Josh Soref	edb48321ba	[DOCS] Various spelling corrections (#37046 )	2019-01-07 14:44:12 +01:00
Christoph Büscher	132ccbec2f	[Docs] Extend common-grams-tokenfilter doctest example (#36807 ) Adding an example output using the "_analyze" API and expected response.	2018-12-19 09:49:23 +01:00
Christoph Büscher	41feaf137c	[Docs] Fix error in Common Grams Token Filter (#36774 ) The first example given is missing the two single-token cases for "is" and "a". The later usage example is slightly wrong in that custom analyzers should go under `settings.analysis.analyzer`.	2018-12-18 16:54:06 +01:00
Alan Woodward	af57575838	Allow word_delimiter_graph_filter to not adjust internal offsets (#36699 ) This commit adds an adjust_offsets parameter to the word_delimiter_graph token filter, defaulting to true. Most of the time you'd want sub-tokens emitted by this filter to have offsets that are adjusted to their real position in the token stream; however, some token filters can change the length or starting position of a token (eg trim) without changing their offset attributes, and this can lead to word_delimiter_graph emitting illegal offsets. Setting adjust_offsets to false in these cases will allow indexing again. Fixes #34741, #33710	2018-12-18 13:20:51 +00:00
Jim Ferenczi	18866c4c0b	Make hits.total an object in the search response (#35849 ) This commit changes the format of the `hits.total` in the search response to be an object with a `value` and a `relation`. The `value` indicates the number of hits that match the query and the `relation` indicates whether the number is accurate (in which case the relation is equal to `eq`) or a lower bound of the total (in which case it is equal to `gte`). This change also adds a parameter called `rest_total_hits_as_int` that can be used in the search APIs to opt out from this change (retrieve the total hits as a number in the rest response). Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain `hits.total` (`track_total_hits: false`). We'll add a way to get a lower bound of the total hits in a follow up (to allow numbers to be passed to `track_total_hits`). Relates #33028	2018-12-05 19:49:06 +01:00
Alan Woodward	a646f85a99	Ensure TokenFilters only produce single tokens when parsing synonyms (#34331 ) A number of tokenfilters can produce multiple tokens at the same position. This is a problem when using token chains to parse synonym files, as the SynonymMap requires that there are no stacked tokens in its input. This commit ensures that when used to parse synonyms, these tokenfilters either produce a single version of their input token, or that they throw an error when mappings are generated. In indexes created in elasticsearch 6.x deprecation warnings are emitted in place of the error. * asciifolding and cjk_bigram produce only the folded or bigrammed token * decompounders, synonyms and keyword_repeat are skipped * n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors Fixes #34298	2018-11-29 10:35:38 +00:00
Alan Woodward	26cc8ff8c3	Add pointer to the index-phrases option in shingle filter docs (#35771 ) We should be discouraging the use of shingle filters and instead pointing users to the index-phrases parameter on text fields.	2018-11-21 15:27:11 +00:00
Alan Woodward	f6a43b5939	Add a prebuilt ICU Analyzer (#34958 ) The ICU plugin provides the building blocks of an analysis chain, but doesn't actually have a prebuilt analyzer. It would be better for users if there was a simple analyzer that they could use out of the box, and also something we can point to from the CJK Analyzer docs as a superior alternative. Relates to #34285	2018-11-21 09:00:48 +00:00
Julie Tibshirani	f854330e06	Make sure to use the type _doc in the REST documentation. (#34662 ) * Replace custom type names with _doc in REST examples. * Avoid using two mapping types in the percolator docs. * Rename doc -> _doc in the main repository README. * Also replace some custom type names in the HLRC docs.	2018-10-22 11:54:04 -07:00
Christoph Büscher	e869f9a78c	[Docs] Update synonym-tokenfilter.asciidoc (#34706 ) Remove ugly double-dot.	2018-10-22 17:18:29 +02:00
Jim Ferenczi	a9daa5cb90	[DOCS] Remove beta label from normalizers (#34326 )	2018-10-05 15:42:00 +02:00
Nikolay Vasiliev	16956a1a05	[DOCS] Clarify 'type' parameter meaning for custom analyzer (#34012 ) This pull request improves the docs on the meaning of type parameter on the custom analyzer doc page. Closes #33456	2018-09-25 15:32:27 +02:00
Alan Woodward	5107949402	Allow TokenFilterFactories to rewrite themselves against their preceding chain (#33702 ) We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they appear before them in a chain, because they produce multiple tokens at the same position. This commit adds two methods to the TokenFilterFactory interface. * `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`. * `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym list `Analyzer`. By default it returns `true`. Fixes #33609	2018-09-19 15:52:14 +01:00
Alan Woodward	f598297f55	Add predicate_token_filter (#33431 ) This allows users to filter out tokens from a TokenStream using painless scripts, instead of having to write specialised Java code and packaging it up into a plugin. The commit also refactors the AnalysisPredicateScript.Token class so that it wraps and makes read-only an AttributeSource.	2018-09-11 09:16:39 +01:00
Jim Ferenczi	7ad71f906a	Upgrade to a Lucene 8 snapshot (#33310 ) The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly. Some comments about the change: * Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309 Closes #32899	2018-09-06 14:42:06 +02:00
Alan Woodward	636442700c	Add conditional token filter to elasticsearch (#31958 ) This allows tokenfilters to be applied selectively, depending on the status of the current token in the tokenstream. The filter takes a scripted predicate, and only applies its subfilter when the predicate returns true.	2018-09-05 14:52:43 +01:00
Matthias Sieber	a39f6f09f4	fixed elements in array of produced terms (#32519 )	2018-08-02 11:12:15 -04:00
Christoph Büscher	61486680a2	Add exclusion option to `keep_types` token filter (#32012 ) Currently the `keep_types` token filter includes all token types specified using its `types` parameter. Lucene's TypeTokenFilter also provides a second mode where instead of keeping the specified tokens (include) they are filtered out (exclude). This change exposes this option as a new `mode` parameter that can either take the values `include` (the default, if not specified) or `exclude`. Closes #29277	2018-07-17 09:04:41 +02:00
Sohaib Iftikhar	88c270d844	Added lenient flag for synonym token filter (#31484 ) * Added lenient flag for synonym-tokenfilter. Relates to #30968 * added docs for synonym-graph-tokenfilter -- Also made lenient final -- changed from !lenient to lenient == false * Changes after review (1) -- Renamed to ElasticsearchSynonymParser -- Added explanation for ElasticsearchSynonymParser::add method -- Changed ElasticsearchSynonymParser::logger instance to static * Added lenient option for WordnetSynonymParser -- also added more documentation * Added additional documentation * Improved documentation	2018-07-10 17:11:50 -04:00
Alan Woodward	5683bc60a6	Multiplexing token filter (#31208 ) The `multiplexer` filter emits multiple tokens at the same position, each version of the token having been passed through a different filter chain. Identical tokens at the same position are removed. This allows users to, for example, index lowercase and original-case tokens, or stemmed and unstemmed versions, in the same field, so that they can search for a stemmed term within x positions of an unstemmed term.	2018-06-20 10:16:26 +01:00
Alan Woodward	8c0ec05a12	Expose lucene's RemoveDuplicatesTokenFilter (#31275 )	2018-06-18 09:46:12 +01:00
Itamar Syn-Hershko	5f172b6795	[Feature] Adding a char_group tokenizer (#24186 ) === Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```	2018-05-22 16:26:31 +02:00
Jim Ferenczi	bdb79d021a	Fix docs failure on language analyzers (#30722 ) This commit fixes docs failure on language analyzers when compared to the built in analyzers. The `elision` filters used by the rebuilt language analyzers should be case insensitive to match the definition of the prebuilt analyzers. Closes #30557	2018-05-22 09:58:12 +02:00
Nik Everett	9881bfaea5	Docs: Document how to rebuild analyzers (#30498 ) Adds documentation for how to rebuild all the built in analyzers and tests for that documentation using the mechanism added in #29535. Closes #29499	2018-05-14 18:40:54 -04:00
Jason Tedor	4a4e3d70d5	Default to one shard (#30539 ) This commit changes the default out-of-the-box configuration for the number of shards from five to one. We think this will help address a common problem of oversharding. For users with time-based indices that need a different default, this can be managed with index templates. For users with non-time-based indices that find they need to re-shard with the split API in place they no longer need to resort only to reindexing. Since this has the impact of changing the default number of shards used in REST tests, we want to ensure that we still have coverage for issues that could arise from multiple shards. As such, we randomize (rarely) the default number of shards in REST tests to two. This is managed via a global index template. However, some tests check the templates that are in the cluster state during the test. Since this template is randomly there, we need a way for tests to skip adding the template used to set the number of shards to two. For this we add the default_shards feature skip. To avoid having to write our docs in a complicated way because sometimes they might be behind one shard, and sometimes they might be behind two shards we apply the default_shards feature skip to all docs tests. That is, these tests will always run with the default number of shards (one).	2018-05-14 12:22:35 -04:00
Nik Everett	f9dc86836d	Docs: Test examples that recreate lang analyzers (#29535 ) We have a pile of documentation describing how to rebuild the built in language analyzers and, previously, our documentation testing framework made sure that the examples successfully built an analyzer but they didn't assert that the analyzer built by the documentation matches the built in analyzer. Unsurprisingly, some of the examples aren't quite right. This adds a mechanism that tests the analyzers built by the docs. The mechanism is fairly simple and brutal but it seems to be working: build a hundred random unicode sequences and send them through the `_analyze` API with the rebuilt analyzer and then again through the built in analyzer. Then make sure both APIs return the same results. Each of these calls to `_analyze` takes about 20ms on my laptop which seems fine.	2018-05-09 09:23:10 -04:00
Mayya Sharipova	34e95e5d50	[DOCS] Add supported token filters Update normalizers.asciidoc with the list of supported token filters Closes #28605	2018-02-13 14:10:25 -08:00
Jim Ferenczi	7c2bcf3953	Mark synonym_graph as beta in the docs (#28496 ) We do want to keep this functionality in the future and we provide support for it. This change is a first step towards replacing the `synonym` token filter with `synonym_graph`.	2018-02-02 16:33:48 +01:00
deepybee	48c8098e15	Fixed several typos in analyzers section (#28247 )	2018-01-18 08:51:53 +00:00
Adrien Grand	1b660821a2	Allow `_doc` as a type. (#27816 ) Allowing `_doc` as a type will enable users to make the transition to 7.0 smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`. This also moves most of the documentation to `_doc` as a type name. Closes #27750 Closes #27751	2017-12-14 17:47:53 +01:00
Martijn van Groningen	442c3b8bcf	docs: fix link	2017-12-13 16:51:21 +01:00
Christoph Büscher	c4fe7d3f72	[Docs] add deprecation warning for `delimited_payload_filter` renaming	2017-12-04 10:22:05 +01:00
kel	4885acb048	Replace `delimited_payload_filter` by `delimited_payload` (#26625 ) The `delimited_payload_filter` is renamed to `delimited_payload`, the old name is deprecated and should be replaced by `delimited_payload`. Closes #21978	2017-11-24 13:03:19 +01:00
Mayya Sharipova	148376c2c5	Add limits for ngram and shingle settings (#27211 ) * Add limits for ngram and shingle settings (#27211) Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes #25887	2017-11-07 08:14:55 -05:00
Md. Abdulla-Al-Sun	a40c474e10	Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238)	2017-10-05 13:25:05 +02:00
markwalkom	dbea83a1d0	[Docs] Update length-tokenfilter.asciidoc (#26849 ) Made it clear what the numeric value of `Integer.MAX_VALUE` is.	2017-10-02 11:01:43 +02:00
olcbean	6952f7b560	Validate top-level keys for create index request (#23755 ) (#23869 ) This commit ensures create index requests do not ignore unknown keys passed to the request. closes #23755	2017-09-26 09:49:20 -07:00
Christoph Büscher	3827918417	Add configurable `maxTokenLength` parameter to whitespace tokenizer (#26749 ) Other tokenizers like the standard tokenizer allow overriding the default maximum token length of 255 using the `max_token_length` parameter. This change enables using this parameter also with the whitespace tokenizer. The range that is currently allowed is from 0 to StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT, which is 1024 * 1024 = 1048576 characters. Closes #26643	2017-09-25 17:21:19 +02:00
Tahmim Ahmed Shibli	34662c9e6d	[Docs] Fix name of character filter in example. (#26724 )	2017-09-20 17:08:43 +02:00
Christoph Büscher	254c1b28e9	[Docs] Clarify behaviour of Pattern Capture Token Filter during search (#26278 ) There was some confusion about the fact that tokens emitted from a Pattern Capture Token Filter are treated as synonyms when used to analyze a search query. This commit adds an explanation to the note in the docs to emphasize this behaviour. Closes #25746	2017-08-21 14:56:52 +02:00
Clinton Gormley	ff4a2519f2	Update experimental labels in the docs (#25727 ) Relates https://github.com/elastic/elasticsearch/issues/19798 Removed experimental label from: * Painless * Diversified Sampler Agg * Sampler Agg * Significant Terms Agg * Terms Agg document count error and execution_hint * Cardinality Agg precision_threshold * Pipeline Aggregations * index.shard.check_on_startup * index.store.type (added warning) * Preloading data into the file system cache * foreach ingest processor * Field caps API * Profile API Added experimental label to: * Moving Average Agg Prediction Changed experimental to beta for: * Adjacency matrix agg * Normalizers * Tasks API * Index sorting Labelled experimental in Lucene: * ICU plugin custom rules file * Flatten graph token filter * Synonym graph token filter * Word delimiter graph token filter * Simple pattern tokenizer * Simple pattern split tokenizer Replaced experimental label with warning that details may change in the future: * Analysis explain output format * Segments verbose output format * Percentile Agg compression and HDR Histogram * Percentile Rank Agg HDR Histogram	2017-07-18 14:06:22 +02:00
Neil Rickards	5189bd14f1	[Docs] Fix typo in pattern-tokenizer.asciidoc (#25626 )	2017-07-13 18:43:48 +02:00
Simon Willnauer	e81804cfa4	Add a shard filter search phase to pre-filter shards based on query rewriting (#25658 ) Today if we search across a large amount of shards we hit every shard. Yet, it's quite common to search across an index pattern for time based indices but filtering will exclude all results outside a certain time range ie. `now-3d`. While the search can potentially hit hundreds of shards the majority of the shards might yield 0 results since there is no document that is within this date range. Kibana for instance does this regularly but used `_field_stats` to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and its upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results. This change adds a pre-filter phase for searches that can, if the number of shards is higher than the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.	2017-07-12 22:19:20 +02:00
Jun Ohtani	62d1969595	Parse synonyms with the same analysis chain (#8049 ) * [Analysis] Parse synonyms with the same analysis chain Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain. Close #7199	2017-06-20 21:50:33 +09:00
Andy Bristol	4c5bd57619	Rename simple pattern tokenizers (#25300 ) Changed names to be snake case for consistency Related to #25159, original issue #23363	2017-06-19 13:48:43 -07:00
debadair	c161d90524	[DOCS] Defined es-test-dir and plugins-examples-dir in index.asciidoc. (#25232 ) Use these attributes when specifying the location of included tests.	2017-06-15 08:54:10 -07:00
Adrien Grand	0c117145f6	Upgrade to lucene-7.0.0-snapshot-92b1783. (#25222 ) This snapshot has faster range queries on range fields (LUCENE-7828), more accurate norms (LUCENE-7730) and the ability to use fake term frequencies (LUCENE-7854).	2017-06-15 09:52:07 +02:00
Andy Bristol	48696ab544	expose simple pattern tokenizers (#25159 ) Expose the experimental simplepattern and simplepatternsplit tokenizers in the common analysis plugin. They provide tokenization based on regular expressions, using Lucene's deterministic regex implementation that is usually faster than Java's and has protections against creating too-deep stacks during matching. Both have a not-very-useful default pattern of the empty string because all tokenizer factories must be able to be instantiated at index creation time. They should always be configured by the user in practice.	2017-06-13 12:46:59 -07:00
Jim Ferenczi	2508df6cc8	Add missing link for the WordDelimiterGraphFilter	2017-04-28 17:12:38 +02:00
Adrien Grand	1be2800120	Only allow one type on 7.0 indices (#24317 ) This adds the `index.mapping.single_type` setting, which enforces that indices have at most one type when it is true. The default value is true for 6.0+ indices and false for old indices. Relates #15613	2017-04-27 08:43:20 +02:00
Nik Everett	ad69503dce	CONSOLEify analysis docs Converts the analysis docs that were marked as json into `CONSOLE` format. A few of them were in yaml but marked as json for historical reasons. I added more complete examples for a few of the less obvious sounding ones. Relates to #18160	2017-04-02 11:17:14 -04:00
Nik Everett	514187be8e	Fix language in some docs The pattern-analyzer docs contained a snippet that was an expanded regex that was marked as `[source,js]`. This changes it to `[source,regex]`. The htmlstrip-charfilter and pattern-replace-charfilter docs had examples that were actually a list of tokens but marked `[source,js]`. This marks them as `[source,text]` so they don't count as unconverted CONSOLE snippets. The pattern-replace-charfilter also had a doc whose test was skipped because of funny interaction with the test framework. This fixes the test. Three more down, eighty-two to go. Relates to #18160	2017-04-01 14:45:44 -04:00
Nik Everett	9baa48a928	CONSOLEify lang-analyzer docs CONSOLEifies the lang-analyzer docs and replaces the (invalid) empty `keyword_marker` setups that were on the page with one that contains the word "example" translated into the appropriate language. Relates to #18160	2017-04-01 14:21:58 -04:00
Abdon Pijpelink	ef1329727d	Update compound-word-tokenfilter.asciidoc (#23817 ) Updated URL to OFFO Sourceforge project	2017-03-30 12:27:32 +02:00
Ali Beyad	2120086d82	Adds pattern keyword marker filter support (#23600 ) This commit adds support for the pattern keyword marker filter in Lucene. Previously, the keyword marker filter in Elasticsearch supported specifying a keywords set or a path to a set of keywords. This commit exposes the regular expression pattern based keyword marker filter also available in Lucene, so that any token matching the pattern specified by the `keywords_pattern` setting is excluded from being stemmed by any stemming filters. Closes #4877	2017-03-28 11:13:34 -04:00
Nik Everett	a783c6c85c	CONSOLEify some more docs And expand on the `stemmer_override` examples, including the file on disk and an example of specifying the rules inline. Relates to #18160	2017-03-22 17:58:06 -04:00
Nik Everett	e860fe7363	CONSOLEify some more docs Relates to #18160	2017-03-22 17:15:14 -04:00
Nik Everett	1dee2f32a4	Docs: CONSOLEify synonym tokenfilter docs Relates to #18160	2017-03-22 16:30:52 -04:00
Nik Everett	1c1b29400b	Docs: Fix language on a few snippets They aren't `js`, they are their own thing. Relates to #18160	2017-03-22 15:57:28 -04:00
Jim Ferenczi	63bdd01eb7	Expose WordDelimiterGraphTokenFilter (#23327 ) This change exposes the new Lucene graph based word delimiter token filter in the analysis filters. Unlike the `word_delimiter` this token filter named `word_delimiter_graph` correctly handles multi terms expansion at query time. Closes #23104	2017-02-24 00:53:38 +01:00
markwalkom	ced99dde50	Update stop-analyzer.asciidoc (#23195 ) Clarified where the stopwords file needs to live	2017-02-16 13:36:15 +01:00
Adrien Grand	f3509b8003	Consolify docs/reference/analysis/tokenfilters/pattern-capture-tokenfilter.asciidoc. (#23050 )	2017-02-13 11:00:12 +01:00
Clinton Gormley	f5e7c25e24	Update normalizers.asciidoc analyzers -> normalizers	2017-02-07 12:09:39 +01:00
Shubham Aggarwal	e07e4cc4dd	Fix incorrect heading for Whitespace Tokenizer (#22883 )	2017-01-31 12:51:37 +01:00
Daniel Mitterdorfer	aece89d6a1	Make boolean conversion strict (#22200 ) This PR removes all leniency in the conversion of Strings to booleans: "true" is converted to the boolean value `true`, "false" is converted to the boolean value `false`. Everything else raises an error.	2017-01-19 07:59:18 +01:00
Michael McCandless	1d1bdd476c	Finish exposing FlattenGraphTokenFilter (#22667 )	2017-01-18 11:05:34 -05:00
Clinton Gormley	519a9c469d	Update truncate token filter to not mention the keyword tokenizer The advice predates the existence of the keyword field Closes #22650	2017-01-17 12:15:22 +01:00
Matt Weber	609d2aab15	QueryString and SimpleQueryString Graph Support (#22541 ) Add support for graph token streams to "query_String" and "simple_query_string" queries.	2017-01-11 18:59:43 +01:00
Achraf	5dc85c25d9	Hindu-Arabico-Latino Numerals (#22476 ) Hi, same edit as for : https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html	2017-01-10 15:24:56 +01:00
Adrien Grand	3f805d68cb	Add the ability to set an analyzer on keyword fields. (#21919 ) This adds a new `normalizer` property to `keyword` fields that pre-processes the field value prior to indexing, but without altering the `_source`. Note that only the normalization components that work on a per-character basis are applied, so for instance stemming filters will be ignored while lowercasing or ascii folding will be applied. Closes #18064	2016-12-30 09:36:10 +01:00
Francesc Gil	dec6fc2d40	Repeated language analyzers (#22240 ) * Repeated language analyzers The `catalan` analyzer was repeated on the supported list :) * Reordered the languages to have alphabetic order * Added space for format * Reordered the languages and removed repeated	2016-12-21 17:32:02 +01:00
Thibault Pierre	e494d6a94e	Fix wrong link (#22019 )	2016-12-07 17:58:46 +01:00
Allen Torres	887fbb6387	Update lowercase-tokenizer.asciidoc (#21896 ) Fixed typo	2016-12-02 10:49:51 -05:00
Matt Weber	04e07bcdb6	Synonym Graph Support (LUCENE-6664) (#21517 ) Integrate the patch from LUCENE-6664 into elasticsearch and add support for handling a graph token stream in match/multi-match queries. This fixes longstanding bugs with multi-token synonyms returning incorrect results with proximity queries.	2016-11-28 09:25:49 -08:00
Achraf	d81a928b1f	Correction of the names of numerals (#21531 ) What was called Arabic numerals is actually Hindu - Eastern Arabic notation. And the Latin numerals you refer to are the Arabic numerals.	2016-11-25 14:30:49 +01:00
Pascal Borreli	fcb01deb34	Fixed typos (#20843 )	2016-10-10 14:51:47 -06:00
Clinton Gormley	22f1acde94	Docs: Pattern analyzer does not support a max_token_length parameter Closes #20713	2016-10-08 12:27:33 +02:00
Alexander Lin	7cd0316b51	Fix minhash docs level Relates #20547	2016-09-19 07:54:04 -04:00
Clinton Gormley	2f6d0119f1	Added warning messages about the dangers of pathological regexes to: * pattern-replace charfilter * pattern-capture and pattern-replace token filters * pattern tokenizer * pattern analyzer Relates to #20038	2016-09-09 09:53:07 +02:00
Alexander Lin	f825e8f4cb	Exposing lucene 6.x minhash filter. (#20206 ) Exposing lucene 6.x minhash tokenfilter Generate min hash tokens from an incoming stream of tokens that can be used to estimate document similarity. Closes #20149	2016-09-07 09:38:12 +02:00
Jim Ferenczi	4682fc34ae	Add the ability to disable the retrieval of the stored fields entirely This change adds a special field named _none_ that allows to disable the retrieval of the stored fields in a search request or in a TopHitsAggregation. To completely disable stored fields retrieval (including disabling metadata fields retrieval such as _id or _type) use _none_ like this: ```` POST _search { "stored_fields": "_none_" } ````	2016-08-24 16:40:08 +02:00
markwalkom	f556424ab9	Update synonym-tokenfilter.asciidoc (#19988 ) * Update synonym-tokenfilter.asciidoc * Update synonym-tokenfilter.asciidoc	2016-08-17 13:39:22 +02:00
Nik Everett	7aeea764ba	Remove wait_for_status=yellow from the docs It is no longer required after `687e2e12b3`.	2016-07-15 16:02:07 -04:00
Clinton Gormley	6f17736eb1	Fixed asciidoc	2016-07-15 12:58:38 +02:00
Jim Ferenczi	881afcba60	Fixed tests that failed now that BM25 is the default similarity.	2016-06-21 15:42:42 +02:00
Nik Everett	a0585269be	[docs] s/lags/Flags/ Copy and paste lost an `F`.	2016-06-09 13:08:53 -04:00
Nik Everett	09cc4c449a	[docs] Pattern replace char filter now support flags	2016-06-09 12:41:20 -04:00
Clinton Gormley	5da9e5dcbc	Docs: Improved tokenizer docs (#18356 ) * Docs: Improved tokenizer docs Added descriptions and runnable examples * Addressed Nik's comments * Added TESTRESPONSEs for all tokenizer examples * Added TESTRESPONSEs for all analyzer examples too * Added docs, examples, and TESTRESPONSES for character filters * Skipping two tests: One interprets "$1" as a stack variable - same problem exists with the REST tests The other because the "took" value is always different * Fixed tests with "took" * Fixed failing tests and removed preserve_original from fingerprint analyzer	2016-05-19 19:42:23 +02:00
Nik Everett	8155e1efda	[docs] Add wait_for_status=yellow Another unstable snippet.... https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-os-compatibility/os=sles/402/console	2016-05-12 17:53:34 -04:00
Zachary Tong	5ee5cc25cc	Move AsciiFolding earlier in FingerprintAnalyzer filter chain Rearranges the FingerprintAnalyzer so that AsciiFolding comes earlier in the chain (after lowercasing, before stop removal, for maximum deduping power) Closes #18266	2016-05-12 09:34:15 -04:00
Clinton Gormley	97a41ee973	First pass at improving analyzer docs (#18269 ) * Docs: First pass at improving analyzer docs I've rewritten the intro to analyzers plus the docs for all analyzers to provide working examples. I've also removed: * analyzer aliases (see #18244) * analyzer versions (see #18267) * snowball analyzer (see #8690) Next steps will be tokenizers, token filters, char filters * Fixed two typos	2016-05-11 14:17:56 +02:00
Clinton Gormley	3f594089c2	Renamed all AUTOSENSE snippets to CONSOLE (#18210 )	2016-05-09 15:42:23 +02:00
Nik Everett	3912761572	[docs] Add wait_until_yellow to fix build failure The snippet in the docs creates an index and uses it with the _analyze api. The trouble is that if the index hasn't been created fully the _analyze API will fail. This adds a GET _cluster/health?wait_for_status=yellow which fixes the issue. While this does make the docs more cluttered, it also makes the snippets actually runnable. Closes #18165	2016-05-05 16:02:00 -04:00
Nik Everett	4b1c116461	Generate and run tests from the docs Adds infrastructure so `gradle :docs:check` will extract tests from snippets in the documentation and execute the tests. This is included in `gradle check` so it should happen on CI and during a normal build. By default each `// AUTOSENSE` snippet creates a unique REST test. These tests are executed in a random order and the cluster is wiped between each one. If multiple snippets chain together into a test you can annotate all snippets after the first with `// TEST[continued]` to have the generated tests for both snippets joined. Snippets marked as `// TESTRESPONSE` are checked against the response of the last action. See docs/README.asciidoc for lots more. Closes #12583. That issue is about catching bugs in the docs during build. This catches some bugs in the docs during build which is a good start.	2016-05-05 13:58:03 -04:00
Zachary Tong	80288ad60c	Add `fingerprint` token filter and `fingerprint` analyzer Adds a `fingerprint` token filter which uses Lucene's FingerprintFilter, and a `fingerprint` analyzer that combines the Fingerprint filter with lowercasing, stop word removal and asciifolding. Closes #13325	2016-04-20 16:10:56 -04:00
Clinton Gormley	a62b9296c6	Docs: Fixed link to phonetic plugin	2016-04-13 10:17:46 +02:00
Adrien Grand	b42f66c8ac	Document 5.0 mapping changes.	2016-03-22 16:22:58 +01:00
Clinton Gormley	dc21ab7576	Docs: Corrected behaviour of max_token_length in standard tokenizer	2016-03-18 10:58:16 +01:00
Clinton Gormley	a5a9bbfe88	Update compound-word-tokenfilter.asciidoc Only FOP v1.2 compatible hyphenation files are supported by the hyphenation decompounder	2016-03-11 15:08:36 +01:00
Lee Hinman	6adbbff97c	Fix organization rename in all files in project Basically a query-replace of "https://github.com/elasticsearch/" with "https://github.com/elastic/"	2016-03-03 12:04:13 -07:00
Andrey Ryaguzov	f744c3f724	Docs: Added migration description for custom analysis file path Closes #15597 Closes #15556	2016-02-29 20:56:19 +01:00
Dongjoon Hyun	21ea552070	Fix typos in docs.	2016-02-09 02:07:32 -08:00
Adrien Grand	f8e802c028	Merge pull request #15794 from damienalexandre/french-doc [Doc] Fix french analyzer elision token filter doc	2016-01-06 18:39:26 +01:00
Damien Alexandre	23a64f8214	Fix french analyzer elision token filter doc Fix #15774	2016-01-06 18:26:03 +01:00
David Pilato	995e796eab	[doc] Fix cross link with ICU plugin Doc bug introduced with #15695	2015-12-30 12:07:33 +01:00
David Pilato	3076377fdb	Remove ICU Plugin in reference guide This documentation lives now in plugins documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html. We don't need a copy in analysis reference guide.	2015-12-29 11:23:28 +01:00
socurites	485915bbe7	Comma (,) was duplicated; deleted it.	2015-12-24 14:31:26 +01:00
socurites	25d23091e2	Edge NGram: "side" setting was deprecated	2015-12-24 14:26:24 +01:00
Jason Tedor	d9a24961c5	Fix minor issues in delimited payload token filter docs This commit addresses a few minor issues in the delimited payload token filter docs: - the provided example reversed the payloads associated with the tokens "the" and "fox" - two additional typos in the same sentence - "per default" -> "by default" - "default int to" -> "default into" - adds two serial commas	2015-12-16 13:00:20 -05:00
tomoya yokota	82d26c852a	property name is not right `ignore_script` is not right. `ignored_script` is right. See org.elasticsearch.index.analysis.CJKBigramFilterFactory	2015-11-26 14:22:23 +09:00
Clinton Gormley	98028419a5	Merge pull request #14610 from yokotaso/patch-1 Update snowball document page.	2015-11-17 14:17:30 +01:00
Jason O'Donnell	42fb690a1c	Fixing typo	2015-10-26 16:46:36 -04:00
Adrien Grand	d3aa3565db	Deprecate `index.analysis.analyzer.default_index` in favor of `index.analysis.analyzer.default`. Close #11861	2015-10-12 22:19:16 +02:00
Clinton Gormley	1f76f49003	Update compound-word-tokenfilter.asciidoc Improved the docs for compound word token filter. Closes #13670 Closes #13595	2015-09-21 11:22:14 +02:00
Robert Muir	f216d92d19	Upgrade to lucene 5.4-snapshot r1701068	2015-09-03 15:13:33 -04:00

332 Commits