
138 Commits

Author SHA1 Message Date
Jim Ferenczi 63bdd01eb7 Expose WordDelimiterGraphTokenFilter (#23327)
This change exposes the new Lucene graph based word delimiter token filter in the analysis filters.
Unlike `word_delimiter`, the `word_delimiter_graph` token filter correctly handles multi-term expansion at query time.

Closes #23104
2017-02-24 00:53:38 +01:00
markwalkom ced99dde50 Update stop-analyzer.asciidoc (#23195)
Clarified where the stopwords file needs to live
2017-02-16 13:36:15 +01:00
Adrien Grand f3509b8003 Consolify docs/reference/analysis/tokenfilters/pattern-capture-tokenfilter.asciidoc. (#23050) 2017-02-13 11:00:12 +01:00
Clinton Gormley f5e7c25e24 Update normalizers.asciidoc
analyzers -> normalizers
2017-02-07 12:09:39 +01:00
Shubham Aggarwal e07e4cc4dd Fix incorrect heading for Whitespace Tokenizer (#22883) 2017-01-31 12:51:37 +01:00
Daniel Mitterdorfer aece89d6a1 Make boolean conversion strict (#22200)
This PR removes all leniency in the conversion of Strings to booleans: "true"
is converted to the boolean value `true`, "false" is converted to the boolean
value `false`. Everything else raises an error.
2017-01-19 07:59:18 +01:00
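
The strict conversion described in aece89d6a1 can be sketched as follows. This is a minimal illustration in Python, not the Elasticsearch code: the function name `parse_boolean_strict` and the error message wording are assumptions made for this example.

```python
def parse_boolean_strict(value: str) -> bool:
    """Convert "true" or "false" to a bool and reject everything else.

    Hypothetical helper for illustration; it only mirrors the rule stated
    in the commit message ("Everything else raises an error").
    """
    if value == "true":
        return True
    if value == "false":
        return False
    # No leniency: "yes", "1", "on", "True", "" and so on are all errors.
    raise ValueError(
        f"cannot parse [{value}] as a boolean; only [true] or [false] are allowed"
    )
```

With this rule, values such as `"1"`, `"yes"` or `"True"` are rejected rather than silently mapped to a boolean, so configuration mistakes are reported as errors.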
Michael McCandless 1d1bdd476c Finish exposing FlattenGraphTokenFilter (#22667) 2017-01-18 11:05:34 -05:00
Clinton Gormley 519a9c469d Update truncate token filter to not mention the keyword tokenizer
The advice predates the existence of the keyword field

Closes #22650
2017-01-17 12:15:22 +01:00
Matt Weber 609d2aab15 QueryString and SimpleQueryString Graph Support (#22541)
Add support for graph token streams to "query_string" and
"simple_query_string" queries.
2017-01-11 18:59:43 +01:00
Achraf 5dc85c25d9 Hindu-Arabico-Latino Numerals (#22476)
Same edit as for https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
2017-01-10 15:24:56 +01:00
Adrien Grand 3f805d68cb Add the ability to set an analyzer on keyword fields. (#21919)
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.

Closes #18064
2016-12-30 09:36:10 +01:00
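
The idea in 3f805d68cb (per-character normalization applied to the indexed value while `_source` stays untouched) can be sketched as below. This is a rough Python illustration under stated assumptions: `normalize_keyword` and `index_keyword` are hypothetical names, and the Unicode decomposition used here only approximates ASCII folding; it is not the Lucene filter.

```python
import unicodedata


def normalize_keyword(value: str) -> str:
    """Apply per-character normalization: fold accents, then lowercase.

    Hypothetical helper; a rough stand-in for ascii folding + lowercasing.
    """
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop combining marks (accents) left over after decomposition.
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower()


def index_keyword(value: str) -> dict:
    """Return the original value (as `_source` keeps it) and the indexed term."""
    return {"_source": value, "indexed_term": normalize_keyword(value)}
```

For example, `index_keyword("Café")` keeps `"Café"` as the source value while the indexed term becomes `"cafe"`. A stemming step has no place in such a function because it works on whole tokens rather than on individual characters.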
Francesc Gil dec6fc2d40 Repeated language analyzers (#22240)
* Repeated language analyzers

The `catalan` analyzer was repeated on the supported list :)

* Reordered the languages to have alphabetic order

* Added space for format

* Reordered the languages and removed repeated
2016-12-21 17:32:02 +01:00
Thibault Pierre e494d6a94e Fix wrong link (#22019) 2016-12-07 17:58:46 +01:00
Allen Torres 887fbb6387 Update lowercase-tokenizer.asciidoc (#21896)
Fixed typo
2016-12-02 10:49:51 -05:00
Matt Weber 04e07bcdb6 Synonym Graph Support (LUCENE-6664) (#21517)
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.

This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.
2016-11-28 09:25:49 -08:00
Achraf d81a928b1f Correction of the names of numerals (#21531)
What was called Arabic numerals is actually Hindu - Eastern Arabic notation. And the Latin numerals you refer to are the Arabic numbers.
2016-11-25 14:30:49 +01:00
Pascal Borreli fcb01deb34 Fixed typos (#20843) 2016-10-10 14:51:47 -06:00
Clinton Gormley 22f1acde94 Docs: Pattern analyzer does not support a max_token_length parameter
Closes #20713
2016-10-08 12:27:33 +02:00
Alexander Lin 7cd0316b51 Fix minhash docs level
Relates #20547
2016-09-19 07:54:04 -04:00
Clinton Gormley 2f6d0119f1 Added warning messages about the dangers of pathological regexes to:
* pattern-replace charfilter
* pattern-capture and pattern-replace token filters
* pattern tokenizer
* pattern analyzer

Relates to #20038
2016-09-09 09:53:07 +02:00
Alexander Lin f825e8f4cb Exposing lucene 6.x minhash filter. (#20206)
Exposing lucene 6.x minhash tokenfilter

Generate min hash tokens from an incoming stream of tokens that can
be used to estimate document similarity.

Closes #20149
2016-09-07 09:38:12 +02:00
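
How min hash tokens can estimate document similarity, as mentioned in f825e8f4cb, is sketched below. This is the textbook MinHash scheme written in Python, not Lucene's filter: the function names, the use of `hashlib.blake2b` with a per-seed salt, and the default of 64 hash functions are all assumptions chosen for the example.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """Return one minimum hash value per seeded hash function.

    Hypothetical helper; raises ValueError if `tokens` is empty.
    """
    token_set = set(tokens)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different hash function per seed
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        tok.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for tok in token_set
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two token streams with the same set of tokens yield identical signatures, and the more tokens two streams share, the more signature positions are expected to agree.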
Jim Ferenczi 4682fc34ae Add the ability to disable the retrieval of the stored fields entirely
This change adds a special field named _none_ that disables the retrieval of stored fields in a search request or in a TopHitsAggregation.

To completely disable stored fields retrieval (including disabling metadata fields retrieval such as _id or _type) use _none_ like this:

````
POST _search
{
   "stored_fields": "_none_"
}
````
2016-08-24 16:40:08 +02:00
markwalkom f556424ab9 Update synonym-tokenfilter.asciidoc (#19988)
* Update synonym-tokenfilter.asciidoc

* Update synonym-tokenfilter.asciidoc
2016-08-17 13:39:22 +02:00
Nik Everett 7aeea764ba Remove wait_for_status=yellow from the docs
It is no longer required after 687e2e12b3.
2016-07-15 16:02:07 -04:00
Clinton Gormley 6f17736eb1 Fixed asciidoc 2016-07-15 12:58:38 +02:00
Jim Ferenczi 881afcba60 Fixed tests that failed now that BM25 is the default similarity. 2016-06-21 15:42:42 +02:00
Nik Everett a0585269be [docs] s/lags/Flags/
Copy and paste lost an `F`.
2016-06-09 13:08:53 -04:00
Nik Everett 09cc4c449a [docs] Pattern replace char filter now supports flags 2016-06-09 12:41:20 -04:00
Clinton Gormley 5da9e5dcbc Docs: Improved tokenizer docs (#18356)
* Docs: Improved tokenizer docs

Added descriptions and runnable examples

* Addressed Nik's comments

* Added TESTRESPONSEs for all tokenizer examples

* Added TESTRESPONSEs for all analyzer examples too

* Added docs, examples, and TESTRESPONSES for character filters

* Skipping two tests:

One interprets "$1" as a stack variable - same problem exists with the REST tests

The other because the "took" value is always different

* Fixed tests with "took"

* Fixed failing tests and removed preserve_original from fingerprint analyzer
2016-05-19 19:42:23 +02:00
Nik Everett 8155e1efda [docs] Add wait_for_status=yellow
Another unstable snippet....

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-os-compatibility/os=sles/402/console
2016-05-12 17:53:34 -04:00
Zachary Tong 5ee5cc25cc Move AsciiFolding earlier in FingerprintAnalyzer filter chain
Rearranges the FingerprintAnalyzer so that AsciiFolding comes earlier in the chain (after lowercasing, before stop removal, for maximum deduping power)

Closes #18266
2016-05-12 09:34:15 -04:00
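
The filter order described in 5ee5cc25cc (lowercasing, then ASCII folding, then stop removal, followed by the fingerprint step that sorts, de-duplicates and concatenates tokens) can be sketched as follows. This is a Python approximation, not the analyzer itself: the regex tokenizer, the tiny stop word list, and the function names are hypothetical choices for illustration.

```python
import re
import unicodedata

# Hypothetical stop word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the"})


def ascii_fold(token):
    """Rough stand-in for ASCII folding: strip combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text, stop_words=STOP_WORDS, separator=" "):
    """Build a fingerprint using the filter order from the commit message."""
    tokens = re.findall(r"\w+", text)                    # simplistic tokenizer
    tokens = [t.lower() for t in tokens]                 # 1. lowercase
    tokens = [ascii_fold(t) for t in tokens]             # 2. fold early
    tokens = [t for t in tokens if t not in stop_words]  # 3. stop removal
    return separator.join(sorted(set(tokens)))           # 4. sort, dedupe, join
```

Because folding runs before stop removal, an accented variant such as "Thé" is folded to "the" and then removed as a stop word, and "Café" and "cafe" collapse into a single token.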
Clinton Gormley 97a41ee973 First pass at improving analyzer docs (#18269)
* Docs: First pass at improving analyzer docs

I've rewritten the intro to analyzers plus the docs
for all analyzers to provide working examples.

I've also removed:

* analyzer aliases (see #18244)
* analyzer versions (see #18267)
* snowball analyzer (see #8690)

Next steps will be tokenizers, token filters, char filters

* Fixed two typos
2016-05-11 14:17:56 +02:00
Clinton Gormley 3f594089c2 Renamed all AUTOSENSE snippets to CONSOLE (#18210) 2016-05-09 15:42:23 +02:00
Nik Everett 3912761572 [docs] Add wait_until_yellow to fix build failure
The snippet in the docs creates an index and uses it with the
_analyze api. The trouble is that if the index hasn't been created
fully the _analyze API will fail. This adds a
GET _cluster/health?wait_for_status=yellow
which fixes the issue.

While this does make the docs more cluttered, it also makes the snippets
actually runnable.

Closes #18165
2016-05-05 16:02:00 -04:00
Nik Everett 4b1c116461 Generate and run tests from the docs
Adds infrastructure so `gradle :docs:check` will extract tests from
snippets in the documentation and execute the tests. This is included
in `gradle check` so it should happen on CI and during a normal build.

By default each `// AUTOSENSE` snippet creates a unique REST test. These
tests are executed in a random order and the cluster is wiped between
each one. If multiple snippets chain together into a test you can annotate
all snippets after the first with `// TEST[continued]` to have the
generated tests for both snippets joined.

Snippets marked as `// TESTRESPONSE` are checked against the response
of the last action.

See docs/README.asciidoc for lots more.

Closes #12583. That issue is about catching bugs in the docs during build.
This catches *some* bugs in the docs during build which is a good start.
2016-05-05 13:58:03 -04:00
Zachary Tong 80288ad60c Add `fingerprint` token filter and `fingerprint` analyzer
Adds a `fingerprint` token filter which uses Lucene's FingerprintFilter,
and a `fingerprint` analyzer that combines the Fingerprint filter with
lowercasing, stop word removal and asciifolding.

Closes #13325
2016-04-20 16:10:56 -04:00
Clinton Gormley a62b9296c6 Docs: Fixed link to phonetic plugin 2016-04-13 10:17:46 +02:00
Adrien Grand b42f66c8ac Document 5.0 mapping changes. 2016-03-22 16:22:58 +01:00
Clinton Gormley dc21ab7576 Docs: Corrected behaviour of max_token_length in standard tokenizer 2016-03-18 10:58:16 +01:00
Clinton Gormley a5a9bbfe88 Update compound-word-tokenfilter.asciidoc
Only FOP v1.2 compatible hyphenation files are supported by the hyphenation decompounder
2016-03-11 15:08:36 +01:00
Lee Hinman 6adbbff97c Fix organization rename in all files in project
Basically a query-replace of "https://github.com/elasticsearch/" with "https://github.com/elastic/"
2016-03-03 12:04:13 -07:00
Andrey Ryaguzov f744c3f724 Docs: Added migration description for custom analysis file path
Closes #15597
Closes #15556
2016-02-29 20:56:19 +01:00
Dongjoon Hyun 21ea552070 Fix typos in docs. 2016-02-09 02:07:32 -08:00
Adrien Grand f8e802c028 Merge pull request #15794 from damienalexandre/french-doc
[Doc] Fix french analyzer elision token filter doc
2016-01-06 18:39:26 +01:00
Damien Alexandre 23a64f8214 Fix french analyzer elision token filter doc
Fix #15774
2016-01-06 18:26:03 +01:00
David Pilato 995e796eab [doc] Fix cross link with ICU plugin
Doc bug introduced with #15695
2015-12-30 12:07:33 +01:00
David Pilato 3076377fdb Remove ICU Plugin in reference guide
This documentation now lives in the plugins documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html.

We don't need a copy in the analysis reference guide.
2015-12-29 11:23:28 +01:00
socurites 485915bbe7 comma(,) was duplicated
deleted it.
2015-12-24 14:31:26 +01:00
socurites 25d23091e2 Edge NGram: "side" setting was deprecated
Edge NGram: "side" setting was deprecated
2015-12-24 14:26:24 +01:00
Jason Tedor d9a24961c5 Fix minor issues in delimited payload token filter docs
This commit addresses a few minor issues in the delimited payload token
filter docs:
  - the provided example reversed the payloads associated with the
    tokens "the" and "fox"
  - two additional typos in the same sentence
    - "per default" -> "by default"
    - "default int to" -> "default into"
  - adds two serial commas
2015-12-16 13:00:20 -05:00