21 Commits

Author SHA1 Message Date
Andy Bristol
4c5bd57619 Rename simple pattern tokenizers (#25300)
Changed the names to snake_case for consistency

Related to #25159, original issue #23363
2017-06-19 13:48:43 -07:00
Adrien Grand
0c117145f6 Upgrade to lucene-7.0.0-snapshot-92b1783. (#25222)
This snapshot has faster range queries on range fields (LUCENE-7828), more accurate norms (LUCENE-7730), and the ability to use fake term frequencies (LUCENE-7854).
2017-06-15 09:52:07 +02:00
Andy Bristol
48696ab544 expose simple pattern tokenizers (#25159)
Expose the experimental simplepattern and simplepatternsplit tokenizers in the common analysis plugin. They provide tokenization based on regular expressions, using Lucene's deterministic regex implementation, which is usually faster than Java's and guards against creating overly deep stacks during matching.

Both have a not-very-useful default pattern of the empty string, because all tokenizer factories must be instantiable at index creation time. In practice they should always be configured by the user.
2017-06-13 12:46:59 -07:00
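
After the rename in #25300 above, these ship as `simple_pattern` and `simple_pattern_split`. A minimal sketch of configuring one, since the default empty pattern is not useful (the index, tokenizer, and analyzer names here are illustrative, not from the commit):

# Hypothetical index using simple_pattern to emit runs of exactly three digits.
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "three_digit_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      },
      "analyzer": {
        "three_digit_analyzer": {
          "type": "custom",
          "tokenizer": "three_digit_tokenizer"
        }
      }
    }
  }
}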
Shubham Aggarwal
e07e4cc4dd Fix incorrect heading for Whitespace Tokenizer (#22883) 2017-01-31 12:51:37 +01:00
Allen Torres
887fbb6387 Update lowercase-tokenizer.asciidoc (#21896)
Fixed typo
2016-12-02 10:49:51 -05:00
Pascal Borreli
fcb01deb34 Fixed typos (#20843) 2016-10-10 14:51:47 -06:00
Clinton Gormley
2f6d0119f1 Added warning messages about the dangers of pathological regexes to:
* pattern-replace charfilter
* pattern-capture and pattern-replace token filters
* pattern tokenizer
* pattern analyzer

Relates to #20038
2016-09-09 09:53:07 +02:00
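
The pathology these warnings describe is catastrophic backtracking: Java's regex engine can take exponential time on patterns with nested quantifiers. An illustration of the kind of configuration the warnings target (the index name and pattern are invented for this sketch):

# "(a+)+b" backtracks catastrophically on long runs of 'a' not followed
# by 'b', so adversarial input can stall analysis indefinitely.
PUT risky_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "risky_char_filter": {
          "type": "pattern_replace",
          "pattern": "(a+)+b",
          "replacement": ""
        }
      }
    }
  }
}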
Nik Everett
7aeea764ba Remove wait_for_status=yellow from the docs
It is no longer required after 687e2e12b31ed3c12ef4c411333bff9da58fc808.
2016-07-15 16:02:07 -04:00
Jim Ferenczi
881afcba60 Fixed tests that failed now that BM25 is the default similarity. 2016-06-21 15:42:42 +02:00
Clinton Gormley
5da9e5dcbc Docs: Improved tokenizer docs (#18356)
* Docs: Improved tokenizer docs

Added descriptions and runnable examples

* Addressed Nik's comments

* Added TESTRESPONSEs for all tokenizer examples

* Added TESTRESPONSEs for all analyzer examples too

* Added docs, examples, and TESTRESPONSES for character filters

* Skipped two tests:

One interprets "$1" as a stack variable; the same problem exists with the REST tests.

The other fails because the "took" value is always different.

* Fixed tests with "took"

* Fixed failing tests and removed preserve_original from fingerprint analyzer
2016-05-19 19:42:23 +02:00
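
For context on the skipped "$1" test: the pattern_replace char filter uses `$1`-style backreferences in its replacement string, which the docs test harness read as a stack variable. A sketch of the kind of example involved (pattern and text are illustrative):

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "My credit card is 123-456-789"
}
# "$1" here is a regex backreference, not a variable; the filter
# rewrites the text to "My credit card is 123_456_789".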
Clinton Gormley
dc21ab7576 Docs: Corrected behaviour of max_token_length in standard tokenizer 2016-03-18 10:58:16 +01:00
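
The corrected behaviour: the standard tokenizer does not discard over-long tokens, it splits them at max_token_length intervals. A quick check, assuming the current inline-definition form of the _analyze API:

POST _analyze
{
  "tokenizer": {
    "type": "standard",
    "max_token_length": 5
  },
  "text": "jumped"
}
# Returns the tokens [ jumpe, d ]: the token is split at the limit,
# not dropped.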
Clinton Gormley
cb00d4a542 Docs: Removed all the added/deprecated tags from 1.x 2014-09-26 21:04:42 +02:00
Simon Willnauer
5bfea56457 [DOCS] move all coming tags to added in master 2014-07-23 16:37:19 +02:00
Mikhail Korobov
955473f475 Docs: unescape regexes in Pattern Tokenizer docs
The regexes in the Pattern Tokenizer docs are currently escaped (seemingly according to Java string rules). It is better not to escape them: JSON escaping should be handled automatically by client libraries, and string escaping depends on the client language used. The default pattern is `\W+`, not `\\W+`.

Closes #6615
2014-07-03 13:34:13 +02:00
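
Concretely: the regex itself is `\W+`, and the doubled backslash is only JSON string escaping. A raw JSON request still has to escape it, but docs and client code should present the pattern as `\W+` (the sample text is illustrative):

POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\W+"
  },
  "text": "foo,bar baz"
}
# "\\W+" is the JSON encoding of the default regex \W+;
# the tokens come back as [ foo, bar, baz ].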
Robert Muir
b9a09c2b06 Analysis: Add additional Analyzers, Tokenizers, and TokenFilters from Lucene
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)

Add `classic` tokenizer: specific to English text; tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments Thai text into words.

Add `classic` tokenfilter: cleans up acronyms and possessives from the classic tokenizer
Add `apostrophe` tokenfilter: removes the text after an apostrophe, including the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp-S normalization
Add `hindi_normalization` tokenfilter: accounts for Hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different Unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes Kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, and Swedish text
Add `scandinavian_folding` tokenfilter: a much more aggressive form of `scandinavian_normalization`
Add additional languages to the stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`

Add access to the default Thai stopword set `_thai_`

Fix some bugs and broken links in the documentation.

Closes #5935
2014-07-03 05:47:49 -04:00
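
A quick way to exercise these additions is the _analyze API; for instance, the classic tokenizer with its companion classic tokenfilter (request shape per the current API; the sample text is invented):

POST _analyze
{
  "tokenizer": "classic",
  "filter": [ "classic" ],
  "text": "The I.B.M. server's address"
}
# The classic tokenfilter strips the possessive 's and the dots from
# acronyms produced by the classic tokenizer: I.B.M. -> IBM, server's -> server.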
Richard Boulton
fdb5eb6555 Update keyword-tokenizer.asciidoc 2014-05-07 15:04:07 +02:00
Clinton Gormley
4c34615686 [DOCS] Fixed some bad UTF8 2014-03-19 12:46:06 +01:00
Yousef
302c762d5e Fixed wrong link to Token Filter 2013-12-03 10:39:13 +01:00
Adrien Grand
90524d7ad2 Fix formatting of the documentation.
Remaining '@'s have been replaced with '`'s.
2013-09-18 12:35:44 +02:00
Clinton Gormley
393c28bee4 [DOCS] Removed outdated new/deprecated version notices 2013-09-03 21:28:31 +02:00
Clinton Gormley
822043347e Migrated documentation into the main repo 2013-08-29 01:24:34 +02:00