16 Commits

Author SHA1 Message Date
Clinton Gormley e4baa56f4b Docs: Language analyzers
Clarified the use of stem_exclusion and the keyword_marker
token filter

Closes #6613
2014-07-07 10:06:18 +02:00
Clinton Gormley 54790eea10 Update lang-analyzer.asciidoc
Clarified the use of the `stem_exclusion` token filter.

Closes #6613
2014-07-04 17:50:43 +02:00
Robert Muir b9a09c2b06 Analysis: Add additional Analyzers, Tokenizers, and TokenFilters from Lucene
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)

Add `classic` tokenizer: specific to English text and tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments Thai text into words.

Add `classic` tokenfilter: cleans up acronyms and possessives from classic tokenizer
Add `apostrophe` tokenfilter: removes text after apostrophe and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp S normalization
Add `hindi_normalization` tokenfilter: accounts for Hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different Unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes Kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, Swedish text
Add `scandinavian_folding` tokenfilter: much more aggressive form of `scandinavian_normalization`
Add additional languages to stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`

Add support for accessing the default Thai stopword set "_thai_"

Fix some bugs and broken links in documentation.

Closes #5935
2014-07-03 05:47:49 -04:00
Clinton Gormley 04dacaaf27 Docs: Use the "stemmer" token filter for the english analyzer, to be consistent 2014-06-11 13:47:07 +02:00
Clinton Gormley 8a94b71b75 Docs: Corrected the use of keyword_marker on the lang analyzers 2014-06-11 13:43:02 +02:00
Clinton Gormley 5e40868f44 Docs: Fixed a bad ref on lang analyzers page 2014-06-09 23:03:12 +02:00
Clinton Gormley 5c5c1da06c Docs: Fixed some errors on the language analyzers page 2014-06-09 22:51:28 +02:00
Clinton Gormley 585b0ef730 Docs: Added custom-analyzer equivalents of all the language analyzers 2014-06-09 22:41:25 +02:00
Alexander Reelsen c6155c5142 release [1.0.0.RC1] 2014-01-15 17:02:22 +00:00
Benjamin Vetter ba8e012be9 Referring to stop analyzer for stopword docs #329 2014-01-14 11:53:30 +01:00
Benjamin Vetter 22a96e6a18 Added stopwords: _none_ to the docs #329 2014-01-14 11:53:29 +01:00
Simon Willnauer 7f63ddf94e Default stopwords list should be `_none_` for all but language-specific analyzers
The `standard_html_strip` and `pattern` analyzers support stopwords, which are
set to the `english` stopwords by default. Those analyzers
should not use stopwords by default since they are language-neutral.

Closes #4699
2014-01-13 14:44:10 +01:00
Simon Willnauer 77bc5d5ecf release [1.0.0.Beta1] 2013-11-06 15:32:43 +01:00
Simon Willnauer 9654631186 Change 'standard' analyzer to use empty stopword list by default.
The 'default' / 'standard' analyzer can be a trappy default since it filters
English stopwords by default. Yet a default should not be dedicated to a certain language
since Elasticsearch is used in many different scenarios where a standard analysis chain
with specialization to English full-text might be rather counterproductive.

This commit changes the 'standard' analyzer to use an empty stopword list for indices
that are created from version 1.0.0.Beta1 onwards but will maintain backwards compatibility
for older indices.

Closes #3775
2013-11-05 21:07:21 +01:00
Ben McCann cc4bc7d57d Fix nonsensical sentence in standard analyzer documentation so that it is more understandable 2013-10-25 00:18:32 +02:00
Clinton Gormley 822043347e Migrated documentation into the main repo 2013-08-29 01:24:34 +02:00