Commit Graph

73 Commits

Author SHA1 Message Date
James Rodewig 7ef906fde8 [DOCS] Add tutorials section to analysis topic (#50809)
Adds a 'Configure text analysis' page to house tutorial content for the
analysis topic.

Also relocates the following pages as children of this new page:

* 'Test an analyzer'
* 'Configuring built-in analyzers'
* 'Create a custom analyzer'

I plan to add a tutorial for specifying index-time and search-time
analyzers to this section as part of a future PR.
2020-01-16 13:12:06 -05:00
Nik Everett 01293ebad5
Fix docs typos (#50365) (#50464)
Fixes a few typos in the docs.

Co-authored-by: Xiang Dai <764524258@qq.com>
2019-12-23 12:38:17 -05:00
gpaimla 7d20b50f45 Implement Lucene EstonianAnalyzer, Stemmer (#49149)
This PR adds a new analyzer and stemmer for the Estonian language.

Closes #48895
2019-11-18 17:24:21 +01:00
Wilder Pereira 8c73e215b2 [DOCS] Remove unneeded spaces from custom analyzer snippet (#47332) 2019-10-15 15:53:16 -04:00
James Rodewig b59ecde041
[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502) 2019-09-09 13:38:14 -04:00
James Rodewig bb7bff5e30
[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result]" (#46295) (#46418) 2019-09-06 09:22:08 -04:00
James Rodewig 3e62cf9d74 [DOCS] Correct custom analyzer callouts (#46030) 2019-08-29 10:08:18 -04:00
Guilherme Ferreira 48a17d5768 [Docs] Correct default stop list constant (#41342) 2019-04-23 19:13:51 +02:00
Guilherme Ferreira 414debd740 [Docs] Correct spelling the "_none_" stopwords element (#41191) 2019-04-15 14:12:26 +02:00
Mayya Sharipova 0e1b1959fe
Correct rebuilt persian analyzer (#38724) (#38744)
Make substitution of \u200C with a space explicit

The problem is that the `\u200C` symbol in a test string **SHOULD** be
substituted with a space by the rebuilt Persian analyzer, but it is not.

Correcting this line `"mappings": [ "\\u200C=> "] <1>` to
 `"mappings": [ "\\u200C=>\\u0020"] <1>` solves the problem.
This change explicitly says to substitute ZWNJ with a space.
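
As a rough illustration of what the corrected rule does, the sketch below mimics a mapping character filter in plain Python. The `apply_mappings` helper and the `persian_mappings` name are hypothetical, and the sequential replacement is a simplification, not how the actual filter is implemented.

```python
# Zero-width non-joiner (ZWNJ), the character named in the commit message.
ZWNJ = "\u200c"


def apply_mappings(text: str, mappings: dict) -> str:
    """Toy stand-in for a mapping character filter (hypothetical helper).

    Replaces every occurrence of each key with its mapped value.
    """
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


# Counterpart of the corrected rule "\\u200C=>\\u0020":
# the ZWNJ is replaced by an explicit U+0020 space.
persian_mappings = {ZWNJ: "\u0020"}

normalized = apply_mappings("a" + ZWNJ + "b", persian_mappings)
print(repr(normalized))
```

Writing the target as `\u0020` rather than a literal trailing space keeps the intent visible even if surrounding whitespace is trimmed.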

Closes #38188
2019-02-11 14:17:18 -05:00
Christoph Büscher 34f2d2ec91
Remove remaining occurrences of "include_type_name=true" in docs (#37646) 2019-01-22 15:13:52 +01:00
Christoph Büscher 25aac4f77f
Remove `include_type_name` in asciidoc where possible (#37568)
The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
like they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creation often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
2019-01-18 09:34:11 +01:00
Julie Tibshirani 36a3b84fc9
Update the default for include_type_name to false. (#37285)
* Default include_type_name to false for get and put mappings.

* Default include_type_name to false for get field mappings.

* Add a constant for the default include_type_name value.

* Default include_type_name to false for get and put index templates.

* Default include_type_name to false for create index.

* Update create index calls in REST documentation to use include_type_name=true.

* Some minor clean-ups around the get index API.

* In REST tests, use include_type_name=true by default for index creation.

* Make sure to use 'expression == false'.

* Clarify the different IndexTemplateMetaData toXContent methods.

* Fix FullClusterRestartIT#testSnapshotRestore.

* Fix the ml_anomalies_default_mappings test.

* Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests.

We make sure to specify include_type_name=true during xContent parsing,
so we continue to test the legacy typed responses. XContent generation
for the typeless responses is currently only covered by REST tests,
but we will be adding unit test coverage for these as we implement
each typeless API in the Java HLRC.

This commit also refactors GetMappingsResponse to follow the same approach
as the other mappings-related responses, where we read include_type_name
out of the xContent params, instead of creating a second toXContent method.
This gives better consistency in the response parsing code.

* Fix more REST tests.

* Improve some wording in the create index documentation.

* Add a note about types removal in the create index docs.

* Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL.

* Make sure to mention include_type_name in the REST docs for affected APIs.

* Make sure to use 'expression == false' in FullClusterRestartIT.

* Mention include_type_name in the REST templates docs.
2019-01-14 13:08:01 -08:00
Josh Soref edb48321ba [DOCS] Various spelling corrections (#37046) 2019-01-07 14:44:12 +01:00
Alan Woodward f6a43b5939
Add a prebuilt ICU Analyzer (#34958)
The ICU plugin provides the building blocks of an analysis chain, but doesn't actually have a prebuilt analyzer. It would be better for users if there were a simple analyzer that they could use out of the box, and also something we can point to from the CJK Analyzer docs as a superior alternative.

Relates to #34285
2018-11-21 09:00:48 +00:00
Nikolay Vasiliev 16956a1a05 [DOCS] Clarify 'type' parameter meaning for custom analyzer (#34012)
This pull request improves the docs on the meaning of type parameter on the custom 
analyzer doc page. 

Closes #33456
2018-09-25 15:32:27 +02:00
Jim Ferenczi 7ad71f906a
Upgrade to a Lucene 8 snapshot (#33310)
The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However, this optimization is not activated in this change; another issue has been opened to discuss how it should be integrated smoothly.
Some comments about the change:
* Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309

Closes #32899
2018-09-06 14:42:06 +02:00
Jim Ferenczi bdb79d021a
Fix docs failure on language analyzers (#30722)
This commit fixes docs failure on language analyzers when compared to the built in analyzers.
The `elision` filters used by the rebuilt language analyzers should be case insensitive to match
the definition of the prebuilt analyzers.

Closes #30557
2018-05-22 09:58:12 +02:00
Nik Everett 9881bfaea5
Docs: Document how to rebuild analyzers (#30498)
Adds documentation for how to rebuild all the built in analyzers and
tests for that documentation using the mechanism added in #29535.

Closes #29499
2018-05-14 18:40:54 -04:00
Nik Everett f9dc86836d
Docs: Test examples that recreate lang analyzers (#29535)
We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built *an* analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in analyzer. Unsurprisingly, some of the examples aren't quite
right.

This adds a mechanism that tests the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
`_analyze` API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both APIs return the same results.
Each of these calls to `_analyze` takes about 20ms on my laptop which
seems fine.
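
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two analyzers are modeled as plain functions from text to a token list, whereas the real mechanism calls the `_analyze` API, and every name below is hypothetical.

```python
import random


def random_unicode_string(rng: random.Random, max_len: int = 20) -> str:
    """Build a random string from code points U+0020..U+2FFF.

    The range is an arbitrary choice for this sketch; it sits below the
    surrogate block, so every code point is a valid character.
    """
    length = rng.randint(1, max_len)
    return "".join(chr(rng.randint(0x20, 0x2FFF)) for _ in range(length))


def analyzers_agree(rebuilt, built_in, rounds: int = 100, seed: int = 0):
    """Feed the same random inputs to both analyzers and compare the tokens.

    Returns (True, None) if every round agrees, otherwise (False, text)
    where text is the first input on which the outputs differ.
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        text = random_unicode_string(rng)
        if rebuilt(text) != built_in(text):
            return False, text
    return True, None


def built_in_analyzer(text: str) -> list:
    """Stand-in for a built in analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def rebuilt_without_lowercase(text: str) -> list:
    """Stand-in for a faulty rebuilt analyzer that forgets to lowercase."""
    return text.split()


print(analyzers_agree(built_in_analyzer, built_in_analyzer))
```

Returning the first disagreeing input makes a failure easy to reproduce, and a fixed seed keeps the run repeatable.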
2018-05-09 09:23:10 -04:00
deepybee 48c8098e15 Fixed several typos in analyzers section (#28247) 2018-01-18 08:51:53 +00:00
Adrien Grand 1b660821a2
Allow `_doc` as a type. (#27816)
Allowing `_doc` as a type will enable users to make the transition to 7.0
smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`.
This also moves most of the documentation to `_doc` as a type name.

Closes #27750
Closes #27751
2017-12-14 17:47:53 +01:00
Md. Abdulla-Al-Sun a40c474e10
Added Bengali Analyzer to Elasticsearch with respect to the lucene update(PR#238) 2017-10-05 13:25:05 +02:00
Tahmim Ahmed Shibli 34662c9e6d [Docs] Fix name of character filter in example. (#26724) 2017-09-20 17:08:43 +02:00
Nik Everett 514187be8e Fix language in some docs
The pattern-analyzer docs contained a snippet that was an expanded
regex that was marked as `[source,js]`. This changes it to
`[source,regex]`.

The htmlstrip-charfilter and pattern-replace-charfilter docs had
examples that were actually a list of tokens but marked `[source,js]`.
This marks them as `[source,text]` so they don't count as unconverted
CONSOLE snippets.

The pattern-replace-charfilter also had a doc whose test was
skipped because of funny interaction with the test framework. This
fixes the test.

Three more down, eighty-two to go.

Relates to #18160
2017-04-01 14:45:44 -04:00
Nik Everett 9baa48a928 CONSOLEify lang-analyzer docs
CONSOLEifies the lang-analyzer docs and replaces the (invalid)
empty `keyword_marker` setups that were on the page with one
that contains the word "example" translated into the appropriate
language.

Relates to #18160
2017-04-01 14:21:58 -04:00
markwalkom ced99dde50 Update stop-analyzer.asciidoc (#23195)
Clarified where the stopwords file needs to live
2017-02-16 13:36:15 +01:00
Francesc Gil dec6fc2d40 Repeated language analyzers (#22240)
* Repeated language analyzers

The `catalan` analyzer was repeated on the supported list :)

* Reordered the languages to have alphabetic order

* Added space for format

* Reordered the languages and removed repeated
2016-12-21 17:32:02 +01:00
Clinton Gormley 22f1acde94 Docs: Pattern analyzer does not support a max_token_length parameter
Closes #20713
2016-10-08 12:27:33 +02:00
Clinton Gormley 2f6d0119f1 Added warning messages about the dangers of pathological regexes to:
* pattern-replace charfilter
* pattern-capture and pattern-replace token filters
* pattern tokenizer
* pattern analyzer

Relates to #20038
2016-09-09 09:53:07 +02:00
Jim Ferenczi 4682fc34ae Add the ability to disable the retrieval of the stored fields entirely
This change adds a special field named _none_ that allows disabling the retrieval of the stored fields in a search request or in a TopHitsAggregation.

To completely disable stored fields retrieval (including disabling metadata fields retrieval such as _id or _type) use _none_ like this:

````
POST _search
{
   "stored_fields": "_none_"
}
````
2016-08-24 16:40:08 +02:00
Nik Everett 7aeea764ba Remove wait_for_status=yellow from the docs
It is no longer required after 687e2e12b3.
2016-07-15 16:02:07 -04:00
Nik Everett a0585269be [docs] s/lags/Flags/
Copy and paste lost an `F`.
2016-06-09 13:08:53 -04:00
Clinton Gormley 5da9e5dcbc Docs: Improved tokenizer docs (#18356)
* Docs: Improved tokenizer docs

Added descriptions and runnable examples

* Addressed Nik's comments

* Added TESTRESPONSEs for all tokenizer examples

* Added TESTRESPONSEs for all analyzer examples too

* Added docs, examples, and TESTRESPONSES for character filters

* Skipping two tests:

One interprets "$1" as a stack variable - same problem exists with the REST tests

The other because the "took" value is always different

* Fixed tests with "took"

* Fixed failing tests and removed preserve_original from fingerprint analyzer
2016-05-19 19:42:23 +02:00
Zachary Tong 5ee5cc25cc Move AsciiFolding earlier in FingerprintAnalyzer filter chain
Rearranges the FingerprintAnalyzer so that AsciiFolding comes earlier in the chain (after lowercasing, before stop removal, for maximum deduping power)
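
A minimal Python sketch of the chain order described above: lowercase, fold to ASCII, remove stop words, then sort and deduplicate. Whitespace tokenization and the NFKD-based folding are assumptions made for the example, not the Lucene implementation, and the function names are hypothetical.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Approximate ASCII folding by stripping combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fingerprint(text: str, stopwords=frozenset(), separator: str = " ") -> str:
    """Toy fingerprint: lowercase -> fold -> stop removal -> sort + dedupe."""
    tokens = [ascii_fold(t.lower()) for t in text.split()]
    tokens = [t for t in tokens if t not in stopwords]
    # Sorting the unique tokens makes the result independent of word order.
    return separator.join(sorted(set(tokens)))


print(fingerprint("Café the cafe CAFE", stopwords={"the"}))
```

Folding before stop removal means an accented spelling of a stop word is also caught by the stop list, which is the deduping benefit the commit message refers to.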

Closes #18266
2016-05-12 09:34:15 -04:00
Clinton Gormley 97a41ee973 First pass at improving analyzer docs (#18269)
* Docs: First pass at improving analyzer docs

I've rewritten the intro to analyzers plus the docs
for all analyzers to provide working examples.

I've also removed:

* analyzer aliases (see #18244)
* analyzer versions (see #18267)
* snowball analyzer (see #8690)

Next steps will be tokenizers, token filters, char filters

* Fixed two typos
2016-05-11 14:17:56 +02:00
Clinton Gormley 3f594089c2 Renamed all AUTOSENSE snippets to CONSOLE (#18210) 2016-05-09 15:42:23 +02:00
Nik Everett 3912761572 [docs] Add wait_until_yellow to fix build failure
The snippet in the docs creates an index and uses it with the
_analyze api. The trouble is that if the index hasn't been created
fully the _analyze API will fail. This adds a
GET _cluster/health?wait_for_status=yellow
which fixes the issue.

While this does make the docs more cluttered, it also makes the snippets
actually runnable.

Closes #18165
2016-05-05 16:02:00 -04:00
Nik Everett 4b1c116461 Generate and run tests from the docs
Adds infrastructure so `gradle :docs:check` will extract tests from
snippets in the documentation and execute the tests. This is included
in `gradle check` so it should happen on CI and during a normal build.

By default each `// AUTOSENSE` snippet creates a unique REST test. These
tests are executed in a random order and the cluster is wiped between
each one. If multiple snippets chain together into a test you can annotate
all snippets after the first with `// TEST[continued]` to have the
generated tests for both snippets joined.

Snippets marked as `// TESTRESPONSE` are checked against the response
of the last action.
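
The grouping rule for chained snippets can be pictured with a toy model. The sketch below is not the Gradle implementation; it assumes each snippet is a `(body, continued)` pair, where `continued` stands for a `// TEST[continued]` annotation, and the function name is hypothetical.

```python
def group_snippets(snippets):
    """Group doc snippets into tests.

    Each snippet starts a new test unless it is marked as continued, in
    which case it is appended to the test of the preceding snippet.
    """
    tests = []
    for body, continued in snippets:
        if continued and tests:
            tests[-1].append(body)
        else:
            tests.append([body])
    return tests


example = [
    ("PUT my_index", False),
    ("GET my_index/_analyze", True),   # joined with the snippet before it
    ("GET _cluster/health", False),    # starts a separate test
]
print(group_snippets(example))
```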

See docs/README.asciidoc for lots more.

Closes #12583. That issue is about catching bugs in the docs during build.
This catches *some* bugs in the docs during build which is a good start.
2016-05-05 13:58:03 -04:00
Zachary Tong 80288ad60c Add `fingerprint` token filter and `fingerprint` analyzer
Adds a `fingerprint` token filter which uses Lucene's FingerprintFilter,
and a `fingerprint` analyzer that combines the Fingerprint filter with
lowercasing, stop word removal and asciifolding.

Closes #13325
2016-04-20 16:10:56 -04:00
Adrien Grand b42f66c8ac Document 5.0 mapping changes. 2016-03-22 16:22:58 +01:00
Adrien Grand f8e802c028 Merge pull request #15794 from damienalexandre/french-doc
[Doc] Fix french analyzer elision token filter doc
2016-01-06 18:39:26 +01:00
Damien Alexandre 23a64f8214 Fix french analyzer elision token filter doc
Fix #15774
2016-01-06 18:26:03 +01:00
Clinton Gormley 98028419a5 Merge pull request #14610 from yokotaso/patch-1
Update snowball document page.
2015-11-17 14:17:30 +01:00
Robert Muir 0d3e3f81fc Lithuanian analysis 2015-09-01 08:52:10 -04:00
xuzha fb2be6d6a1 The name "position_offset_gap" is confusing because Lucene has three
similar sounding things:

* Analyzer#getPositionIncrementGap
* Analyzer#getOffsetGap
* IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS and
* FieldType#storeTermVectorOffsets

Rename position_offset_gap to position_increment_gap
closes #13056
2015-08-26 14:56:35 -07:00
Nik Everett 4b9664beeb Mapping: Default position_offset_gap to 100
This is much more fiddly than you'd expect it to be because of the way
position_offset_gap is applied in StringFieldMapper. Instead of setting
the default to 100 it's simpler to make sure that all the analyzers default
to 100 and that StringFieldMapper doesn't override the default unless the
user specifies something different. Unless the index was created before
2.1, in which case the old default of 0 has to take.

Also position_offset_gaps less than 0 aren't allowed at all.

New tests test that:
1. the new default doesn't match phrases across values with reasonably low
slop (5)
2. the new default doesn't match phrases across values with reasonably high
slop (50)
3. you can override the value and phrases work as you'd expect
4. if you leave the value undefined in the mapping and define it on a
custom analyzer the value from the custom analyzer shines through
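
To see why a gap of 100 keeps phrases from matching across values, here is a simplified Python model of how token positions are assigned in a multi-valued field. It assumes whitespace tokenization and only handles in-order, two-term phrases; the function names are hypothetical and this is not the Lucene code.

```python
def token_positions(values, gap=100):
    """Assign positions to the tokens of a multi-valued field.

    After each non-empty value the position counter is advanced by `gap`,
    so the first token of the next value lands gap + 1 positions later.
    """
    positions = []
    position = -1
    for value in values:
        tokens = value.lower().split()
        for token in tokens:
            position += 1
            positions.append((token, position))
        if tokens:
            position += gap
    return positions


def phrase_matches(values, first, second, slop=0, gap=100):
    """True if `second` follows `first` with at most `slop` positions between."""
    positions = token_positions(values, gap)
    firsts = [p for t, p in positions if t == first]
    seconds = [p for t, p in positions if t == second]
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


names = ["John Abraham", "Lincoln Smith"]
for slop in (5, 50, 100):
    print(slop, phrase_matches(names, "abraham", "lincoln", slop=slop))
```

In this model "abraham" sits at position 1 and "lincoln" at position 102, so the phrase only matches once the slop reaches 100; with a gap of 0 the two values behave like one continuous text.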

Closes #7268
2015-08-25 14:21:50 -04:00
Britta Weber eeeb29f900 spell correct and add single quotes 2015-05-26 11:41:19 +02:00
Britta Weber 37782c1745 analyzers: custom analyzers names and aliases must not start with _
closes #9596
2015-05-26 11:38:15 +02:00
Clinton Gormley 3a69b65e88 Docs: Fixed the backslash escaping on the pattern analyzer docs
Closes #11099
2015-05-15 18:40:16 +02:00