Changes the titles of the tokenizer pages to sentence case. Also moves the 'Path hierarchy tokenizer examples' page into the 'Path hierarchy tokenizer' page and adds a corresponding redirect.
This commit is contained in:
parent eaa60b7c54
commit ab29162ab3
@@ -140,8 +140,6 @@ include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]
@@ -1,5 +1,8 @@
[[analysis-chargroup-tokenizer]]
=== Char Group Tokenizer
=== Character group tokenizer
++++
<titleabbrev>Character group</titleabbrev>
++++

The `char_group` tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where a simple
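As an illustrative aside (not part of this commit's diff), the `char_group` tokenizer can be tried out with the `_analyze` API by defining the tokenizer inline; the character set and sample text below are made-up values:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-"
    ]
  },
  "text": "The QUICK brown-fox"
}
--------------------------------------------------

Here the text is split on whitespace and on the hyphen character, producing `The`, `QUICK`, `brown`, and `fox`.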
@@ -1,5 +1,8 @@
[[analysis-classic-tokenizer]]
=== Classic Tokenizer
=== Classic tokenizer
++++
<titleabbrev>Classic</titleabbrev>
++++

The `classic` tokenizer is a grammar based tokenizer that is good for English
language documents. This tokenizer has heuristics for special treatment of
@@ -1,5 +1,8 @@
[[analysis-edgengram-tokenizer]]
=== Edge n-gram tokenizer
++++
<titleabbrev>Edge n-gram</titleabbrev>
++++

The `edge_ngram` tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then it emits
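For illustration (outside the diff above), an `edge_ngram` tokenizer can be exercised via `_analyze` with an inline definition; the gram sizes and sample text here are arbitrary choices:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 5,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "Quick Fox"
}
--------------------------------------------------

Each word is extracted first, then emitted as edge n-grams anchored at its start, for example `Qu`, `Qui`, `Quic`, `Quick`.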
@@ -1,5 +1,8 @@
[[analysis-keyword-tokenizer]]
=== Keyword Tokenizer
=== Keyword tokenizer
++++
<titleabbrev>Keyword</titleabbrev>
++++

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
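For illustration (outside the diff above), the `keyword` tokenizer's pass-through behaviour, and how it can be combined with token filters, can be seen with `_analyze`; the sample text is arbitrary:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "New York"
}
--------------------------------------------------

The whole input survives as a single term, here lowercased to `new york` by the filter.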
@@ -1,5 +1,8 @@
[[analysis-letter-tokenizer]]
=== Letter Tokenizer
=== Letter tokenizer
++++
<titleabbrev>Letter</titleabbrev>
++++

The `letter` tokenizer breaks text into terms whenever it encounters a
character which is not a letter. It does a reasonable job for most European
@@ -1,6 +1,8 @@
[[analysis-lowercase-tokenizer]]
=== Lowercase Tokenizer
=== Lowercase tokenizer
++++
<titleabbrev>Lowercase</titleabbrev>
++++

The `lowercase` tokenizer, like the
<<analysis-letter-tokenizer, `letter` tokenizer>> breaks text into terms
@@ -1,5 +1,8 @@
[[analysis-ngram-tokenizer]]
=== N-gram tokenizer
++++
<titleabbrev>N-gram</titleabbrev>
++++

The `ngram` tokenizer first breaks text down into words whenever it encounters
one of a list of specified characters, then it emits
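For illustration (outside the diff above), an `ngram` tokenizer can be defined inline in an `_analyze` request; the gram sizes and text below are arbitrary:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "Quick Fox"
}
--------------------------------------------------

This emits sliding-window grams of length 2 and 3 from each word, for example `Qu`, `Qui`, `ui`, `uic`, and so on.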
@@ -1,183 +0,0 @@
[[analysis-pathhierarchy-tokenizer-examples]]
=== Path Hierarchy Tokenizer Examples

A common use-case for the `path_hierarchy` tokenizer is filtering results by
file paths. If indexing a file path along with the data, the use of the
`path_hierarchy` tokenizer to analyze the path allows filtering the results
by different parts of the file path string.


This example configures an index to have two custom analyzers and applies
those analyzers to multifields of the `file_path` text field that will
store filenames. One of the two analyzers uses reverse tokenization.
Some sample documents are then indexed to represent some file paths
for photos inside photo folders of two different users.


[source,console]
--------------------------------------------------
PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------
// TESTSETUP


A search for a particular file path string against the text field matches all
the example documents, with Bob's documents ranking highest due to `bob` also
being one of the terms created by the standard analyzer boosting relevance for
Bob's documents.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}
--------------------------------------------------

It's simple to match or filter documents with file paths that exist within a
particular directory using the `file_path.tree` field.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}
--------------------------------------------------

With the reverse parameter for this tokenizer, it's also possible to match
from the other end of the file path, such as individual file names or a deep
level subdirectory. The following example shows a search for all files named
`my_photo1.jpg` within any directory via the `file_path.tree_reversed` field
configured to use the reverse parameter in the mapping.


[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}
--------------------------------------------------

Viewing the tokens generated with both forward and reverse is instructive
in showing the tokens created for the same file path value.


[source,console]
--------------------------------------------------
POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------


It's also useful to be able to filter with file paths when combined with other
types of searches, such as this example looking for any files paths with `16`
that also must be in Alice's photo directory.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}
--------------------------------------------------
@@ -1,5 +1,8 @@
[[analysis-pathhierarchy-tokenizer]]
=== Path Hierarchy Tokenizer
=== Path hierarchy tokenizer
++++
<titleabbrev>Path hierarchy</titleabbrev>
++++

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
@@ -167,6 +170,191 @@ If we were to set `reverse` to `true`, it would produce the following:
[ one/two/three/, two/three/, three/ ]
---------------------------

[float]
=== Detailed Examples
See <<analysis-pathhierarchy-tokenizer-examples, detailed examples here>>.
[discrete]
[[analysis-pathhierarchy-tokenizer-detailed-examples]]
=== Detailed examples

A common use case for the `path_hierarchy` tokenizer is filtering results by
file paths. If a file path is indexed along with the data, using the
`path_hierarchy` tokenizer to analyze the path allows the results to be
filtered by different parts of the file path string.


This example configures an index to have two custom analyzers and applies
those analyzers to multi-fields of the `file_path` text field that will
store filenames. One of the two analyzers uses reverse tokenization.
Some sample documents are then indexed to represent some file paths
for photos inside photo folders of two different users.


[source,console]
--------------------------------------------------
PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------


A search for a particular file path string against the text field matches all
the example documents. Bob's documents rank highest because `bob` is also one
of the terms created by the standard analyzer, which boosts their relevance.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}
--------------------------------------------------
// TEST[continued]

It's simple to match or filter documents with file paths that exist within a
particular directory using the `file_path.tree` field.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}
--------------------------------------------------
// TEST[continued]

With the `reverse` parameter for this tokenizer, it's also possible to match
from the other end of the file path, such as individual file names or a deeply
nested subdirectory. The following example shows a search for all files named
`my_photo1.jpg` in any directory via the `file_path.tree_reversed` field,
which is configured to use the `reverse` parameter in the mapping.


[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]

Viewing the tokens generated by both the forward and reverse analyzers shows
how the same file path value is tokenized.


[source,console]
--------------------------------------------------
POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------
// TEST[continued]


Filtering by file path is also useful in combination with other types of
searches, such as this example that looks for any file paths containing `16`
that must also be in Alice's photo directory.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]
@@ -1,5 +1,8 @@
[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer
=== Pattern tokenizer
++++
<titleabbrev>Pattern</titleabbrev>
++++

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
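For illustration (outside the diff above), a `pattern` tokenizer splitting on commas can be defined inline in `_analyze`; the pattern and text are example values:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "comma,separated,values"
}
--------------------------------------------------

With the default behaviour of splitting on matches, this yields the terms `comma`, `separated`, and `values`.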
@@ -1,5 +1,8 @@
[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer
=== Simple pattern tokenizer
++++
<titleabbrev>Simple pattern</titleabbrev>
++++

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. The set of regular expression features it supports is more
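For illustration (outside the diff above), a `simple_pattern` tokenizer that captures runs of three digits could be tried as follows; the pattern and text are example values:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern",
    "pattern": "[0123456789]{3}"
  },
  "text": "fd-786-335-514-x"
}
--------------------------------------------------

Unlike the `pattern` tokenizer, the matches themselves become the terms, here `786`, `335`, and `514`.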
@@ -1,5 +1,8 @@
[[analysis-simplepatternsplit-tokenizer]]
=== Simple Pattern Split Tokenizer
=== Simple pattern split tokenizer
++++
<titleabbrev>Simple pattern split</titleabbrev>
++++

The `simple_pattern_split` tokenizer uses a regular expression to split the
input into terms at pattern matches. The set of regular expression features it
@@ -1,5 +1,8 @@
[[analysis-standard-tokenizer]]
=== Standard Tokenizer
=== Standard tokenizer
++++
<titleabbrev>Standard</titleabbrev>
++++

The `standard` tokenizer provides grammar based tokenization (based on the
Unicode Text Segmentation algorithm, as specified in
@@ -1,5 +1,8 @@
[[analysis-thai-tokenizer]]
=== Thai Tokenizer
=== Thai tokenizer
++++
<titleabbrev>Thai</titleabbrev>
++++

The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages in general
@@ -1,5 +1,8 @@
[[analysis-uaxurlemail-tokenizer]]
=== UAX URL Email Tokenizer
=== UAX URL email tokenizer
++++
<titleabbrev>UAX URL email</titleabbrev>
++++

The `uax_url_email` tokenizer is like the <<analysis-standard-tokenizer,`standard` tokenizer>> except that it
recognises URLs and email addresses as single tokens.
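For illustration (outside the diff above), the difference from the `standard` tokenizer shows up when analyzing text that contains an address; the sample text is made up:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
--------------------------------------------------

The email address is kept as a single term, whereas the `standard` tokenizer would break it into several.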
@@ -1,5 +1,8 @@
[[analysis-whitespace-tokenizer]]
=== Whitespace Tokenizer
=== Whitespace tokenizer
++++
<titleabbrev>Whitespace</titleabbrev>
++++

The `whitespace` tokenizer breaks text into terms whenever it encounters a
whitespace character.
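For illustration (outside the diff above), the `whitespace` tokenizer can be exercised directly with `_analyze`; the sample text is arbitrary:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes"
}
--------------------------------------------------

The text splits only on whitespace, so `Brown-Foxes` remains a single term and case is preserved.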
@@ -886,6 +886,10 @@ See <<ilm-existing-indices-apply>>.

See <<ilm-existing-indices-reindex>>.

[role="exclude",id="analysis-pathhierarchy-tokenizer-examples"]
=== Path hierarchy tokenizer examples

See <<analysis-pathhierarchy-tokenizer-detailed-examples>>.

////
[role="exclude",id="search-request-body"]