[[analysis-pathhierarchy-tokenizer]]
=== Path Hierarchy Tokenizer
The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits one term for each level of the
tree. Each term is the path from the root down to that level.
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"tokenizer": "path_hierarchy",
"text": "/one/two/three"
}
---------------------------
// CONSOLE
/////////////////////
[source,js]
----------------------------
{
"tokens": [
{
"token": "/one",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "/one/two",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "/one/two/three",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
----------------------------
// TESTRESPONSE
/////////////////////
The above text would produce the following terms:
[source,text]
---------------------------
[ /one, /one/two, /one/two/three ]
---------------------------
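The prefix behavior shown above can be modeled in a few lines of Python. This is a minimal sketch for illustration only, not the tokenizer's actual implementation: the function name `path_prefix_tokens` is hypothetical, it covers only the default settings, and its handling of edge cases (such as repeated or trailing delimiters) is an assumption.

```python
def path_prefix_tokens(text, delimiter="/"):
    """Return (term, start_offset, end_offset) tuples for each path prefix.

    Illustrative model only; this is not the Lucene implementation.
    """
    tokens = []
    for i, ch in enumerate(text):
        # A delimiter past position 0 closes the prefix that ends just before it.
        if ch == delimiter and i > 0:
            tokens.append((text[:i], 0, i))
    if text:
        # The full text is always the last and longest term.
        tokens.append((text, 0, len(text)))
    return tokens


if __name__ == "__main__":
    for token in path_prefix_tokens("/one/two/three"):
        print(token)
```

Every term starts at offset 0 and ends where its prefix ends, which is why the end offsets grow with each term.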
[float]
=== Configuration
The `path_hierarchy` tokenizer accepts the following parameters:
[horizontal]
`delimiter`::
The character to use as the path separator. Defaults to `/`.
`replacement`::
An optional character that replaces the delimiter in the emitted terms.
Defaults to the `delimiter`.
`buffer_size`::
The number of characters read into the term buffer in a single pass.
Defaults to `1024`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.
`reverse`::
If set to `true`, emits the tokens in reverse order, building each term from
the end of the path rather than from the start. Defaults to `false`.
`skip`::
The number of initial tokens to skip. Defaults to `0`.
[float]
=== Example configuration
In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:
[source,js]
----------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-",
"replacement": "/",
"skip": 2
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "one-two-three-four-five"
}
----------------------------
// CONSOLE
/////////////////////
[source,js]
----------------------------
{
"tokens": [
{
"token": "/three",
"start_offset": 7,
"end_offset": 13,
"type": "word",
"position": 0
},
{
"token": "/three/four",
"start_offset": 7,
"end_offset": 18,
"type": "word",
"position": 0
},
{
"token": "/three/four/five",
"start_offset": 7,
"end_offset": 23,
"type": "word",
"position": 0
}
]
}
----------------------------
// TESTRESPONSE
/////////////////////
The above example produces the following terms:
[source,text]
---------------------------
[ /three, /three/four, /three/four/five ]
---------------------------
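To see how `delimiter`, `replacement`, and `skip` interact, the following Python sketch reproduces the terms above. It is a simplified model under stated assumptions, not the tokenizer's real code: the name `path_hierarchy_terms` is hypothetical, and the treatment of a component as "a delimiter plus the text that follows it" is an assumption chosen to match the documented output.

```python
def path_hierarchy_terms(text, delimiter="/", replacement=None, skip=0):
    """Model the terms emitted with `reverse` left at its default of false.

    Illustrative only; edge cases may differ from the real tokenizer.
    """
    if replacement is None:
        replacement = delimiter  # mirrors the documented default

    parts = text.split(delimiter)
    components = []
    for index, part in enumerate(parts):
        if index == 0:
            # Text before the first delimiter has no leading separator.
            if part:
                components.append(part)
        else:
            # Each later component carries the replaced delimiter in front.
            components.append(replacement + part)

    # `skip` drops the leading components before any term is built.
    components = components[skip:]

    terms = []
    prefix = ""
    for component in components:
        prefix += component
        terms.append(prefix)
    return terms


if __name__ == "__main__":
    print(path_hierarchy_terms("one-two-three-four-five",
                               delimiter="-", replacement="/", skip=2))
```

Because `one` and `-two` are skipped, the first surviving component is `-three`, which is why every emitted term starts with the replacement character.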
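If we were to set `reverse` to `true` and leave the rest of the configuration
unchanged, the same text would produce the following terms. Note that `skip`
then drops components from the end of the path: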
[source,text]
---------------------------
[ one/two/three/, two/three/, three/ ]
---------------------------
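The reversed output can be modeled the same way. The sketch below is an assumption-laden illustration rather than the real implementation: the function name `reverse_path_hierarchy_terms` is hypothetical, and it assumes that each component keeps its trailing delimiter and that `skip` removes components from the end, which is what the terms above suggest.

```python
def reverse_path_hierarchy_terms(text, delimiter="/", replacement=None, skip=0):
    """Model the terms emitted when `reverse` is true.

    Illustrative only; edge cases may differ from the real tokenizer.
    """
    if replacement is None:
        replacement = delimiter  # mirrors the documented default

    parts = text.split(delimiter)
    # Every component except the last keeps the (replaced) delimiter after it.
    components = [part + replacement for part in parts[:-1]]
    if parts[-1]:
        components.append(parts[-1])

    # Assumption: in reverse mode, `skip` drops components from the end.
    keep = max(len(components) - skip, 0)
    components = components[:keep]

    # Each term is a suffix of the remaining path, longest first.
    return ["".join(components[i:]) for i in range(len(components))]


if __name__ == "__main__":
    print(reverse_path_hierarchy_terms("one-two-three-four-five",
                                       delimiter="-", replacement="/", skip=2))
```

With `skip` set to `2`, the components `four/` and `five` are dropped, so the remaining terms all end in the replacement character.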