[[analysis-pathhierarchy-tokenizer]]
=== Path hierarchy tokenizer
++++
<titleabbrev>Path hierarchy</titleabbrev>
++++

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/one",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two/three",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above text would produce the following terms:

[source,text]
---------------------------
[ /one, /one/two, /one/two/three ]
---------------------------

[float]
=== Configuration

The `path_hierarchy` tokenizer accepts the following parameters:

[horizontal]
`delimiter`::

    The character to use as the path separator. Defaults to `/`.

`replacement`::

    An optional replacement character to use for the delimiter.
    Defaults to the `delimiter`.

`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `1024`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.

`reverse`::

    If set to `true`, emits the tokens in reverse order. Defaults to `false`.

`skip`::

    The number of initial tokens to skip. Defaults to `0`.

[float]
=== Example configuration

In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/three",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four",
      "start_offset": 7,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four/five",
      "start_offset": 7,
      "end_offset": 23,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ /three, /three/four, /three/four/five ]
---------------------------

If we were to set `reverse` to `true`, it would produce the following:

[source,text]
---------------------------
[ one/two/three/, two/three/, three/ ]
---------------------------

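A sketch of the corresponding request, reusing the tokenizer settings from the
example above with `reverse` enabled; the index name `my_index_reversed` is
illustrative only:

[source,console]
----------------------------
PUT my_index_reversed
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2,
          "reverse": true
        }
      }
    }
  }
}
----------------------------
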
[discrete]
[[analysis-pathhierarchy-tokenizer-detailed-examples]]
=== Detailed examples

A common use-case for the `path_hierarchy` tokenizer is filtering results by
file paths. If a file path is indexed along with the data, analyzing the path
with the `path_hierarchy` tokenizer allows filtering the results by different
parts of the file path string.

This example configures an index with two custom analyzers and applies them to
multifields of the `file_path` text field that will store filenames. One of
the two analyzers uses reverse tokenization. Some sample documents are then
indexed to represent file paths for photos inside the photo folders of two
different users.

[source,console]
--------------------------------------------------
PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------

A search for a particular file path string against the text field matches all
the example documents, with Bob's documents ranking highest: `bob` is also one
of the terms created by the standard analyzer, which boosts the relevance of
Bob's documents.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}
--------------------------------------------------
// TEST[continued]

It's simple to match or filter documents with file paths that exist within a
particular directory using the `file_path.tree` field.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}
--------------------------------------------------
// TEST[continued]

With the `reverse` parameter for this tokenizer, it's also possible to match
from the other end of the file path, such as individual file names or a deeply
nested subdirectory. The following example shows a search for all files named
`my_photo1.jpg` in any directory via the `file_path.tree_reversed` field,
which is configured to use the reverse parameter in the mapping.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]

Viewing the tokens generated by both the forward and reverse analyzers is
instructive in showing the terms each one creates for the same file path
value.

[source,console]
--------------------------------------------------
POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------
// TEST[continued]

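Following the prefix pattern shown in the earlier examples, the first request
(the forward `custom_path_tree` analyzer) should produce terms along these
lines:

[source,text]
---------------------------
[ /User, /User/alice, /User/alice/photos, /User/alice/photos/2017, /User/alice/photos/2017/05, /User/alice/photos/2017/05/16, /User/alice/photos/2017/05/16/my_photo1.jpg ]
---------------------------
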
It's also useful to be able to filter on file paths when combining them with
other types of searches, such as this example looking for any file paths
containing `16` that must also be in Alice's photo directory.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]