OpenSearch/docs/reference/analysis/tokenizers/uaxurlemail-tokenizer.asciidoc
James Rodewig ab29162ab3
[DOCS] Fix tokenizer page titles () ()
Changes the titles for tokenizer pages to sentence case.

Also moves the 'Path hierarchy tokenizer examples' page within the
'Path hierarchy tokenizer' page and adds a related redirect.
2020-06-26 09:24:41 -04:00


[[analysis-uaxurlemail-tokenizer]]
=== UAX URL email tokenizer
++++
<titleabbrev>UAX URL email</titleabbrev>
++++
The `uax_url_email` tokenizer is like the <<analysis-standard-tokenizer,`standard` tokenizer>>, except that it
recognizes URLs and email addresses as single tokens.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
---------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Email",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "me",
      "start_offset": 6,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "at",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "john.smith@global-international.com",
      "start_offset": 12,
      "end_offset": 47,
      "type": "<EMAIL>",
      "position": 3
    }
  ]
}
----------------------------
/////////////////////
The above sentence would produce the following terms:

[source,text]
---------------------------
[ Email, me, at, john.smith@global-international.com ]
---------------------------

while the `standard` tokenizer would produce:

[source,text]
---------------------------
[ Email, me, at, john.smith, global, international.com ]
---------------------------

[float]
=== Configuration

The `uax_url_email` tokenizer accepts the following parameters:

[horizontal]
`max_token_length`::
    The maximum token length. If a token exceeds this length, it is
    split at `max_token_length` intervals. Defaults to `255`.

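
As a rough illustration of the splitting rule, the following Python sketch cuts an over-long token at fixed intervals. It is a simplified model, not the tokenizer's implementation, and the function name `split_at_intervals` is hypothetical. The real tokenizer also applies its word-boundary rules while splitting, so its output can differ from plain chunking: in the example configuration on this page, `john.smith` comes out as `john` and `smith`, not `john.` and `smith`.

```python
def split_at_intervals(token: str, max_token_length: int = 255) -> list:
    """Cut `token` into consecutive pieces of at most `max_token_length` characters.

    Hypothetical helper for illustration only; it models the documented
    "split at max_token_length intervals" rule and nothing else.
    """
    if max_token_length < 1:
        raise ValueError("max_token_length must be positive")
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


# With a limit of 5, a 6-character token yields a 5-character piece
# followed by a 1-character remainder.
print(split_at_intervals("global", 5))             # ['globa', 'l']
print(split_at_intervals("international.com", 5))  # ['inter', 'natio', 'nal.c', 'om']
```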
[float]
=== Example configuration

In this example, we configure the `uax_url_email` tokenizer to have a
`max_token_length` of 5 (for demonstration purposes):

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john.smith@global-international.com"
}
----------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "globa",
      "start_offset": 11,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "l",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "inter",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "natio",
      "start_offset": 23,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "nal.c",
      "start_offset": 28,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "om",
      "start_offset": 33,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
----------------------------
/////////////////////
With `max_token_length` set to 5, the address is no longer emitted as a single
`<EMAIL>` token. The above example produces the following terms:

[source,text]
---------------------------
[ john, smith, globa, l, inter, natio, nal.c, om ]
---------------------------
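
Each token in the `_analyze` response carries `start_offset` and `end_offset` values that index into the original text. As a minimal sketch (Python, with the token values copied from the response to the request above; the helper name `offsets_match` is made up for illustration), the following checks that slicing the text by those offsets gives back each term:

```python
# Text sent to the _analyze API in the example configuration.
text = "john.smith@global-international.com"

# (token, start_offset, end_offset) triples taken from the example response.
tokens = [
    ("john", 0, 4),
    ("smith", 5, 10),
    ("globa", 11, 16),
    ("l", 16, 17),
    ("inter", 18, 23),
    ("natio", 23, 28),
    ("nal.c", 28, 33),
    ("om", 33, 35),
]


def offsets_match(source: str, triples) -> bool:
    """Return True if every token equals source[start_offset:end_offset]."""
    return all(source[start:end] == token for token, start, end in triples)


print(offsets_match(text, tokens))  # True
```

The characters not covered by any token (`.` at offset 4, `@` at offset 10 and `-` at offset 17) are the separators the tokenizer dropped in this run.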