[DOCS] Reformat `pattern_replace` token filter (#57699) (#57995)

Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Adds parameter definitions
* Adds custom analyzer example
James Rodewig 2020-06-11 12:19:38 -04:00 committed by GitHub
parent 85b0b540f0
commit c36df27730
1 changed files with 148 additions and 14 deletions

@@ -4,23 +4,157 @@
 <titleabbrev>Pattern replace</titleabbrev>
 ++++
-The `pattern_replace` token filter allows to easily handle string
-replacements based on a regular expression. The regular expression is
-defined using the `pattern` parameter, and the replacement string can be
-provided using the `replacement` parameter (supporting referencing the
-original text, as explained
-http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here]).
-Regular expressions cannot be anchored to the
-beginning or end of a token.
+Uses a regular expression to match and replace token substrings.
+The `pattern_replace` filter uses
+http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
+regular expression syntax]. By default, the filter replaces matching
+substrings with an empty substring (`""`).
+Replacement substrings can use Java's
+https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-[`$g` syntax] to reference capture groups
+from the original token text.
 [WARNING]
-.Beware of Pathological Regular Expressions
-========================================
-The pattern replace token filter uses
-http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].
-A badly written regular expression could run very slowly or even throw a
-StackOverflowError and cause the node it is running on to exit suddenly.
-Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].
-========================================
+====
+A poorly-written regular expression may run slowly or return a
+StackOverflowError, causing the node running the expression to exit suddenly.
+Read more about
+http://www.regular-expressions.info/catastrophic.html[pathological regular
+expressions and how to avoid them].
+====
+This filter uses Lucene's
+{lucene-analysis-docs}/pattern/PatternReplaceFilter.html[PatternReplaceFilter].
+[[analysis-pattern-replace-tokenfilter-analyze-ex]]
+==== Example
+The following <<indices-analyze,analyze API>> request uses the `pattern_replace`
+filter to prepend `watch` to the substring `dog` in `foxes jump lazy dogs`.
[source,console]
----
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "pattern_replace",
"pattern": "(dog)",
"replacement": "watch$1"
}
],
"text": "foxes jump lazy dogs"
}
----
The filter produces the following tokens.
[source,text]
----
[ foxes, jump, lazy, watchdogs ]
----
////
[source,console-result]
----
{
"tokens": [
{
"token": "foxes",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 6,
"end_offset": 10,
"type": "word",
"position": 1
},
{
"token": "lazy",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "watchdogs",
"start_offset": 16,
"end_offset": 20,
"type": "word",
"position": 3
}
]
}
----
////
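As a rough illustration of what the analyze request above does to each token, the following Python sketch splits the text on whitespace and applies the replacement token by token. It is an approximation, not the Lucene implementation: the `analyze` helper is hypothetical, and Python's `re` module writes group references as `\1` where Java writes `$1`.

```python
import re


def analyze(text, pattern, replacement):
    """Approximate the request: whitespace tokenizer, then per-token replace."""
    # str.split() with no arguments splits on runs of whitespace,
    # which stands in for the `whitespace` tokenizer here.
    tokens = text.split()
    # Assumption: Python's \1 plays the role of Java's $1 group reference.
    return [re.sub(pattern, replacement, token) for token in tokens]


tokens = analyze("foxes jump lazy dogs", r"(dog)", r"watch\1")
print(tokens)  # ['foxes', 'jump', 'lazy', 'watchdogs']
```

Only the token text changes. In the API response the offsets of `watchdogs` still point at the original `dogs` span (16 to 20).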
[[analysis-pattern-replace-tokenfilter-configure-parms]]
==== Configurable parameters
`all`::
(Optional, boolean)
If `true`, all substrings matching the `pattern` parameter's regular expression
are replaced. If `false`, the filter replaces only the first matching substring
in each token. Defaults to `true`.
`pattern`::
(Required, string)
Regular expression, written in
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
regular expression syntax]. The filter replaces token substrings matching this
pattern with the substring in the `replacement` parameter.
`replacement`::
(Optional, string)
Replacement substring. Defaults to an empty substring (`""`).
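The interaction of the three parameters can be sketched in Python. This is a simplified model under stated assumptions: the `pattern_replace` function and its `replace_all` argument are hypothetical names standing in for the filter and its `all` parameter, and Python's `re` syntax is close to, but not identical to, Java's.

```python
import re


def pattern_replace(token, pattern, replacement="", replace_all=True):
    """Model of the filter applied to a single token.

    replacement defaults to "" and replace_all defaults to True,
    mirroring the documented defaults of `replacement` and `all`.
    """
    # In re.sub, count=0 replaces every match; count=1 stops after the first.
    count = 0 if replace_all else 1
    return re.sub(pattern, replacement, token, count=count)


print(pattern_replace("a-b-c", "-"))                     # abc
print(pattern_replace("a-b-c", "-", replace_all=False))  # ab-c
print(pattern_replace("a-b-c", "-", "+"))                # a+b+c
```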
[[analysis-pattern-replace-tokenfilter-customize]]
==== Customize and add to an analyzer
To customize the `pattern_replace` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
The following <<indices-create-index,create index API>> request
configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
`pattern_replace` filter, `my_pattern_replace_filter`.
The `my_pattern_replace_filter` filter uses the regular expression `[£|€]` to
match and remove the currency symbols `£` and `€`. The filter's `all`
parameter is `false`, meaning only the first matching symbol in each token is
removed.
[source,console]
----
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"filter": [
"my_pattern_replace_filter"
]
}
},
"filter": {
"my_pattern_replace_filter": {
"type": "pattern_replace",
"pattern": "[£|€]",
"replacement": "",
"all": false
}
}
}
}
}
----
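To see what this configuration would do to an input string, here is a small Python approximation. It assumes the `keyword` tokenizer emits the whole input as a single token, so the filter sees the full string once. The `remove_first_symbol` helper is hypothetical, and Python's `re` stands in for Java's regex engine.

```python
import re

# Same pattern as in the request above. Inside a character class, `|` is a
# literal character, not alternation, so this class also matches `|`.
PATTERN = r"[£|€]"


def remove_first_symbol(token):
    """Model `all: false` with an empty replacement: drop the first match only."""
    return re.sub(PATTERN, "", token, count=1)


print(remove_first_symbol("£10 or €12"))  # 10 or €12
print(remove_first_symbol("a|b"))         # ab
```

If only the two currency symbols should match, the character class `[£€]` would be enough.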