Changes: * Rewrites description and adds Lucene link * Adds analyze example * Adds parameter definitions * Adds custom analyzer example
This commit is contained in:
parent
85b0b540f0
commit
c36df27730
|
@ -4,23 +4,157 @@
|
||||||
<titleabbrev>Pattern replace</titleabbrev>
|
<titleabbrev>Pattern replace</titleabbrev>
|
||||||
++++
|
++++
|
||||||
|
|
||||||
The `pattern_replace` token filter allows to easily handle string
|
Uses a regular expression to match and replace token substrings.
|
||||||
replacements based on a regular expression. The regular expression is
|
|
||||||
defined using the `pattern` parameter, and the replacement string can be
|
The `pattern_replace` filter uses
|
||||||
provided using the `replacement` parameter (supporting referencing the
|
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
|
||||||
original text, as explained
|
regular expression syntax]. By default, the filter replaces matching
|
||||||
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here]).
|
substrings with an empty substring (`""`).
|
||||||
|
|
||||||
|
Regular expressions cannot be anchored to the
|
||||||
|
beginning or end of a token. Replacement substrings can use Java's
|
||||||
|
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-[`$g` syntax] to reference capture groups
|
||||||
|
from the original token text.
|
||||||
|
|
||||||
[WARNING]
|
[WARNING]
|
||||||
.Beware of Pathological Regular Expressions
|
====
|
||||||
========================================
|
A poorly-written regular expression may run slowly or return a
|
||||||
|
StackOverflowError, causing the node running the expression to exit suddenly.
|
||||||
|
|
||||||
The pattern replace token filter uses
|
Read more about
|
||||||
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].
|
http://www.regular-expressions.info/catastrophic.html[pathological regular
|
||||||
|
expressions and how to avoid them].
|
||||||
|
====
|
||||||
|
|
||||||
A badly written regular expression could run very slowly or even throw a
|
This filter uses Lucene's
|
||||||
StackOverflowError and cause the node it is running on to exit suddenly.
|
{lucene-analysis-docs}/pattern/PatternReplaceFilter.html[PatternReplaceFilter].
|
||||||
|
|
||||||
Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].
|
[[analysis-pattern-replace-tokenfilter-analyze-ex]]
|
||||||
|
==== Example
|
||||||
|
|
||||||
========================================
|
The following <<indices-analyze,analyze API>> request uses the `pattern_replace`
|
||||||
|
filter to prepend `watch` to the substring `dog` in `foxes jump lazy dogs`.
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
----
|
||||||
|
GET /_analyze
|
||||||
|
{
|
||||||
|
"tokenizer": "whitespace",
|
||||||
|
"filter": [
|
||||||
|
{
|
||||||
|
"type": "pattern_replace",
|
||||||
|
"pattern": "(dog)",
|
||||||
|
"replacement": "watch$1"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"text": "foxes jump lazy dogs"
|
||||||
|
}
|
||||||
|
----
|
||||||
|
|
||||||
|
The filter produces the following tokens.
|
||||||
|
|
||||||
|
[source,text]
|
||||||
|
----
|
||||||
|
[ foxes, jump, lazy, watchdogs ]
|
||||||
|
----
|
||||||
|
|
||||||
|
////
|
||||||
|
[source,console-result]
|
||||||
|
----
|
||||||
|
{
|
||||||
|
"tokens": [
|
||||||
|
{
|
||||||
|
"token": "foxes",
|
||||||
|
"start_offset": 0,
|
||||||
|
"end_offset": 5,
|
||||||
|
"type": "word",
|
||||||
|
"position": 0
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token": "jump",
|
||||||
|
"start_offset": 6,
|
||||||
|
"end_offset": 10,
|
||||||
|
"type": "word",
|
||||||
|
"position": 1
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token": "lazy",
|
||||||
|
"start_offset": 11,
|
||||||
|
"end_offset": 15,
|
||||||
|
"type": "word",
|
||||||
|
"position": 2
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token": "watchdogs",
|
||||||
|
"start_offset": 16,
|
||||||
|
"end_offset": 20,
|
||||||
|
"type": "word",
|
||||||
|
"position": 3
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
----
|
||||||
|
////
|
||||||
|
|
||||||
|
[[analysis-pattern-replace-tokenfilter-configure-parms]]
|
||||||
|
==== Configurable parameters
|
||||||
|
|
||||||
|
`all`::
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, all substrings matching the `pattern` parameter's regular expression
|
||||||
|
are replaced. If `false`, the filter replaces only the first matching substring
|
||||||
|
in each token. Defaults to `true`.
|
||||||
|
|
||||||
|
`pattern`::
|
||||||
|
(Required, string)
|
||||||
|
Regular expression, written in
|
||||||
|
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
|
||||||
|
regular expression syntax]. The filter replaces token substrings matching this
|
||||||
|
pattern with the substring in the `replacement` parameter.
|
||||||
|
|
||||||
|
`replacement`::
|
||||||
|
(Optional, string)
|
||||||
|
Replacement substring. Defaults to an empty substring (`""`).
|
||||||
|
|
||||||
|
[[analysis-pattern-replace-tokenfilter-customize]]
|
||||||
|
==== Customize and add to an analyzer
|
||||||
|
|
||||||
|
To customize the `pattern_replace` filter, duplicate it to create the basis
|
||||||
|
for a new custom token filter. You can modify the filter using its configurable
|
||||||
|
parameters.
|
||||||
|
|
||||||
|
The following <<indices-create-index,create index API>> request
|
||||||
|
configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
|
||||||
|
`pattern_replace` filter, `my_pattern_replace_filter`.
|
||||||
|
|
||||||
|
The `my_pattern_replace_filter` filter uses the regular expression `[£|€]` to
|
||||||
|
match and remove the currency symbols `£` and `€`. The filter's `all`
|
||||||
|
parameter is `false`, meaning only the first matching symbol in each token is
|
||||||
|
removed.
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
----
|
||||||
|
PUT /my_index
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"analysis": {
|
||||||
|
"analyzer": {
|
||||||
|
"my_analyzer": {
|
||||||
|
"tokenizer": "keyword",
|
||||||
|
"filter": [
|
||||||
|
"my_pattern_replace_filter"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filter": {
|
||||||
|
"my_pattern_replace_filter": {
|
||||||
|
"type": "pattern_replace",
|
||||||
|
"pattern": "[£|€]",
|
||||||
|
"replacement": "",
|
||||||
|
"all": false
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
----
|
Loading…
Reference in New Issue