[DOCS] Rewrite `regexp` query (#42711)
parent bfb2e323e9
commit ad7c164dd0
|
@ -205,6 +205,7 @@ specific index module:
|
|||
The maximum number of terms that can be used in Terms Query.
|
||||
Defaults to `65536`.
|
||||
|
||||
[[index-max-regex-length]]
|
||||
`index.max_regex_length`::
|
||||
|
||||
The maximum length of regex that can be used in Regexp Query.
|
||||
|
|
|
@ -48,3 +48,5 @@ include::query-dsl/term-level-queries.asciidoc[]
|
|||
include::query-dsl/minimum-should-match.asciidoc[]
|
||||
|
||||
include::query-dsl/multi-term-rewrite.asciidoc[]
|
||||
|
||||
include::query-dsl/regexp-syntax.asciidoc[]
|
|
@ -4,98 +4,86 @@
|
|||
<titleabbrev>Regexp</titleabbrev>
|
||||
++++
|
||||
|
||||
The `regexp` query allows you to use regular expression term queries.
|
||||
See <<regexp-syntax>> for details of the supported regular expression language.
|
||||
The "term queries" in that first sentence means that Elasticsearch will apply
|
||||
the regexp to the terms produced by the tokenizer for that field, and not
|
||||
to the original text of the field.
|
||||
Returns documents that contain terms matching a
|
||||
https://en.wikipedia.org/wiki/Regular_expression[regular expression].
|
||||
|
||||
*Note*: The performance of a `regexp` query heavily depends on the
|
||||
regular expression chosen. Matching everything like `.*` is very slow as
|
||||
well as using lookaround regular expressions. If possible, you should
|
||||
try to use a long prefix before your regular expression starts. Wildcard
|
||||
matchers like `.*?+` will mostly lower performance.
|
||||
A regular expression is a way to match patterns in data using placeholder
|
||||
characters, called operators. For a list of operators supported by the
|
||||
`regexp` query, see <<regexp-syntax, Regular expression syntax>>.
|
||||
|
||||
[[regexp-query-ex-request]]
|
||||
==== Example request
|
||||
|
||||
The following search returns documents where the `user` field contains any term
|
||||
that begins with `k` and ends with `y`. The `.*` operators match any
|
||||
characters of any length, including no characters. Matching
|
||||
terms can include `ky`, `kay`, and `kimchy`.
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
----
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"regexp":{
|
||||
"name.first": "s.*y"
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
// CONSOLE
|
||||
|
||||
Boosting is also supported
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"regexp":{
|
||||
"name.first":{
|
||||
"value":"s.*y",
|
||||
"boost":1.2
|
||||
"regexp": {
|
||||
"user": {
|
||||
"value": "k.*y",
|
||||
"flags" : "ALL",
|
||||
"max_determinized_states": 10000,
|
||||
"rewrite": "constant_score"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
----
|
||||
// CONSOLE
|
||||
|
||||
You can also use special flags
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"regexp":{
|
||||
"name.first": {
|
||||
"value": "s.*y",
|
||||
"flags" : "INTERSECTION|COMPLEMENT|EMPTY"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
// CONSOLE
|
||||
[[regexp-top-level-params]]
|
||||
==== Top-level parameters for `regexp`
|
||||
`<field>`::
|
||||
(Required, object) Field you wish to search.
|
||||
|
||||
Possible flags are `ALL` (default), `ANYSTRING`, `COMPLEMENT`,
|
||||
`EMPTY`, `INTERSECTION`, `INTERVAL`, or `NONE`. Please check the
|
||||
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html[Lucene
|
||||
documentation] for their meaning
|
||||
[[regexp-query-field-params]]
|
||||
==== Parameters for `<field>`
|
||||
`value`::
|
||||
(Required, string) Regular expression for terms you wish to find in the provided
|
||||
`<field>`. For a list of supported operators, see <<regexp-syntax, Regular
|
||||
expression syntax>>.
|
||||
+
|
||||
--
|
||||
By default, regular expressions are limited to 1,000 characters. You can change
|
||||
this limit using the <<index-max-regex-length, `index.max_regex_length`>>
|
||||
setting.
|
||||
|
||||
Regular expressions are dangerous because it's easy to accidentally
|
||||
create an innocuous looking one that requires an exponential number of
|
||||
internal determinized automaton states (and corresponding RAM and CPU)
|
||||
for Lucene to execute. Lucene prevents these using the
|
||||
`max_determinized_states` setting (defaults to 10000). You can raise
|
||||
this limit to allow more complex regular expressions to execute.
|
||||
[WARNING]
|
||||
=====
|
||||
The performance of the `regexp` query can vary based on the regular expression
|
||||
provided. To improve performance, avoid using wildcard patterns, such as `.*` or
|
||||
`.*?+`, without a prefix or suffix.
|
||||
=====
|
||||
--
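The length limit and the wildcard warning above can be approximated on the client before a request is sent. The following is a minimal Python sketch, not part of Elasticsearch: the helper name `preflight_regexp` and its heuristic are hypothetical, and the 1,000-character default is taken from the text above. It only flags a leading `.*`; it does not estimate real query cost.

```python
from typing import List

DEFAULT_MAX_REGEX_LENGTH = 1000  # default stated above for index.max_regex_length


def preflight_regexp(pattern: str, max_length: int = DEFAULT_MAX_REGEX_LENGTH) -> List[str]:
    """Return a list of human-readable problems found in `pattern` (empty if none)."""
    problems = []
    if len(pattern) > max_length:
        problems.append(f"pattern is {len(pattern)} characters; limit is {max_length}")
    # Heuristic only (an assumption of this sketch): a leading wildcard
    # means the pattern has no literal prefix.
    if pattern.startswith(".*"):
        problems.append("pattern starts with a wildcard and has no literal prefix")
    return problems


print(preflight_regexp("k.*y"))      # pattern with a literal prefix
print(preflight_regexp(".*y"))       # leading wildcard
print(preflight_regexp("a" * 1001))  # longer than the default limit
```

If the index overrides `index.max_regex_length`, pass that value as `max_length` so the check stays in step with the server.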
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"regexp":{
|
||||
"name.first": {
|
||||
"value": "s.*y",
|
||||
"flags" : "INTERSECTION|COMPLEMENT|EMPTY",
|
||||
"max_determinized_states": 20000
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
// CONSOLE
|
||||
`flags`::
|
||||
(Optional, string) Enables optional operators for the regular expression. For
|
||||
valid values and more information, see <<regexp-optional-operators, Regular
|
||||
expression syntax>>.
|
||||
|
||||
NOTE: By default the maximum length of regex string allowed in a Regexp Query
|
||||
is limited to 1000. You can update the `index.max_regex_length` index setting
|
||||
to bypass this limit.
|
||||
`max_determinized_states`::
|
||||
+
|
||||
--
|
||||
(Optional, integer) Maximum number of
|
||||
https://en.wikipedia.org/wiki/Deterministic_finite_automaton[automaton states]
|
||||
required for the query. Default is `10000`.
|
||||
|
||||
include::regexp-syntax.asciidoc[]
|
||||
{es} uses https://lucene.apache.org/core/[Apache Lucene] internally to parse
|
||||
regular expressions. Lucene converts each regular expression to a finite
|
||||
automaton containing a number of determinized states.
|
||||
|
||||
You can use this parameter to prevent that conversion from unintentionally
|
||||
consuming too many resources. You may need to increase this limit to run complex
|
||||
regular expressions.
|
||||
--
|
||||
|
||||
`rewrite`::
|
||||
(Optional, string) Method used to rewrite the query. For valid values and more
|
||||
information, see the <<query-dsl-multi-term-rewrite, `rewrite` parameter>>.
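As a rough illustration of how the parameters above fit together, the Python sketch below assembles a `regexp` query body as a plain dictionary and serializes it to JSON. The helper `build_regexp_query` is hypothetical and is not a client-library function; the field name `user` and the parameter values mirror the example request.

```python
import json
from typing import Any, Dict, Optional


def build_regexp_query(
    field: str,
    value: str,
    flags: Optional[str] = None,
    max_determinized_states: Optional[int] = None,
    rewrite: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a search body containing a single `regexp` query."""
    params: Dict[str, Any] = {"value": value}  # `value` is the only required parameter
    # Optional parameters are included only when the caller sets them,
    # so the documented defaults apply otherwise.
    if flags is not None:
        params["flags"] = flags
    if max_determinized_states is not None:
        params["max_determinized_states"] = max_determinized_states
    if rewrite is not None:
        params["rewrite"] = rewrite
    return {"query": {"regexp": {field: params}}}


body = build_regexp_query(
    "user", "k.*y", flags="ALL", max_determinized_states=10000, rewrite="constant_score"
)
print(json.dumps(body, indent=2))
```

Leaving the optional arguments unset produces a body with only `value`, which keeps the request short when the defaults are acceptable.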
|
||||
|
|
|
@ -1,286 +1,224 @@
|
|||
[[regexp-syntax]]
|
||||
==== Regular expression syntax
|
||||
== Regular expression syntax
|
||||
|
||||
Regular expression queries are supported by the `regexp` and the `query_string`
|
||||
queries. The Lucene regular expression engine
|
||||
is not Perl-compatible but supports a smaller range of operators.
|
||||
A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
|
||||
match patterns in data using placeholder characters, called operators.
|
||||
|
||||
[NOTE]
|
||||
=====
|
||||
We will not attempt to explain regular expressions, but
|
||||
just explain the supported operators.
|
||||
=====
|
||||
{es} supports regular expressions in the following queries:
|
||||
|
||||
===== Standard operators
|
||||
* <<query-dsl-regexp-query, `regexp`>>
|
||||
* <<query-dsl-query-string-query, `query_string`>>
|
||||
|
||||
Anchoring::
|
||||
+
|
||||
--
|
||||
{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
|
||||
engine to parse these queries.
|
||||
|
||||
Most regular expression engines allow you to match any part of a string.
|
||||
If you want the regexp pattern to start at the beginning of the string or
|
||||
finish at the end of the string, then you have to _anchor_ it specifically,
|
||||
using `^` to indicate the beginning or `$` to indicate the end.
|
||||
|
||||
Lucene's patterns are always anchored. The pattern provided must match
|
||||
the entire string. For string `"abcde"`:
|
||||
|
||||
ab.* # match
|
||||
abcd # no match
|
||||
|
||||
--
|
||||
|
||||
Allowed characters::
|
||||
+
|
||||
--
|
||||
|
||||
Any Unicode characters may be used in the pattern, but certain characters
|
||||
are reserved and must be escaped. The standard reserved characters are:
|
||||
[float]
|
||||
[[regexp-reserved-characters]]
|
||||
=== Reserved characters
|
||||
Lucene's regular expression engine supports all Unicode characters. However, the
|
||||
following characters are reserved as operators:
|
||||
|
||||
....
|
||||
. ? + * | { } [ ] ( ) " \
|
||||
....
|
||||
|
||||
If you enable optional features (see below) then these characters may
|
||||
also be reserved:
|
||||
Depending on the <<regexp-optional-operators, optional operators>> enabled, the
|
||||
following characters may also be reserved:
|
||||
|
||||
# @ & < > ~
|
||||
....
|
||||
# @ & < > ~
|
||||
....
|
||||
|
||||
Any reserved character can be escaped with a backslash `"\*"` including
|
||||
a literal backslash character: `"\\"`
|
||||
To use one of these characters literally, escape it with a preceding
|
||||
backslash or surround it with double quotes. For example:
|
||||
|
||||
Additionally, any characters (except double quotes) are interpreted literally
|
||||
when surrounded by double quotes:
|
||||
|
||||
john"@smith.com"
|
||||
....
|
||||
\@ # renders as a literal '@'
|
||||
\\ # renders as a literal '\'
|
||||
"john@smith.com" # renders as 'john@smith.com'
|
||||
....
|
||||
|
||||
|
||||
--
|
||||
[float]
|
||||
[[regexp-standard-operators]]
|
||||
=== Standard operators
|
||||
|
||||
Match any character::
|
||||
Lucene's regular expression engine does not use the
|
||||
https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
|
||||
Compatible Regular Expressions (PCRE)] library, but it does support the
|
||||
following standard operators.
|
||||
|
||||
`.`::
|
||||
+
|
||||
--
|
||||
Matches any character. For example:
|
||||
|
||||
The period `"."` can be used to represent any character. For string `"abcde"`:
|
||||
|
||||
ab... # match
|
||||
a.c.e # match
|
||||
|
||||
....
|
||||
ab. # matches 'aba', 'abb', 'abz', etc.
|
||||
....
|
||||
--
|
||||
|
||||
One-or-more::
|
||||
`?`::
|
||||
+
|
||||
--
|
||||
Repeat the preceding character zero or one times. Often used to make the
|
||||
preceding character optional. For example:
|
||||
|
||||
The plus sign `"+"` can be used to repeat the preceding shortest pattern
|
||||
once or more times. For string `"aaabbb"`:
|
||||
|
||||
a+b+ # match
|
||||
aa+bb+ # match
|
||||
a+.+ # match
|
||||
aa+bbb+ # match
|
||||
|
||||
....
|
||||
abc? # matches 'ab' and 'abc'
|
||||
....
|
||||
--
|
||||
|
||||
Zero-or-more::
|
||||
`+`::
|
||||
+
|
||||
--
|
||||
Repeat the preceding character one or more times. For example:
|
||||
|
||||
The asterisk `"*"` can be used to match the preceding shortest pattern
|
||||
zero-or-more times. For string `"aaabbb"`:
|
||||
|
||||
a*b* # match
|
||||
a*b*c* # match
|
||||
.*bbb.* # match
|
||||
aaa*bbb* # match
|
||||
|
||||
....
|
||||
ab+ # matches 'ab', 'abb', 'abbb', etc.
|
||||
....
|
||||
--
|
||||
|
||||
Zero-or-one::
|
||||
`*`::
|
||||
+
|
||||
--
|
||||
Repeat the preceding character zero or more times. For example:
|
||||
|
||||
The question mark `"?"` makes the preceding shortest pattern optional. It
|
||||
matches zero or one times. For string `"aaabbb"`:
|
||||
|
||||
aaa?bbb? # match
|
||||
aaaa?bbbb? # match
|
||||
.....?.? # match
|
||||
aa?bb? # no match
|
||||
|
||||
....
|
||||
ab* # matches 'a', 'ab', 'abb', 'abbb', etc.
|
||||
....
|
||||
--
|
||||
|
||||
Min-to-max::
|
||||
`{}`::
|
||||
+
|
||||
--
|
||||
Minimum and maximum number of times the preceding character can repeat. For
|
||||
example:
|
||||
|
||||
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
|
||||
a maximum number of times the preceding shortest pattern can repeat. The
|
||||
allowed forms are:
|
||||
|
||||
{5} # repeat exactly 5 times
|
||||
{2,5} # repeat at least twice and at most 5 times
|
||||
{2,} # repeat at least twice
|
||||
|
||||
For string `"aaabbb"`:
|
||||
|
||||
a{3}b{3} # match
|
||||
a{2,4}b{2,4} # match
|
||||
a{2,}b{2,} # match
|
||||
.{3}.{3} # match
|
||||
a{4}b{4} # no match
|
||||
a{4,6}b{4,6} # no match
|
||||
a{4,}b{4,} # no match
|
||||
|
||||
....
|
||||
a{2} # matches 'aa'
|
||||
a{2,4} # matches 'aa', 'aaa', and 'aaaa'
|
||||
a{2,} # matches 'a' repeated two or more times
|
||||
....
|
||||
--
|
||||
|
||||
Grouping::
|
||||
`|`::
|
||||
+
|
||||
--
|
||||
|
||||
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
|
||||
listed above operate on the shortest previous pattern, which can be a group.
|
||||
For string `"ababab"`:
|
||||
|
||||
(ab)+ # match
|
||||
ab(ab)+ # match
|
||||
(..)+ # match
|
||||
(...)+ # no match
|
||||
(ab)* # match
|
||||
abab(ab)? # match
|
||||
ab(ab)? # no match
|
||||
(ab){3} # match
|
||||
(ab){1,2} # no match
|
||||
|
||||
OR operator. The match will succeed if the longest pattern on either the left
|
||||
side OR the right side matches. For example:
|
||||
....
|
||||
abc|xyz # matches 'abc' and 'xyz'
|
||||
....
|
||||
--
|
||||
|
||||
Alternation::
|
||||
`( … )`::
|
||||
+
|
||||
--
|
||||
Forms a group. You can use a group to treat part of the expression as a single
|
||||
character. For example:
|
||||
|
||||
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
|
||||
the pattern on either the left-hand side OR the right-hand side matches.
|
||||
The alternation applies to the _longest pattern_, not the shortest.
|
||||
For string `"aabb"`:
|
||||
|
||||
aabb|bbaa # match
|
||||
aacc|bb # no match
|
||||
aa(cc|bb) # match
|
||||
a+|b+ # no match
|
||||
a+b+|b+a+ # match
|
||||
a+(b|c)+ # match
|
||||
|
||||
....
|
||||
abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'
|
||||
....
|
||||
--
|
||||
|
||||
Character classes::
|
||||
`[ … ]`::
|
||||
+
|
||||
--
|
||||
Match one of the characters in the brackets. For example:
|
||||
|
||||
Ranges of potential characters may be represented as character classes
|
||||
by enclosing them in square brackets `"[]"`. A leading `^`
|
||||
negates the character class. The allowed forms are:
|
||||
....
|
||||
[abc] # matches 'a', 'b', 'c'
|
||||
....
|
||||
|
||||
[abc] # 'a' or 'b' or 'c'
|
||||
[a-c] # 'a' or 'b' or 'c'
|
||||
[-abc] # '-' or 'a' or 'b' or 'c'
|
||||
[abc\-] # '-' or 'a' or 'b' or 'c'
|
||||
[^abc] # any character except 'a' or 'b' or 'c'
|
||||
[^a-c] # any character except 'a' or 'b' or 'c'
|
||||
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
|
||||
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
|
||||
Inside the brackets, `-` indicates a range unless `-` is the first character or
|
||||
escaped. For example:
|
||||
|
||||
Note that the dash `"-"` indicates a range of characters, unless it is
|
||||
the first character or if it is escaped with a backslash.
|
||||
....
|
||||
[a-c] # matches 'a', 'b', or 'c'
|
||||
[-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'
|
||||
[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
|
||||
....
|
||||
|
||||
For string `"abcd"`:
|
||||
|
||||
ab[cd]+ # match
|
||||
[a-d]+ # match
|
||||
[^a-d]+ # no match
|
||||
A `^` before a character in the brackets negates the character or range. For
|
||||
example:
|
||||
|
||||
....
|
||||
[^abc] # matches any character except 'a', 'b', or 'c'
|
||||
[^a-c] # matches any character except 'a', 'b', or 'c'
|
||||
[^-abc] # matches any character except '-', 'a', 'b', or 'c'
|
||||
[^abc\-] # matches any character except 'a', 'b', 'c', or '-'
|
||||
....
|
||||
--
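For the simple patterns shown in this section, Python's `re` module interprets the standard operators the same way, so the examples can be spot-checked locally. This is only an approximation: `re` is not Lucene's engine, and `re.fullmatch` is used here as a stand-in for Lucene's whole-term matching. The `CASES` table is assembled from the examples above.

```python
import re

# (pattern, term, expected) triples based on the examples in this section.
CASES = [
    ("ab.", "abz", True),
    ("abc?", "ab", True),
    ("abc?", "abc", True),
    ("ab+", "abbb", True),
    ("ab*", "ab", True),
    ("a{2,4}", "aaa", True),
    ("a{2,4}", "aaaaa", False),
    ("abc|xyz", "xyz", True),
    ("abc(def)?", "abcdef", True),
    ("abc(def)?", "abcd", False),
    ("[a-c]", "b", True),
    ("[^abc]", "a", False),
    (r"[abc\-]", "-", True),
]

for pattern, term, expected in CASES:
    # fullmatch succeeds only if the pattern consumes the whole term.
    actual = re.fullmatch(pattern, term) is not None
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{pattern!r:14} {term!r:10} -> {actual} ({status})")
```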
|
||||
|
||||
===== Optional operators
|
||||
[float]
|
||||
[[regexp-optional-operators]]
|
||||
=== Optional operators
|
||||
|
||||
These operators are available by default as the `flags` parameter defaults to `ALL`.
|
||||
Different flag combinations (concatenated with `"|"`) can be used to enable/disable
|
||||
specific operators:
|
||||
You can use the `flags` parameter to enable more optional operators for
|
||||
Lucene's regular expression engine.
|
||||
|
||||
{
|
||||
"regexp": {
|
||||
"username": {
|
||||
"value": "john~athon<1-5>",
|
||||
"flags": "COMPLEMENT|INTERVAL"
|
||||
}
|
||||
}
|
||||
}
|
||||
To enable multiple operators, use a `|` separator. For example, a `flags` value
|
||||
of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.
|
||||
|
||||
Complement::
|
||||
[float]
|
||||
==== Valid values
|
||||
|
||||
`ALL` (Default)::
|
||||
Enables all optional operators.
|
||||
|
||||
`COMPLEMENT`::
|
||||
+
|
||||
--
|
||||
Enables the `~` operator. You can use `~` to negate the shortest following
|
||||
pattern. For example:
|
||||
|
||||
The complement is probably the most useful option. The shortest pattern that
|
||||
follows a tilde `"~"` is negated. For instance, `"ab~cd"` means:
|
||||
|
||||
* Starts with `a`
|
||||
* Followed by `b`
|
||||
* Followed by a string of any length that is anything but `c`
|
||||
* Ends with `d`
|
||||
|
||||
For the string `"abcdef"`:
|
||||
|
||||
ab~df # match
|
||||
ab~cf # match
|
||||
ab~cdef # no match
|
||||
a~(cb)def # match
|
||||
a~(bc)def # no match
|
||||
|
||||
Enabled with the `COMPLEMENT` or `ALL` flags.
|
||||
|
||||
....
|
||||
a~bc # matches 'adc' and 'aec' but not 'abc'
|
||||
....
|
||||
--
|
||||
|
||||
Interval::
|
||||
`INTERVAL`::
|
||||
+
|
||||
--
|
||||
Enables the `<>` operators. You can use `<>` to match a numeric range. For
|
||||
example:
|
||||
|
||||
The interval option enables the use of numeric ranges, enclosed by angle
|
||||
brackets `"<>"`. For string: `"foo80"`:
|
||||
|
||||
foo<1-100> # match
|
||||
foo<01-100> # match
|
||||
foo<001-100> # no match
|
||||
|
||||
Enabled with the `INTERVAL` or `ALL` flags.
|
||||
|
||||
|
||||
....
|
||||
foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
|
||||
foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
|
||||
....
|
||||
--
|
||||
|
||||
Intersection::
|
||||
`INTERSECTION`::
|
||||
+
|
||||
--
|
||||
Enables the `&` operator, which acts as an AND operator. The match will succeed
|
||||
if patterns on both the left side AND the right side match. For example:
|
||||
|
||||
The ampersand `"&"` joins two patterns in a way that both of them have to
|
||||
match. For string `"aaabbb"`:
|
||||
|
||||
aaa.+&.+bbb # match
|
||||
aaa&bbb # no match
|
||||
|
||||
Using this feature usually means that you should rewrite your regular
|
||||
expression.
|
||||
|
||||
Enabled with the `INTERSECTION` or `ALL` flags.
|
||||
|
||||
....
|
||||
aaa.+&.+bbb # matches 'aaabbb'
|
||||
....
|
||||
--
|
||||
|
||||
Any string::
|
||||
`ANYSTRING`::
|
||||
+
|
||||
--
|
||||
Enables the `@` operator. You can use `@` to match any entire
|
||||
string.
|
||||
|
||||
The at sign `"@"` matches any string in its entirety. This could be combined
|
||||
with the intersection and complement above to express ``everything except''.
|
||||
For instance:
|
||||
You can combine the `@` operator with `&` and `~` operators to create an
|
||||
"everything except" logic. For example:
|
||||
|
||||
@&~(foo.+) # anything except string beginning with "foo"
|
||||
|
||||
Enabled with the `ANYSTRING` or `ALL` flags.
|
||||
....
|
||||
@&~(abc.+) # matches everything except terms beginning with 'abc' plus at least one more character
|
||||
....
|
||||
--
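Python's `re` module has no complement (`~`), intersection (`&`), or any-string (`@`) operators, so the "everything except" pattern cannot be run there directly. The sketch below emulates `@&~(abc.+)` by negating a whole-term match. The function name `everything_except` is hypothetical, and the behavior shown is that of the emulation, not a trace of Lucene itself.

```python
import re


def everything_except(excluded_pattern: str, term: str) -> bool:
    """Emulate `@&~(<excluded_pattern>)`: accept any term the pattern does not fully match."""
    # re.DOTALL lets `.` also match a newline, so it covers every character.
    return re.fullmatch(excluded_pattern, term, flags=re.DOTALL) is None


for term in ("abcd", "abcdef", "abc", "xyz", ""):
    print(repr(term), everything_except("abc.+", term))
```

Under this emulation the bare term `abc` is accepted, because `abc.+` requires at least one character after `abc`.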
|
||||
|
||||
[float]
|
||||
[[regexp-unsupported-operators]]
|
||||
=== Unsupported operators
|
||||
Lucene's regular expression engine does not support anchor operators, such as
|
||||
`^` (beginning of line) or `$` (end of line). To match a term, the regular
|
||||
expression must match the entire string.
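To see what matching the entire string means in practice, the following Python sketch contrasts unanchored search with whole-string matching. It uses Python's `re` module purely as an analogy, since Lucene's engine is a separate implementation: `re.search` behaves like an unanchored engine, while `re.fullmatch` approximates the whole-term behavior described above.

```python
import re

term = "abcde"

for pattern in ("ab.*", "abcd"):
    unanchored = re.search(pattern, term) is not None     # may match any part of the string
    whole_term = re.fullmatch(pattern, term) is not None  # must consume the entire string
    print(f"{pattern!r}: search={unanchored}, fullmatch={whole_term}")
```

The pattern `abcd` is found by the unanchored search but is rejected once the whole term must match, which is why a trailing `.*` is needed to match `abcde`.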
|
|
@ -49,7 +49,7 @@ The value specified in the field rule can be one of the following types:
|
|||
| Simple String | Exactly matches the provided value. | "esadmin"
|
||||
| Wildcard String | Matches the provided value using a wildcard. | "*,dc=example,dc=com"
|
||||
| Regular Expression | Matches the provided value using a
|
||||
{ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp]. | "/.\*-admin[0-9]*/"
|
||||
{ref}/regexp-syntax.html[Lucene regexp]. | "/.\*-admin[0-9]*/"
|
||||
| Number | Matches an equivalent numerical value. | 7
|
||||
| Null | Matches a null or missing value. | null
|
||||
| Array | Tests each element in the array in
|
||||
|
|
|
@ -132,7 +132,7 @@ Please take time to review these policies whenever your system architecture chan
|
|||
|
||||
A policy is a named set of filter rules. Each filter rule applies to a single event attribute,
|
||||
one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines
|
||||
a list of {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp], *any* of which has to match the value of the audit
|
||||
a list of {ref}/regexp-syntax.html[Lucene regexp], *any* of which has to match the value of the audit
|
||||
event attribute for the rule to match.
|
||||
A policy matches an event if *all* the rules comprising it match the event.
|
||||
An audit event is ignored, therefore not printed, if it matches *any* policy. All other
|
||||
|
|