[DOCS] Rewrite `regexp` query (#42711)

James Rodewig 2019-07-24 08:37:37 -04:00
parent bfb2e323e9
commit ad7c164dd0
6 changed files with 212 additions and 283 deletions

@ -205,6 +205,7 @@ specific index module:
The maximum number of terms that can be used in Terms Query. The maximum number of terms that can be used in Terms Query.
Defaults to `65536`. Defaults to `65536`.
[[index-max-regex-length]]
`index.max_regex_length`:: `index.max_regex_length`::
The maximum length of regex that can be used in Regexp Query. The maximum length of regex that can be used in Regexp Query.

@ -47,4 +47,6 @@ include::query-dsl/term-level-queries.asciidoc[]
include::query-dsl/minimum-should-match.asciidoc[] include::query-dsl/minimum-should-match.asciidoc[]
include::query-dsl/multi-term-rewrite.asciidoc[] include::query-dsl/multi-term-rewrite.asciidoc[]
include::query-dsl/regexp-syntax.asciidoc[]

@ -4,98 +4,86 @@
<titleabbrev>Regexp</titleabbrev> <titleabbrev>Regexp</titleabbrev>
++++ ++++
The `regexp` query allows you to use regular expression term queries. Returns documents that contain terms matching a
See <<regexp-syntax>> for details of the supported regular expression language. https://en.wikipedia.org/wiki/Regular_expression[regular expression].
The "term queries" in that first sentence means that Elasticsearch will apply
the regexp to the terms produced by the tokenizer for that field, and not
to the original text of the field.
*Note*: The performance of a `regexp` query heavily depends on the A regular expression is a way to match patterns in data using placeholder
regular expression chosen. Matching everything like `.*` is very slow as characters, called operators. For a list of operators supported by the
well as using lookaround regular expressions. If possible, you should `regexp` query, see <<regexp-syntax, Regular expression syntax>>.
try to use a long prefix before your regular expression starts. Wildcard
matchers like `.*?+` will mostly lower performance. [[regexp-query-ex-request]]
==== Example request
The following search returns documents where the `user` field contains any term
that begins with `k` and ends with `y`. The `.*` operators match any
characters of any length, including no characters. Matching
terms can include `ky`, `kay`, and `kimchy`.
[source,js] [source,js]
-------------------------------------------------- ----
GET /_search GET /_search
{ {
"query": { "query": {
"regexp":{ "regexp": {
"name.first": "s.*y" "user": {
} "value": "k.*y",
} "flags" : "ALL",
} "max_determinized_states": 10000,
-------------------------------------------------- "rewrite": "constant_score"
// CONSOLE
Boosting is also supported
[source,js]
--------------------------------------------------
GET /_search
{
"query": {
"regexp":{
"name.first":{
"value":"s.*y",
"boost":1.2
} }
} }
} }
} }
-------------------------------------------------- ----
// CONSOLE // CONSOLE
You can also use special flags
[source,js] [[regexp-top-level-params]]
-------------------------------------------------- ==== Top-level parameters for `regexp`
GET /_search `<field>`::
{ (Required, object) Field you wish to search.
"query": {
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY"
}
}
}
}
--------------------------------------------------
// CONSOLE
Possible flags are `ALL` (default), `ANYSTRING`, `COMPLEMENT`, [[regexp-query-field-params]]
`EMPTY`, `INTERSECTION`, `INTERVAL`, or `NONE`. Please check the ==== Parameters for `<field>`
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html[Lucene `value`::
documentation] for their meaning (Required, string) Regular expression for terms you wish to find in the provided
`<field>`. For a list of supported operators, see <<regexp-syntax, Regular
expression syntax>>.
+
--
By default, regular expressions are limited to 1,000 characters. You can change
this limit using the <<index-max-regex-length, `index.max_regex_length`>>
setting.
Regular expressions are dangerous because it's easy to accidentally [WARNING]
create an innocuous looking one that requires an exponential number of =====
internal determinized automaton states (and corresponding RAM and CPU) The performance of the `regexp` query can vary based on the regular expression
for Lucene to execute. Lucene prevents these using the provided. To improve performance, avoid using wildcard patterns, such as `.*` or
`max_determinized_states` setting (defaults to 10000). You can raise `.*?+`, without a prefix or suffix.
this limit to allow more complex regular expressions to execute. =====
--
[source,js] `flags`::
-------------------------------------------------- (Optional, string) Enables optional operators for the regular expression. For
GET /_search valid values and more information, see <<regexp-optional-operators, Regular
{ expression syntax>>.
"query": {
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY",
"max_determinized_states": 20000
}
}
}
}
--------------------------------------------------
// CONSOLE
NOTE: By default the maximum length of regex string allowed in a Regexp Query `max_determinized_states`::
is limited to 1000. You can update the `index.max_regex_length` index setting +
to bypass this limit. --
(Optional, integer) Maximum number of
https://en.wikipedia.org/wiki/Deterministic_finite_automaton[automaton states]
required for the query. Default is `10000`.
include::regexp-syntax.asciidoc[] {es} uses https://lucene.apache.org/core/[Apache Lucene] internally to parse
regular expressions. Lucene converts each regular expression to a finite
automaton containing a number of determinized states.
You can use this parameter to prevent that conversion from unintentionally
consuming too many resources. You may need to increase this limit to run complex
regular expressions.
--
`rewrite`::
(Optional, string) Method used to rewrite the query. For valid values and more
information, see the <<query-dsl-multi-term-rewrite, `rewrite` parameter>>.
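The parameters above can be assembled programmatically. The following is a minimal sketch, not part of the commit: `build_regexp_query` is a hypothetical helper, and the client-side length check only mirrors the documented 1,000-character default of `index.max_regex_length` (the cluster enforces the real limit).

```python
import json

# Assumed default, taken from the text above (index.max_regex_length).
DEFAULT_MAX_REGEX_LENGTH = 1000


def build_regexp_query(field, value, flags=None, max_determinized_states=None,
                       rewrite=None, max_regex_length=DEFAULT_MAX_REGEX_LENGTH):
    """Return a search body with a single `regexp` query (hypothetical helper)."""
    if len(value) > max_regex_length:
        raise ValueError(
            f"regular expression is {len(value)} characters; "
            f"limit is {max_regex_length}"
        )
    params = {"value": value}
    if flags:
        # Multiple optional operators are joined with a `|` separator.
        params["flags"] = "|".join(flags)
    if max_determinized_states is not None:
        params["max_determinized_states"] = max_determinized_states
    if rewrite is not None:
        params["rewrite"] = rewrite
    return {"query": {"regexp": {field: params}}}


body = build_regexp_query(
    "user", "k.*y",
    flags=["ALL"],
    max_determinized_states=10000,
    rewrite="constant_score",
)
print(json.dumps(body, indent=2))
```

The resulting dictionary has the same shape as the example request and could be sent as the body of `GET /_search` with any HTTP client.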

@ -1,286 +1,224 @@
[[regexp-syntax]] [[regexp-syntax]]
==== Regular expression syntax == Regular expression syntax
Regular expression queries are supported by the `regexp` and the `query_string` A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
queries. The Lucene regular expression engine match patterns in data using placeholder characters, called operators.
is not Perl-compatible but supports a smaller range of operators.
[NOTE] {es} supports regular expressions in the following queries:
=====
We will not attempt to explain regular expressions, but
just explain the supported operators.
=====
===== Standard operators * <<query-dsl-regexp-query, `regexp`>>
* <<query-dsl-query-string-query, `query_string`>>
Anchoring:: {es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
+ engine to parse these queries.
--
Most regular expression engines allow you to match any part of a string. [float]
If you want the regexp pattern to start at the beginning of the string or [[regexp-reserved-characters]]
finish at the end of the string, then you have to _anchor_ it specifically, === Reserved characters
using `^` to indicate the beginning or `$` to indicate the end. Lucene's regular expression engine supports all Unicode characters. However, the
following characters are reserved as operators:
Lucene's patterns are always anchored. The pattern provided must match
the entire string. For string `"abcde"`:
ab.* # match
abcd # no match
--
Allowed characters::
+
--
Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped. The standard reserved characters are:
.... ....
. ? + * | { } [ ] ( ) " \ . ? + * | { } [ ] ( ) " \
.... ....
If you enable optional features (see below) then these characters may Depending on the <<regexp-optional-operators, optional operators>> enabled, the
also be reserved: following characters may also be reserved:
# @ & < > ~ ....
# @ & < > ~
....
Any reserved character can be escaped with a backslash `"\*"` including To use one of these characters literally, escape it with a preceding
a literal backslash character: `"\\"` backslash or surround it with double quotes. For example:
Additionally, any characters (except double quotes) are interpreted literally ....
when surrounded by double quotes: \@ # renders as a literal '@'
\\ # renders as a literal '\'
"john@smith.com" # renders as 'john@smith.com'
....
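When a term comes from user input, the backslash escaping described above can be applied mechanically. The sketch below is an illustration only: `escape_regexp_literal` is a hypothetical helper, and it conservatively escapes both the standard reserved characters and the ones reserved by optional operators.

```python
# Reserved characters as listed above; the second set depends on enabled flags.
STANDARD_RESERVED = set('.?+*|{}[]()"\\')
OPTIONAL_RESERVED = set('#@&<>~')


def escape_regexp_literal(text):
    """Prefix every reserved character with a backslash (hypothetical helper)."""
    reserved = STANDARD_RESERVED | OPTIONAL_RESERVED
    return "".join("\\" + ch if ch in reserved else ch for ch in text)


print(escape_regexp_literal("john@smith.com"))  # john\@smith\.com
```

Escaping the optional set unconditionally keeps the helper independent of the `flags` value at the cost of a few extra backslashes.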
john"@smith.com" [float]
[[regexp-standard-operators]]
=== Standard operators
Lucene's regular expression engine does not use the
https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
Compatible Regular Expressions (PCRE)] library, but it does support the
following standard operators.
-- `.`::
Match any character::
+ +
-- --
Matches any character. For example:
The period `"."` can be used to represent any character. For string `"abcde"`: ....
ab. # matches 'aba', 'abb', 'abz', etc.
ab... # match ....
a.c.e # match
-- --
One-or-more:: `?`::
+ +
-- --
Repeat the preceding character zero or one times. Often used to make the
preceding character optional. For example:
The plus sign `"+"` can be used to repeat the preceding shortest pattern ....
once or more times. For string `"aaabbb"`: abc? # matches 'ab' and 'abc'
....
a+b+ # match
aa+bb+ # match
a+.+ # match
aa+bbb+ # match
-- --
Zero-or-more:: `+`::
+ +
-- --
Repeat the preceding character one or more times. For example:
The asterisk `"*"` can be used to match the preceding shortest pattern ....
zero-or-more times. For string `"aaabbb"`: ab+ # matches 'ab', 'abb', 'abbb', etc.
....
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
-- --
Zero-or-one:: `*`::
+ +
-- --
Repeat the preceding character zero or more times. For example:
The question mark `"?"` makes the preceding shortest pattern optional. It ....
matches zero or one times. For string `"aaabbb"`: ab* # matches 'a', 'ab', 'abb', 'abbb', etc.
....
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
-- --
Min-to-max:: `{}`::
+ +
-- --
Minimum and maximum number of times the preceding character can repeat. For
example:
Curly brackets `"{}"` can be used to specify a minimum and (optionally) ....
a maximum number of times the preceding shortest pattern can repeat. The a{2} # matches 'aa'
allowed forms are: a{2,4} # matches 'aa', 'aaa', and 'aaaa'
a{2,} # matches 'a' repeated two or more times
{5} # repeat exactly 5 times ....
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string `"aaabbb"`:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
-- --
Grouping:: `|`::
+ +
-- --
OR operator. The match will succeed if the longest pattern on either the left
Parentheses `"()"` can be used to form sub-patterns. The quantity operators side OR the right side matches. For example:
listed above operate on the shortest previous pattern, which can be a group. ....
For string `"ababab"`: abc|xyz # matches 'abc' and 'xyz'
....
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
-- --
Alternation:: `( … )`::
+ +
-- --
Forms a group. You can use a group to treat part of the expression as a single
character. For example:
The pipe symbol `"|"` acts as an OR operator. The match will succeed if ....
the pattern on either the left-hand side OR the right-hand side matches. abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'
The alternation applies to the _longest pattern_, not the shortest. ....
For string `"aabb"`:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
-- --
Character classes:: `[ … ]`::
+ +
-- --
Match one of the characters in the brackets. For example:
Ranges of potential characters may be represented as character classes ....
by enclosing them in square brackets `"[]"`. A leading `^` [abc] # matches 'a', 'b', 'c'
negates the character class. The allowed forms are: ....
[abc] # 'a' or 'b' or 'c' Inside the brackets, `-` indicates a range unless `-` is the first character or
[a-c] # 'a' or 'b' or 'c' escaped. For example:
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
Note that the dash `"-"` indicates a range of characters, unless it is ....
the first character or if it is escaped with a backslash. [a-c] # matches 'a', 'b', or 'c'
[-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'
[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
....
For string `"abcd"`: A `^` before a character in the brackets negates the character or range. For
example:
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
....
[^abc] # matches any character except 'a', 'b', or 'c'
[^a-c] # matches any character except 'a', 'b', or 'c'
[^-abc] # matches any character except '-', 'a', 'b', or 'c'
[^abc\-] # matches any character except 'a', 'b', 'c', or '-'
....
-- --
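The standard operators listed above behave the same way in many regular expression engines, so they can be tried out locally. The sketch below uses Python's `re` module with `re.fullmatch` to approximate whole-term matching. This is an assumption made for illustration: Python's engine is not Lucene's, and the two differ outside this small shared subset (for example, `"` and the optional operators).

```python
import re

# (pattern, term, expected) triples based on the operator examples above.
CASES = [
    ("ab.", "abz", True),
    ("abc?", "ab", True),
    ("abc?", "abc", True),
    ("ab+", "abbb", True),
    ("ab*", "abb", True),
    ("a{2,4}", "aaa", True),
    ("a{2,4}", "aaaaa", False),
    ("abc|xyz", "xyz", True),
    ("abc(def)?", "abcdef", True),
    ("abc(def)?", "abcd", False),
    ("[a-c]", "b", True),
    ("[-abc]", "-", True),
    ("[^abc]", "d", True),
    ("[^abc]", "a", False),
]


def matches_whole_term(pattern, term):
    """True if the pattern matches the entire term, not just part of it."""
    return re.fullmatch(pattern, term) is not None


for pattern, term, _expected in CASES:
    print(f"{pattern!r} vs {term!r}: {matches_whole_term(pattern, term)}")
```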
===== Optional operators [float]
[[regexp-optional-operators]]
=== Optional operators
These operators are available by default as the `flags` parameter defaults to `ALL`. You can use the `flags` parameter to enable more optional operators for
Different flag combinations (concatenated with `"|"`) can be used to enable/disable Lucene's regular expression engine.
specific operators:
{ To enable multiple operators, use a `|` separator. For example, a `flags` value
"regexp": { of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.
"username": {
"value": "john~athon<1-5>",
"flags": "COMPLEMENT|INTERVAL"
}
}
}
Complement:: [float]
==== Valid values
`ALL` (Default)::
Enables all optional operators.
`COMPLEMENT`::
+ +
-- --
Enables the `~` operator. You can use `~` to negate the shortest following
pattern. For example:
The complement is probably the most useful option. The shortest pattern that ....
follows a tilde `"~"` is negated. For instance, `"ab~cd"` means: a~bc # matches 'adc' and 'aec' but not 'abc'
....
* Starts with `a`
* Followed by `b`
* Followed by a string of any length that is anything but `c`
* Ends with `d`
For the string `"abcdef"`:
ab~df # match
ab~cf # match
ab~cdef # no match
a~(cb)def # match
a~(bc)def # no match
Enabled with the `COMPLEMENT` or `ALL` flags.
-- --
Interval:: `INTERVAL`::
+ +
-- --
Enables the `<>` operators. You can use `<>` to match a numeric range. For
example:
The interval option enables the use of numeric ranges, enclosed by angle ....
brackets `"<>"`. For string: `"foo80"`: foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
foo<1-100> # match ....
foo<01-100> # match
foo<001-100> # no match
Enabled with the `INTERVAL` or `ALL` flags.
-- --
Intersection:: `INTERSECTION`::
+ +
-- --
Enables the `&` operator, which acts as an AND operator. The match will succeed
if the patterns on both the left side AND the right side match. For example:
The ampersand `"&"` joins two patterns in a way that both of them have to ....
match. For string `"aaabbb"`: aaa.+&.+bbb # matches 'aaabbb'
....
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular
expression.
Enabled with the `INTERSECTION` or `ALL` flags.
-- --
Any string:: `ANYSTRING`::
+ +
-- --
Enables the `@` operator. You can use `@` to match any entire
string.
The at sign `"@"` matches any string in its entirety. This could be combined You can combine the `@` operator with the `&` and `~` operators to create
with the intersection and complement above to express ``everything except''. "everything except" logic. For example:
For instance:
@&~(foo.+) # anything except string beginning with "foo" ....
@&~(abc.+) # matches everything except terms beginning with 'abc'
Enabled with the `ANYSTRING` or `ALL` flags. ....
-- --
[float]
[[regexp-unsupported-operators]]
=== Unsupported operators
Lucene's regular expression engine does not support anchor operators, such as
`^` (beginning of line) or `$` (end of line). To match a term, the regular
expression must match the entire string.
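The difference between whole-string matching and the unanchored matching common in other engines can be seen with a short comparison. This sketch uses Python's `re` module as a stand-in, which is an assumption for illustration and not Lucene's engine: `re.search` behaves like an unanchored match, while `re.fullmatch` approximates the behavior described above.

```python
import re

term = "abcde"

# Unanchored search: succeeds because "abcd" occurs inside the term.
unanchored = re.search("abcd", term) is not None

# Whole-term matching: the pattern must cover the entire term.
whole_abcd = re.fullmatch("abcd", term) is not None
whole_prefix = re.fullmatch("ab.*", term) is not None

print(unanchored, whole_abcd, whole_prefix)  # True False True
```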

@ -49,7 +49,7 @@ The value specified in the field rule can be one of the following types:
| Simple String | Exactly matches the provided value. | "esadmin" | Simple String | Exactly matches the provided value. | "esadmin"
| Wildcard String | Matches the provided value using a wildcard. | "*,dc=example,dc=com" | Wildcard String | Matches the provided value using a wildcard. | "*,dc=example,dc=com"
| Regular Expression | Matches the provided value using a | Regular Expression | Matches the provided value using a
{ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp]. | "/.\*-admin[0-9]*/" {ref}/regexp-syntax.html[Lucene regexp]. | "/.\*-admin[0-9]*/"
| Number | Matches an equivalent numerical value. | 7 | Number | Matches an equivalent numerical value. | 7
| Null | Matches a null or missing value. | null | Null | Matches a null or missing value. | null
| Array | Tests each element in the array in | Array | Tests each element in the array in

@ -132,7 +132,7 @@ Please take time to review these policies whenever your system architecture chan
A policy is a named set of filter rules. Each filter rule applies to a single event attribute, A policy is a named set of filter rules. Each filter rule applies to a single event attribute,
one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines
a list of {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp], *any* of which has to match the value of the audit a list of {ref}/regexp-syntax.html[Lucene regexp], *any* of which has to match the value of the audit
event attribute for the rule to match. event attribute for the rule to match.
A policy matches an event if *all* the rules comprising it match the event. A policy matches an event if *all* the rules comprising it match the event.
An audit event is ignored, therefore not printed, if it matches *any* policy. All other An audit event is ignored, therefore not printed, if it matches *any* policy. All other