[DOCS] Rewrite `regexp` query (#42711)

James Rodewig 2019-07-24 08:37:37 -04:00
parent bfb2e323e9
commit ad7c164dd0
6 changed files with 212 additions and 283 deletions


@ -205,6 +205,7 @@ specific index module:
The maximum number of terms that can be used in Terms Query.
Defaults to `65536`.
[[index-max-regex-length]]
`index.max_regex_length`::
The maximum length of regex that can be used in Regexp Query.


@ -47,4 +47,6 @@ include::query-dsl/term-level-queries.asciidoc[]
include::query-dsl/minimum-should-match.asciidoc[]
include::query-dsl/multi-term-rewrite.asciidoc[]
include::query-dsl/multi-term-rewrite.asciidoc[]
include::query-dsl/regexp-syntax.asciidoc[]


@ -4,98 +4,86 @@
<titleabbrev>Regexp</titleabbrev>
++++
The `regexp` query allows you to use regular expression term queries.
See <<regexp-syntax>> for details of the supported regular expression language.
The "term queries" in that first sentence means that Elasticsearch will apply
the regexp to the terms produced by the tokenizer for that field, and not
to the original text of the field.
Returns documents that contain terms matching a
https://en.wikipedia.org/wiki/Regular_expression[regular expression].
*Note*: The performance of a `regexp` query heavily depends on the
regular expression chosen. Matching everything like `.*` is very slow as
well as using lookaround regular expressions. If possible, you should
try to use a long prefix before your regular expression starts. Wildcard
matchers like `.*?+` will mostly lower performance.
A regular expression is a way to match patterns in data using placeholder
characters, called operators. For a list of operators supported by the
`regexp` query, see <<regexp-syntax, Regular expression syntax>>.
[[regexp-query-ex-request]]
==== Example request
The following search returns documents where the `user` field contains any term
that begins with `k` and ends with `y`. The `.*` operators match any
characters of any length, including no characters. Matching
terms can include `ky`, `kay`, and `kimchy`.
[source,js]
--------------------------------------------------
----
GET /_search
{
"query": {
"regexp":{
"name.first": "s.*y"
}
}
}
--------------------------------------------------
// CONSOLE
Boosting is also supported
[source,js]
--------------------------------------------------
GET /_search
{
"query": {
"regexp":{
"name.first":{
"value":"s.*y",
"boost":1.2
"regexp": {
"user": {
"value": "k.*y",
"flags" : "ALL",
"max_determinized_states": 10000,
"rewrite": "constant_score"
}
}
}
}
--------------------------------------------------
----
// CONSOLE
You can also use special flags
[source,js]
--------------------------------------------------
GET /_search
{
"query": {
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY"
}
}
}
}
--------------------------------------------------
// CONSOLE
[[regexp-top-level-params]]
==== Top-level parameters for `regexp`
`<field>`::
(Required, object) Field you wish to search.
Possible flags are `ALL` (default), `ANYSTRING`, `COMPLEMENT`,
`EMPTY`, `INTERSECTION`, `INTERVAL`, or `NONE`. Please check the
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html[Lucene
documentation] for their meaning
[[regexp-query-field-params]]
==== Parameters for `<field>`
`value`::
(Required, string) Regular expression for terms you wish to find in the provided
`<field>`. For a list of supported operators, see <<regexp-syntax, Regular
expression syntax>>.
+
--
By default, regular expressions are limited to 1,000 characters. You can change
this limit using the <<index-max-regex-length, `index.max_regex_length`>>
setting.
Regular expressions are dangerous because it's easy to accidentally
create an innocuous looking one that requires an exponential number of
internal determinized automaton states (and corresponding RAM and CPU)
for Lucene to execute. Lucene prevents these using the
`max_determinized_states` setting (defaults to 10000). You can raise
this limit to allow more complex regular expressions to execute.
[WARNING]
=====
The performance of the `regexp` query can vary based on the regular expression
provided. To improve performance, avoid using wildcard patterns, such as `.*` or
`.*?+`, without a prefix or suffix.
=====
--
[source,js]
--------------------------------------------------
GET /_search
{
"query": {
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY",
"max_determinized_states": 20000
}
}
}
}
--------------------------------------------------
// CONSOLE
`flags`::
(Optional, string) Enables optional operators for the regular expression. For
valid values and more information, see <<regexp-optional-operators, Regular
expression syntax>>.
NOTE: By default the maximum length of regex string allowed in a Regexp Query
is limited to 1000. You can update the `index.max_regex_length` index setting
to bypass this limit.
`max_determinized_states`::
+
--
(Optional, integer) Maximum number of
https://en.wikipedia.org/wiki/Deterministic_finite_automaton[automaton states]
required for the query. Default is `10000`.
include::regexp-syntax.asciidoc[]
{es} uses https://lucene.apache.org/core/[Apache Lucene] internally to parse
regular expressions. Lucene converts each regular expression to a finite
automaton containing a number of determinized states.
You can use this parameter to prevent that conversion from unintentionally
consuming too many resources. You may need to increase this limit to run complex
regular expressions.
--
`rewrite`::
(Optional, string) Method used to rewrite the query. For valid values and more
information, see the <<query-dsl-multi-term-rewrite, `rewrite` parameter>>.
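
The parameters above can be assembled into a request body programmatically. The following is a minimal Python sketch, not part of the {es} clients: `build_regexp_query` is a hypothetical helper, and its default values simply mirror the defaults and the example request described above. The length check assumes the default `index.max_regex_length` of 1,000 characters.

```python
import json

# Default of index.max_regex_length as stated in the text above (assumption:
# the target index has not overridden this setting).
DEFAULT_MAX_REGEX_LENGTH = 1000


def build_regexp_query(field, value, flags="ALL",
                       max_determinized_states=10000,
                       rewrite="constant_score",
                       max_regex_length=DEFAULT_MAX_REGEX_LENGTH):
    """Return a search request body with a single regexp query (hypothetical helper)."""
    # Reject patterns that would exceed the configured length limit before
    # sending the request.
    if len(value) > max_regex_length:
        raise ValueError(
            f"regular expression is {len(value)} characters long; "
            f"the limit is {max_regex_length}"
        )
    return {
        "query": {
            "regexp": {
                field: {
                    "value": value,
                    "flags": flags,
                    "max_determinized_states": max_determinized_states,
                    "rewrite": rewrite,
                }
            }
        }
    }


body = build_regexp_query("user", "k.*y")
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the example request and can be sent as the body of `GET /_search`.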


@ -1,286 +1,224 @@
[[regexp-syntax]]
==== Regular expression syntax
== Regular expression syntax
Regular expression queries are supported by the `regexp` and the `query_string`
queries. The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.
A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
match patterns in data using placeholder characters, called operators.
[NOTE]
=====
We will not attempt to explain regular expressions, but
just explain the supported operators.
=====
{es} supports regular expressions in the following queries:
===== Standard operators
* <<query-dsl-regexp-query, `regexp`>>
* <<query-dsl-query-string-query, `query_string`>>
Anchoring::
+
--
{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
engine to parse these queries.
Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.
Lucene's patterns are always anchored. The pattern provided must match
the entire string. For string `"abcde"`:
ab.* # match
abcd # no match
--
Allowed characters::
+
--
Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped. The standard reserved characters are:
[float]
[[regexp-reserved-characters]]
=== Reserved characters
Lucene's regular expression engine supports all Unicode characters. However, the
following characters are reserved as operators:
....
. ? + * | { } [ ] ( ) " \
....
If you enable optional features (see below) then these characters may
also be reserved:
Depending on the <<regexp-optional-operators, optional operators>> enabled, the
following characters may also be reserved:
# @ & < > ~
....
# @ & < > ~
....
Any reserved character can be escaped with a backslash `"\*"` including
a literal backslash character: `"\\"`
To use one of these characters literally, escape it with a preceding
backslash or surround it with double quotes. For example:
Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:
....
\@ # renders as a literal '@'
\\ # renders as a literal '\'
"john@smith.com" # renders as 'john@smith.com'
....
john"@smith.com"
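
When a search term comes from user input, the reserved characters listed above need to be escaped before the term is embedded in a pattern. The following Python sketch shows one way to do that with backslashes. `escape_regexp_literal` is a hypothetical helper, and it escapes only the characters named in this section.

```python
# Reserved characters as listed above: the standard set, plus the set that is
# reserved only when the corresponding optional operators are enabled.
STANDARD_RESERVED = '.?+*|{}[]()"\\'
OPTIONAL_RESERVED = "#@&<>~"


def escape_regexp_literal(text, include_optional=True):
    """Prefix every reserved character in text with a backslash (hypothetical helper)."""
    reserved = set(STANDARD_RESERVED)
    if include_optional:
        reserved |= set(OPTIONAL_RESERVED)
    return "".join("\\" + ch if ch in reserved else ch for ch in text)


print(escape_regexp_literal("john@smith.com"))  # john\@smith\.com
```

Passing `include_optional=False` leaves characters such as `@` untouched, which is only appropriate when the optional operators are disabled.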
[float]
[[regexp-standard-operators]]
=== Standard operators
Lucene's regular expression engine does not use the
https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
Compatible Regular Expressions (PCRE)] library, but it does support the
following standard operators.
--
Match any character::
`.`::
+
--
Matches any character. For example:
The period `"."` can be used to represent any character. For string `"abcde"`:
ab... # match
a.c.e # match
....
ab. # matches 'aba', 'abb', 'abz', etc.
....
--
One-or-more::
`?`::
+
--
Repeat the preceding character zero or one times. Often used to make the
preceding character optional. For example:
The plus sign `"+"` can be used to repeat the preceding shortest pattern
once or more times. For string `"aaabbb"`:
a+b+ # match
aa+bb+ # match
a+.+ # match
aa+bbb+ # match
....
abc? # matches 'ab' and 'abc'
....
--
Zero-or-more::
`+`::
+
--
Repeat the preceding character one or more times. For example:
The asterisk `"*"` can be used to match the preceding shortest pattern
zero-or-more times. For string `"aaabbb"`:
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
....
ab+ # matches 'ab', 'abb', 'abbb', etc.
....
--
Zero-or-one::
`*`::
+
--
Repeat the preceding character zero or more times. For example:
The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times. For string `"aaabbb"`:
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
....
ab* # matches 'a', 'ab', 'abb', 'abbb', etc.
....
--
Min-to-max::
`{}`::
+
--
Minimum and maximum number of times the preceding character can repeat. For
example:
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat. The
allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string `"aaabbb"`:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
....
a{2} # matches 'aa'
a{2,4} # matches 'aa', 'aaa', and 'aaaa'
a{2,} # matches 'a' repeated two or more times
....
--
Grouping::
`|`::
+
--
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
OR operator. The match will succeed if the longest pattern on either the left
side OR the right side matches. For example:
....
abc|xyz # matches 'abc' and 'xyz'
....
--
Alternation::
`( … )`::
+
--
Forms a group. You can use a group to treat part of the expression as a single
character. For example:
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
....
abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'
....
--
Character classes::
`[ … ]`::
+
--
Match one of the characters in the brackets. For example:
Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:
....
[abc] # matches 'a', 'b', 'c'
....
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
Inside the brackets, `-` indicates a range unless `-` is the first character or
escaped. For example:
Note that the dash `"-"` indicates a range of characters, unless it is
the first character or if it is escaped with a backslash.
....
[a-c] # matches 'a', 'b', or 'c'
[-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'
[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
....
For string `"abcd"`:
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
A `^` before a character in the brackets negates the character or range. For
example:
....
[^abc] # matches any character except 'a', 'b', or 'c'
[^a-c] # matches any character except 'a', 'b', or 'c'
[^-abc] # matches any character except '-', 'a', 'b', or 'c'
[^abc\-] # matches any character except 'a', 'b', 'c', or '-'
....
--
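
The standard operators behave the same way in many regular expression engines, so their examples can be checked locally. The following Python sketch uses the standard `re` module, which is not Lucene's engine. It is an approximation that assumes `re.fullmatch` is a fair stand-in for Lucene's whole-term matching for these particular operators. `matches_whole_term` is a hypothetical helper.

```python
import re

# (pattern, term, expected) triples based on the operator examples above.
CASES = [
    ("ab.", "abz", True),          # '.' matches any single character
    ("abc?", "ab", True),          # '?' makes the preceding 'c' optional
    ("ab+", "a", False),           # '+' needs at least one 'b'
    ("ab*", "a", True),            # '*' allows zero 'b' characters
    ("a{2,4}", "aaaaa", False),    # at most four repetitions
    ("abc|xyz", "xyz", True),      # either side of '|' may match
    ("abc(def)?", "abcd", False),  # the group is optional as a whole
    ("[a-c]", "b", True),          # '-' forms a range inside brackets
    ("[^abc]", "a", False),        # leading '^' negates the class
]


def matches_whole_term(pattern, term):
    """Return True if pattern matches the entire term (hypothetical helper)."""
    return re.fullmatch(pattern, term) is not None


for pattern, term, expected in CASES:
    result = matches_whole_term(pattern, term)
    assert result == expected
    print(f"{pattern!r:14} {term!r:10} {result}")
```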
===== Optional operators
[float]
[[regexp-optional-operators]]
=== Optional operators
These operators are available by default as the `flags` parameter defaults to `ALL`.
Different flag combinations (concatenated with `"|"`) can be used to enable/disable
specific operators:
You can use the `flags` parameter to enable more optional operators for
Lucene's regular expression engine.
{
"regexp": {
"username": {
"value": "john~athon<1-5>",
"flags": "COMPLEMENT|INTERVAL"
}
}
}
To enable multiple operators, use a `|` separator. For example, a `flags` value
of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.
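
A `flags` value can be built from a list of operator names. The following Python sketch joins the names with `|` and rejects anything outside the values documented for this parameter (`ALL`, `ANYSTRING`, `COMPLEMENT`, `EMPTY`, `INTERSECTION`, `INTERVAL`, and `NONE`). `join_flags` is a hypothetical helper, not an {es} API.

```python
# Flag names documented for the `flags` parameter.
VALID_FLAGS = {
    "ALL", "ANYSTRING", "COMPLEMENT", "EMPTY",
    "INTERSECTION", "INTERVAL", "NONE",
}


def join_flags(*flags):
    """Join flag names with '|' after validating them (hypothetical helper)."""
    if not flags:
        raise ValueError("at least one flag is required")
    unknown = [flag for flag in flags if flag not in VALID_FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {', '.join(unknown)}")
    return "|".join(flags)


print(join_flags("COMPLEMENT", "INTERVAL"))  # COMPLEMENT|INTERVAL
```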
Complement::
[float]
==== Valid values
`ALL` (Default)::
Enables all optional operators.
`COMPLEMENT`::
+
--
Enables the `~` operator. You can use `~` to negate the shortest following
pattern. For example:
The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated. For instance, `"ab~cd"` means:
* Starts with `a`
* Followed by `b`
* Followed by a string of any length that is anything but `c`
* Ends with `d`
For the string `"abcdef"`:
ab~df # match
ab~cf # match
ab~cdef # no match
a~(cb)def # match
a~(bc)def # no match
Enabled with the `COMPLEMENT` or `ALL` flags.
....
a~bc # matches 'adc' and 'aec' but not 'abc'
....
--
Interval::
`INTERVAL`::
+
--
Enables the `<>` operators. You can use `<>` to match a numeric range. For
example:
The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string: `"foo80"`:
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the `INTERVAL` or `ALL` flags.
....
foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
....
--
Intersection::
`INTERSECTION`::
+
--
Enables the `&` operator, which acts as an AND operator. The match will succeed
if the patterns on both the left side AND the right side match. For example:
The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular
expression.
Enabled with the `INTERSECTION` or `ALL` flags.
....
aaa.+&.+bbb # matches 'aaabbb'
....
--
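
The semantics of `&` can be illustrated without Lucene. Python's `re` module has no intersection operator, so the following sketch emulates it by requiring that both sub-patterns match the whole term. `matches_intersection` is a hypothetical helper, and the emulation is an assumption about equivalent behavior for simple patterns like these.

```python
import re


def matches_intersection(left, right, term):
    """Emulate `left&right`: both patterns must match the entire term."""
    return (re.fullmatch(left, term) is not None
            and re.fullmatch(right, term) is not None)


print(matches_intersection("aaa.+", ".+bbb", "aaabbb"))  # True
print(matches_intersection("aaa", "bbb", "aaabbb"))      # False
```

The second call fails because neither `aaa` nor `bbb` matches the whole term on its own.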
Any string::
`ANYSTRING`::
+
--
Enables the `@` operator. You can use `@` to match any entire
string.
The at sign `"@"` matches any string in its entirety. This could be combined
with the intersection and complement above to express ``everything except''.
For instance:
You can combine the `@` operator with the `&` and `~` operators to express
"everything except" logic. For example:
@&~(foo.+) # anything except string beginning with "foo"
Enabled with the `ANYSTRING` or `ALL` flags.
....
@&~(abc.+) # matches everything except terms that begin with 'abc' followed by at least one more character
....
--
[float]
[[regexp-unsupported-operators]]
=== Unsupported operators
Lucene's regular expression engine does not support anchor operators, such as
`^` (beginning of line) or `$` (end of line). To match a term, the regular
expression must match the entire string.
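
The effect of this implicit anchoring can be compared with unanchored matching in another engine. The following Python sketch uses the standard `re` module, which is not Lucene's engine: `re.search` finds a match anywhere in the string, while `re.fullmatch` requires the whole string to match, which is how the text above describes Lucene's behavior. The function names are hypothetical.

```python
import re


def unanchored_match(pattern, term):
    """Match anywhere in the term, as many engines do by default."""
    return re.search(pattern, term) is not None


def whole_term_match(pattern, term):
    """Match only if the pattern covers the entire term."""
    return re.fullmatch(pattern, term) is not None


term = "abcde"
print(unanchored_match("abcd", term))  # True
print(whole_term_match("abcd", term))  # False
print(whole_term_match("ab.*", term))  # True
```

To match a prefix or substring of a term, add `.*` before or after the pattern instead of relying on `^` or `$`.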


@ -49,7 +49,7 @@ The value specified in the field rule can be one of the following types:
| Simple String | Exactly matches the provided value. | "esadmin"
| Wildcard String | Matches the provided value using a wildcard. | "*,dc=example,dc=com"
| Regular Expression | Matches the provided value using a
{ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp]. | "/.\*-admin[0-9]*/"
{ref}/regexp-syntax.html[Lucene regexp]. | "/.\*-admin[0-9]*/"
| Number | Matches an equivalent numerical value. | 7
| Null | Matches a null or missing value. | null
| Array | Tests each element in the array in


@ -132,7 +132,7 @@ Please take time to review these policies whenever your system architecture chan
A policy is a named set of filter rules. Each filter rule applies to a single event attribute,
one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines
a list of {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp], *any* of which has to match the value of the audit
a list of {ref}/regexp-syntax.html[Lucene regexp], *any* of which has to match the value of the audit
event attribute for the rule to match.
A policy matches an event if *all* the rules comprising it match the event.
An audit event is ignored, therefore not printed, if it matches *any* policy. All other