[DOCS] Rewrite `regexp` query (#42711)

2019-07-24 08:37:37 -04:00 · 2019-07-24 08:37:37 -04:00 · ad7c164dd0
parent bfb2e323e9
commit ad7c164dd0
6 changed files with 212 additions and 283 deletions
--- a/docs/reference/index-modules.asciidoc
+++ b/docs/reference/index-modules.asciidoc
@@ -205,6 +205,7 @@ specific index module:
    The maximum number of terms that can be used in Terms Query.
    Defaults to `65536`.

+[[index-max-regex-length]]
 `index.max_regex_length`::

    The maximum length of regex that can be used in Regexp Query.
--- a/docs/reference/query-dsl.asciidoc
+++ b/docs/reference/query-dsl.asciidoc
@@ -48,3 +48,5 @@ include::query-dsl/term-level-queries.asciidoc[]
 include::query-dsl/minimum-should-match.asciidoc[]

 include::query-dsl/multi-term-rewrite.asciidoc[]
+
+include::query-dsl/regexp-syntax.asciidoc[]
--- a/docs/reference/query-dsl/regexp-query.asciidoc
+++ b/docs/reference/query-dsl/regexp-query.asciidoc
@@ -4,98 +4,86 @@
 <titleabbrev>Regexp</titleabbrev>
 ++++

-The `regexp` query allows you to use regular expression term queries.
-See <<regexp-syntax>> for details of the supported regular expression language.
-The "term queries" in that first sentence means that Elasticsearch will apply
-the regexp to the terms produced by the tokenizer for that field, and not
-to the original text of the field.
+Returns documents that contain terms matching a
+https://en.wikipedia.org/wiki/Regular_expression[regular expression].

-*Note*: The performance of a `regexp` query heavily depends on the
-regular expression chosen. Matching everything like `.*` is very slow as
-well as using lookaround regular expressions. If possible, you should
-try to use a long prefix before your regular expression starts. Wildcard
-matchers like `.*?+` will mostly lower performance.
+A regular expression is a way to match patterns in data using placeholder
+characters, called operators. For a list of operators supported by the
+`regexp` query, see <<regexp-syntax, Regular expression syntax>>.
+
+[[regexp-query-ex-request]]
+==== Example request
+
+The following search returns documents where the `user` field contains any term
+that begins with `k` and ends with `y`. The `.*` operators match any
+characters of any length, including no characters. Matching
+terms can include `ky`, `kay`, and `kimchy`.

 [source,js]
--------------------------------------------------
+----
 GET /_search
 {
    "query": {
-        "regexp":{
-            "name.first": "s.*y"
-        }
-    }
-}
--------------------------------------------------
-// CONSOLE
-
-Boosting is also supported
-
-[source,js]
--------------------------------------------------
-GET /_search
-{
-    "query": {
-        "regexp":{
-            "name.first":{
-                "value":"s.*y",
-                "boost":1.2
+        "regexp": {
+            "user": {
+                "value": "k.*y",
+                "flags" : "ALL",
+                "max_determinized_states": 10000,
+                "rewrite": "constant_score"
            }
        }
    }
 }
--------------------------------------------------
+----
 // CONSOLE

-You can also use special flags

-[source,js]
--------------------------------------------------
-GET /_search
-{
-    "query": {
-        "regexp":{
-            "name.first": {
-                "value": "s.*y",
-                "flags" : "INTERSECTION|COMPLEMENT|EMPTY"
-            }
-        }
-    }
-}
--------------------------------------------------
-// CONSOLE
+[[regexp-top-level-params]]
+==== Top-level parameters for `regexp`
+`<field>`::
+(Required, object) Field you wish to search.

-Possible flags are `ALL` (default), `ANYSTRING`, `COMPLEMENT`,
-`EMPTY`, `INTERSECTION`, `INTERVAL`, or `NONE`. Please check the
-http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html[Lucene
-documentation] for their meaning
+[[regexp-query-field-params]]
+==== Parameters for `<field>`
+`value`::
+(Required, string) Regular expression for terms you wish to find in the provided
+`<field>`. For a list of supported operators, see <<regexp-syntax, Regular
+expression syntax>>.
+
+--
+By default, regular expressions are limited to 1,000 characters. You can change
+this limit using the <<index-max-regex-length, `index.max_regex_length`>>
+setting.

-Regular expressions are dangerous because it's easy to accidentally
-create an innocuous looking one that requires an exponential number of
-internal determinized automaton states (and corresponding RAM and CPU)
-for Lucene to execute.  Lucene prevents these using the
-`max_determinized_states` setting (defaults to 10000).  You can raise
-this limit to allow more complex regular expressions to execute.
+[WARNING]
+=====
+The performance of the `regexp` query can vary based on the regular expression
+provided. To improve performance, avoid using wildcard patterns, such as `.*` or
+`.*?+`, without a prefix or suffix.
+=====
+--

-[source,js]
--------------------------------------------------
-GET /_search
-{
-    "query": {
-        "regexp":{
-            "name.first": {
-                "value": "s.*y",
-                "flags" : "INTERSECTION|COMPLEMENT|EMPTY",
-                "max_determinized_states": 20000
-            }
-        }
-    }
-}
--------------------------------------------------
-// CONSOLE
+`flags`::
+(Optional, string) Enables optional operators for the regular expression. For
+valid values and more information, see <<regexp-optional-operators, Regular
+expression syntax>>.

-NOTE: By default the maximum length of regex string allowed in a Regexp Query 
-is limited to 1000. You can update the `index.max_regex_length` index setting 
-to bypass this limit.
+`max_determinized_states`::
+
+--
+(Optional, integer) Maximum number of
+https://en.wikipedia.org/wiki/Deterministic_finite_automaton[automaton states]
+required for the query. Default is `10000`.

-include::regexp-syntax.asciidoc[]
+{es} uses https://lucene.apache.org/core/[Apache Lucene] internally to parse
+regular expressions. Lucene converts each regular expression to a finite
+automaton containing a number of determinized states.
+
+You can use this parameter to prevent that conversion from unintentionally
+consuming too many resources. You may need to increase this limit to run complex
+regular expressions.
+--
+
+`rewrite`::
+(Optional, string) Method used to rewrite the query. For valid values and more
+information, see the <<query-dsl-multi-term-rewrite, `rewrite` parameter>>.
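The request body documented in this file can be assembled programmatically. The following is a minimal Python sketch, not part of the commit: `build_regexp_query` is a hypothetical helper, and its client-side length check only mirrors the documented default of `index.max_regex_length` (1,000 characters), since the real limit is enforced by {es} per index.

```python
import json

# Documented default for the `index.max_regex_length` index setting.
DEFAULT_MAX_REGEX_LENGTH = 1000


def build_regexp_query(field, value, flags=("ALL",),
                       max_determinized_states=10000,
                       rewrite="constant_score",
                       max_regex_length=DEFAULT_MAX_REGEX_LENGTH):
    """Build a search body for a `regexp` query (hypothetical helper)."""
    # Client-side sanity check only; the server applies the actual limit.
    if len(value) > max_regex_length:
        raise ValueError(
            f"regex is {len(value)} characters; limit is {max_regex_length}"
        )
    return {
        "query": {
            "regexp": {
                field: {
                    "value": value,
                    # Multiple flags are joined with a `|` separator.
                    "flags": "|".join(flags),
                    "max_determinized_states": max_determinized_states,
                    "rewrite": rewrite,
                }
            }
        }
    }


body = build_regexp_query("user", "k.*y")
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the example request in the diff and could be sent as the body of `GET /_search` with any HTTP client.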
--- a/docs/reference/query-dsl/regexp-syntax.asciidoc
+++ b/docs/reference/query-dsl/regexp-syntax.asciidoc
@@ -1,286 +1,224 @@
 [[regexp-syntax]]
-==== Regular expression syntax
+== Regular expression syntax

-Regular expression queries are supported by the `regexp` and the `query_string`
-queries.  The Lucene regular expression engine
-is not Perl-compatible but supports a smaller range of operators.
+A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
+match patterns in data using placeholder characters, called operators.

-[NOTE]
-=====
-We will not attempt to explain regular expressions, but
-just explain the supported operators.
-=====
+{es} supports regular expressions in the following queries:

-===== Standard operators
+* <<query-dsl-regexp-query, `regexp`>>
+* <<query-dsl-query-string-query, `query_string`>>

-Anchoring::
-+
--
+{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
+engine to parse these queries.

-Most regular expression engines allow you to match any part of a string.
-If you want the regexp pattern to start at the beginning of the string or
-finish at the end of the string, then you have to _anchor_ it specifically,
-using `^` to indicate the beginning or `$` to indicate the end.
-
-Lucene's patterns are always anchored.  The pattern provided must match
-the entire string. For string `"abcde"`:
-
-    ab.*     # match
-    abcd     # no match
-
--
-
-Allowed characters::
-+
--
-
-Any Unicode characters may be used in the pattern, but certain characters
-are reserved and must be escaped.  The standard reserved characters are:
+[float]
+[[regexp-reserved-characters]]
+=== Reserved characters
+Lucene's regular expression engine supports all Unicode characters. However, the
+following characters are reserved as operators:

 ....
 . ? + * | { } [ ] ( ) " \
 ....

-If you enable optional features (see below) then these characters may
-also be reserved:
+Depending on the <<regexp-optional-operators, optional operators>> enabled, the
+following characters may also be reserved:

-    # @ & < >  ~
+....
+# @ & < >  ~
+....

-Any reserved character can be escaped with a backslash `"\*"` including
-a literal backslash character: `"\\"`
+To use one of these characters literally, escape it with a preceding
+backslash or surround it with double quotes. For example:

-Additionally, any characters (except double quotes) are interpreted literally
-when surrounded by double quotes:
-
-    john"@smith.com"
+....
+\@                  # renders as a literal '@'
+\\                  # renders as a literal '\'
+"john@smith.com"    # renders as 'john@smith.com'
+....
    

--
+[float]
+[[regexp-standard-operators]]
+=== Standard operators

-Match any character::
+Lucene's regular expression engine does not use the
+https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
+Compatible Regular Expressions (PCRE)] library, but it does support the
+following standard operators.
+
+`.`::
 +
 --
+Matches any character. For example:

-The period `"."` can be used to represent any character.  For string `"abcde"`:
-
-    ab...   # match
-    a.c.e   # match
-
+....
+ab.     # matches 'aba', 'abb', 'abz', etc.
+....
 --

-One-or-more::
+`?`::
 +
 --
+Repeat the preceding character zero or one times. Often used to make the
+preceding character optional. For example:

-The plus sign `"+"` can be used to repeat the preceding shortest pattern
-once or more times. For string `"aaabbb"`:
-
-    a+b+        # match
-    aa+bb+      # match
-    a+.+        # match
-    aa+bbb+     # match
-
+....
+abc?     # matches 'ab' and 'abc'
+....
 --

-Zero-or-more::
+`+`::
 +
 --
+Repeat the preceding character one or more times. For example:

-The asterisk `"*"` can be used to match the preceding shortest pattern
-zero-or-more times.  For string `"aaabbb`":
-
-    a*b*        # match
-    a*b*c*      # match
-    .*bbb.*     # match
-    aaa*bbb*    # match
-
+....
+ab+     # matches 'ab', 'abb', 'abbb', etc.
+....
 --

-Zero-or-one::
+`*`::
 +
 --
+Repeat the preceding character zero or more times. For example:

-The question mark `"?"` makes the preceding shortest pattern optional. It
-matches zero or one times.  For string `"aaabbb"`:
-
-    aaa?bbb?    # match
-    aaaa?bbbb?  # match
-    .....?.?    # match
-    aa?bb?      # no match
-
+....
+ab*     # matches 'a', 'ab', 'abb', 'abbb', etc.
+....
 --

-Min-to-max::
+`{}`::
 +
 --
+Minimum and maximum number of times the preceding character can repeat. For
+example:

-Curly brackets `"{}"` can be used to specify a minimum and (optionally)
-a maximum number of times the preceding shortest pattern can repeat.  The
-allowed forms are:
-
-    {5}     # repeat exactly 5 times
-    {2,5}   # repeat at least twice and at most 5 times
-    {2,}    # repeat at least twice
-
-For string `"aaabbb"`:
-
-    a{3}b{3}        # match
-    a{2,4}b{2,4}    # match
-    a{2,}b{2,}      # match
-    .{3}.{3}        # match
-    a{4}b{4}        # no match
-    a{4,6}b{4,6}    # no match
-    a{4,}b{4,}      # no match
-
+....
+a{2}    # matches 'aa'
+a{2,4}  # matches 'aa', 'aaa', and 'aaaa'
+a{2,}   # matches 'a' repeated two or more times
+....
 --

-Grouping::
+`|`::
 +
 --
-
-Parentheses `"()"` can be used to form sub-patterns. The quantity operators
-listed above operate on the shortest previous pattern, which can be a group.
-For string `"ababab"`:
-
-    (ab)+       # match
-    ab(ab)+     # match
-    (..)+       # match
-    (...)+      # no match
-    (ab)*       # match
-    abab(ab)?   # match
-    ab(ab)?     # no match
-    (ab){3}     # match
-    (ab){1,2}   # no match
-
+OR operator. The match will succeed if the longest pattern on either the left
+side OR the right side matches. For example:
+
+....
+abc|xyz  # matches 'abc' and 'xyz'
+....
 --

-Alternation::
+`( … )`::
 +
 --
+Forms a group. You can use a group to treat part of the expression as a single
+character. For example:

-The pipe symbol `"|"` acts as an OR operator. The match will succeed if
-the pattern on either the left-hand side OR the right-hand side matches.
-The alternation applies to the _longest pattern_, not the shortest.
-For string `"aabb"`:
-
-    aabb|bbaa   # match
-    aacc|bb     # no match
-    aa(cc|bb)   # match
-    a+|b+       # no match
-    a+b+|b+a+   # match
-    a+(b|c)+    # match
-
+....
+abc(def)?  # matches 'abc' and 'abcdef' but not 'abcd'
+....
 --

-Character classes::
+`[ … ]`::
 +
 --
+Match one of the characters in the brackets. For example:

-Ranges of potential characters may be represented as character classes
-by enclosing them in square brackets `"[]"`. A leading `^`
-negates the character class. The allowed forms are:
+....
+[abc]   # matches 'a', 'b', or 'c'
+....

-    [abc]   # 'a' or 'b' or 'c'
-    [a-c]   # 'a' or 'b' or 'c'
-    [-abc]  # '-' or 'a' or 'b' or 'c'
-    [abc\-] # '-' or 'a' or 'b' or 'c'
-    [^abc]  # any character except 'a' or 'b' or 'c'
-    [^a-c]  # any character except 'a' or 'b' or 'c'
-    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
-    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'
+Inside the brackets, `-` indicates a range unless `-` is the first character or
+escaped. For example:

-Note that the dash `"-"` indicates a range of characters, unless it is
-the first character or if it is escaped with a backslash.
+....
+[a-c]   # matches 'a', 'b', or 'c'
+[-abc]  # '-' is first character. Matches '-', 'a', 'b', or 'c'
+[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
+....

-For string `"abcd"`:
-
-    ab[cd]+     # match
-    [a-d]+      # match
-    [^a-d]+     # no match
+A `^` before a character in the brackets negates the character or range. For
+example:

+....
+[^abc]      # matches any character except 'a', 'b', or 'c'
+[^a-c]      # matches any character except 'a', 'b', or 'c'
+[^-abc]     # matches any character except '-', 'a', 'b', or 'c'
+[^abc\-]    # matches any character except 'a', 'b', 'c', or '-'
+....
 --

-===== Optional operators
+[float]
+[[regexp-optional-operators]]
+=== Optional operators

-These operators are available by default as the `flags` parameter defaults to `ALL`.
-Different flag combinations (concatenated with `"|"`) can be used to enable/disable
-specific operators:
+You can use the `flags` parameter to enable more optional operators for
+Lucene's regular expression engine.

-    {
-        "regexp": {
-            "username": {
-                "value": "john~athon<1-5>",
-                "flags": "COMPLEMENT|INTERVAL"
-            }
-        }
-    }
+To enable multiple operators, use a `|` separator. For example, a `flags` value
+of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.

-Complement::
+[float]
+==== Valid values 
+
+`ALL` (Default)::
+Enables all optional operators.
+
+`COMPLEMENT`::
 +
 --
+Enables the `~` operator. You can use `~` to negate the shortest following
+pattern. For example:

-The complement is probably the most useful option. The shortest pattern that
-follows a tilde `"~"` is negated.  For instance, `"ab~cd" means:
-
-* Starts with `a`
-* Followed by `b`
-* Followed by a string of any length that is anything but `c`
-* Ends with `d`
-
-For the string `"abcdef"`:
-
-    ab~df     # match
-    ab~cf     # match
-    ab~cdef   # no match
-    a~(cb)def # match
-    a~(bc)def # no match
-
-Enabled with the `COMPLEMENT` or `ALL` flags.
-
+....
+a~bc   # matches 'adc' and 'aec' but not 'abc'
+....
 --

-Interval::
+`INTERVAL`::
 +
 --
+Enables the `<>` operators. You can use `<>` to match a numeric range. For
+example:

-The interval option enables the use of numeric ranges, enclosed by angle
-brackets `"<>"`. For string: `"foo80"`:
-
-    foo<1-100>     # match
-    foo<01-100>    # match
-    foo<001-100>   # no match
-
-Enabled with the `INTERVAL` or `ALL` flags.
-
-
+....
+foo<1-100>      # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
+foo<01-100>     # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
+....
 --

-Intersection::
+`INTERSECTION`::
 +
 --
+Enables the `&` operator, which acts as an AND operator. The match will succeed
+if patterns on both the left side AND the right side match. For example:

-The ampersand `"&"` joins two patterns in a way that both of them have to
-match. For string `"aaabbb"`:
-
-    aaa.+&.+bbb     # match
-    aaa&bbb         # no match
-
-Using this feature usually means that you should rewrite your regular
-expression.
-
-Enabled with the `INTERSECTION` or `ALL` flags.
-
+....
+aaa.+&.+bbb  # matches 'aaabbb'
+....
 --

-Any string::
+`ANYSTRING`::
 +
 --
+Enables the `@` operator. You can use `@` to match any entire
+string.

-The at sign `"@"` matches any string in its entirety.  This could be combined
-with the intersection and complement above to express ``everything except''.
-For instance:
+You can combine the `@` operator with the `&` and `~` operators to create
+"everything except" logic. For example:

-    @&~(foo.+)      # anything except string beginning with "foo"
-
-Enabled with the `ANYSTRING` or `ALL` flags.
+....
+@&~(abc.+)  # matches everything except terms beginning with 'abc'
+....
 --
+
+[float]
+[[regexp-unsupported-operators]]
+=== Unsupported operators
+Lucene's regular expression engine does not support anchor operators, such as
+`^` (beginning of line) or `$` (end of line). To match a term, the regular
+expression must match the entire string.
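The whole-term matching rule and the standard operators documented in this file can be tried out locally. The sketch below is an approximation, not Lucene itself: it uses Python's `re.fullmatch`, which requires the pattern to cover the entire string and behaves like the documented semantics for the standard operators shown here. `matches_whole_term` is a hypothetical helper, and the optional operators (`~`, `<>`, `&`, `@`) have no equivalent in Python's `re` module, so they are not covered.

```python
import re


def matches_whole_term(pattern, term):
    """Return True if `pattern` matches the entire `term` (hypothetical helper).

    Approximates Lucene's always-anchored matching with re.fullmatch.
    """
    return re.fullmatch(pattern, term) is not None


# (pattern, term, expected result) for the standard operators.
examples = [
    (r"k.*y", "ky", True),          # `.*` may match no characters
    (r"k.*y", "kimchy", True),
    (r"ab.*", "abcde", True),
    (r"abcd", "abcde", False),      # a partial match is not enough
    (r"abc?", "ab", True),          # `?`: zero or one
    (r"ab+", "ab", True),           # `+`: one or more
    (r"ab+", "a", False),
    (r"ab*", "a", True),            # `*`: zero or more
    (r"a{2,4}", "aaaaa", False),    # `{}`: at most four here
    (r"abc|xyz", "xyz", True),      # `|`: either side
    (r"abc(def)?", "abcdef", True),
    (r"abc(def)?", "abcd", False),  # the group is optional as a whole
    (r"[abc\-]", "-", True),        # escaped `-` inside brackets
    (r"[^a-c]", "d", True),         # negated range
]

for pattern, term, expected in examples:
    result = matches_whole_term(pattern, term)
    print(f"{pattern!r:14} {term!r:10} {result}")
    assert result == expected
```

Because the match is always anchored, a pattern such as `abcd` does not match the term `abcde`; append `.*` when a prefix match is intended.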
--- a/x-pack/docs/en/rest-api/security/role-mapping-resources.asciidoc
+++ b/x-pack/docs/en/rest-api/security/role-mapping-resources.asciidoc
@@ -49,7 +49,7 @@ The value specified in the field rule can be one of the following types:
 | Simple String      | Exactly matches the provided value.                             | "esadmin"
 | Wildcard String    | Matches the provided value using a wildcard.                    | "*,dc=example,dc=com"
 | Regular Expression | Matches the provided value using a
-                       {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp]. | "/.\*-admin[0-9]*/"
+                       {ref}/regexp-syntax.html[Lucene regexp]. | "/.\*-admin[0-9]*/"
 | Number             | Matches an equivalent numerical value.                          | 7
 | Null               | Matches a null or missing value.                                | null
 | Array              | Tests each element in the array in
--- a/x-pack/docs/en/security/auditing/output-logfile.asciidoc
+++ b/x-pack/docs/en/security/auditing/output-logfile.asciidoc
@@ -132,7 +132,7 @@ Please take time to review these policies whenever your system architecture chan

 A policy is a named set of filter rules. Each filter rule applies to a single event attribute,
 one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines
-a list of {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp], *any* of which has to match the value of the audit
+a list of {ref}/regexp-syntax.html[Lucene regexp], *any* of which has to match the value of the audit
 event attribute for the rule to match.
 A policy matches an event if *all* the rules comprising it match the event.
 An audit event is ignored, therefore not printed, if it matches *any* policy. All other