[DOCS] Rewrite `regexp` query (#42711)

2019-07-24 08:37:37 -04:00 · 2019-07-24 08:37:37 -04:00 · ad7c164dd0
parent bfb2e323e9
commit ad7c164dd0
6 changed files with 212 additions and 283 deletions
--- a/docs/reference/index-modules.asciidoc
+++ b/docs/reference/index-modules.asciidoc
@@ -205,6 +205,7 @@ specific index module:
    The maximum number of terms that can be used in Terms Query.
    Defaults to `65536`.
 [[index-max-regex-length]]
 `index.max_regex_length`::
    The maximum length of regex that can be used in Regexp Query.
--- a/docs/reference/query-dsl.asciidoc
+++ b/docs/reference/query-dsl.asciidoc
@@ -47,4 +47,6 @@ include::query-dsl/term-level-queries.asciidoc[]
 include::query-dsl/minimum-should-match.asciidoc[]
-include::query-dsl/multi-term-rewrite.asciidoc[]
+include::query-dsl/multi-term-rewrite.asciidoc[]
 include::query-dsl/regexp-syntax.asciidoc[]
--- a/docs/reference/query-dsl/regexp-query.asciidoc
+++ b/docs/reference/query-dsl/regexp-query.asciidoc
@@ -4,98 +4,86 @@
 <titleabbrev>Regexp</titleabbrev>
 ++++
-The `regexp` query allows you to use regular expression term queries.
+Returns documents that contain terms matching a
-See <<regexp-syntax>> for details of the supported regular expression language.
+https://en.wikipedia.org/wiki/Regular_expression[regular expression].
 The "term queries" in that first sentence means that Elasticsearch will apply
 the regexp to the terms produced by the tokenizer for that field, and not
 to the original text of the field.
-*Note*: The performance of a `regexp` query heavily depends on the
+A regular expression is a way to match patterns in data using placeholder
-regular expression chosen. Matching everything like `.*` is very slow as
+characters, called operators. For a list of operators supported by the
-well as using lookaround regular expressions. If possible, you should
+`regexp` query, see <<regexp-syntax, Regular expression syntax>>.
-try to use a long prefix before your regular expression starts. Wildcard
+
-matchers like `.*?+` will mostly lower performance.
+[[regexp-query-ex-request]]
 ==== Example request
 The following search returns documents where the `user` field contains any term
 that begins with `k` and ends with `y`. The `.*` operators match any
 characters of any length, including no characters. Matching
 terms can include `ky`, `kay`, and `kimchy`.
 [source,js]
--------------------------------------------------
+----
 GET /_search
 {
    "query": {
-        "regexp":{
+        "regexp": {
-            "name.first": "s.*y"
+            "user": {
-        }
+                "value": "k.*y",
-    }
+                "flags" : "ALL",
-}
+                "max_determinized_states": 10000,
--------------------------------------------------
+                "rewrite": "constant_score"
 // CONSOLE
 Boosting is also supported
 [source,js]
 --------------------------------------------------
 GET /_search
 {
    "query": {
        "regexp":{
            "name.first":{
                "value":"s.*y",
                "boost":1.2
            }
        }
    }
 }
--------------------------------------------------
+----
 // CONSOLE
 You can also use special flags
-[source,js]
+[[regexp-top-level-params]]
--------------------------------------------------
+==== Top-level parameters for `regexp`
-GET /_search
+`<field>`::
-{
+(Required, object) Field you wish to search.
    "query": {
        "regexp":{
            "name.first": {
                "value": "s.*y",
                "flags" : "INTERSECTION|COMPLEMENT|EMPTY"
            }
        }
    }
 }
 --------------------------------------------------
 // CONSOLE
-Possible flags are `ALL` (default), `ANYSTRING`, `COMPLEMENT`,
+[[regexp-query-field-params]]
-`EMPTY`, `INTERSECTION`, `INTERVAL`, or `NONE`. Please check the
+==== Parameters for `<field>`
-http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html[Lucene
+`value`::
-documentation] for their meaning
+(Required, string) Regular expression for terms you wish to find in the provided
 `<field>`. For a list of supported operators, see <<regexp-syntax, Regular
 expression syntax>>.
 +
 --
 By default, regular expressions are limited to 1,000 characters. You can change
 this limit using the <<index-max-regex-length, `index.max_regex_length`>>
 setting.
-Regular expressions are dangerous because it's easy to accidentally
+[WARNING]
-create an innocuous looking one that requires an exponential number of
+=====
-internal determinized automaton states (and corresponding RAM and CPU)
+The performance of the `regexp` query can vary based on the regular expression
-for Lucene to execute.  Lucene prevents these using the
+provided. To improve performance, avoid using wildcard patterns, such as `.*` or
-`max_determinized_states` setting (defaults to 10000).  You can raise
+`.*?+`, without a prefix or suffix.
-this limit to allow more complex regular expressions to execute.
+=====
 --
-[source,js]
+`flags`::
--------------------------------------------------
+(Optional, string) Enables optional operators for the regular expression. For
-GET /_search
+valid values and more information, see <<regexp-optional-operators, Regular
-{
+expression syntax>>.
    "query": {
        "regexp":{
            "name.first": {
                "value": "s.*y",
                "flags" : "INTERSECTION|COMPLEMENT|EMPTY",
                "max_determinized_states": 20000
            }
        }
    }
 }
 --------------------------------------------------
 // CONSOLE
-NOTE: By default the maximum length of regex string allowed in a Regexp Query 
+`max_determinized_states`::
-is limited to 1000. You can update the `index.max_regex_length` index setting 
+
-to bypass this limit.
+--
 (Optional, integer) Maximum number of
 https://en.wikipedia.org/wiki/Deterministic_finite_automaton[automaton states]
 required for the query. Default is `10000`.
-include::regexp-syntax.asciidoc[]
+{es} uses https://lucene.apache.org/core/[Apache Lucene] internally to parse
 regular expressions. Lucene converts each regular expression to a finite
 automaton containing a number of determinized states.
 You can use this parameter to prevent that conversion from unintentionally
 consuming too many resources. You may need to increase this limit to run complex
 regular expressions.
 --
 `rewrite`::
 (Optional, string) Method used to rewrite the query. For valid values and more
 information, see the <<query-dsl-multi-term-rewrite, `rewrite` parameter>>.
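
Reviewer sketch (not part of the commit): the parameters documented in the rewritten `regexp-query.asciidoc` can be assembled into a request body as shown below. This is a minimal illustration only. The helper name `build_regexp_query` and the constant `DEFAULT_MAX_REGEX_LENGTH` are hypothetical, the defaults are the ones quoted in the text above (`ALL`, `10000`, 1,000 characters), and the code only builds the JSON body; it does not contact a cluster.

```python
import json

# Default quoted in the docs for `index.max_regex_length`.
# Assumption: the target index has not overridden this setting.
DEFAULT_MAX_REGEX_LENGTH = 1000


def build_regexp_query(field, value, flags="ALL",
                       max_determinized_states=10000,
                       rewrite="constant_score",
                       max_regex_length=DEFAULT_MAX_REGEX_LENGTH):
    """Return a search body for a `regexp` query (hypothetical helper)."""
    # Mirror the documented length limit on the client side so an
    # over-long pattern fails early instead of being rejected later.
    if len(value) > max_regex_length:
        raise ValueError(
            f"regular expression is {len(value)} characters long; "
            f"the limit is {max_regex_length}"
        )
    return {
        "query": {
            "regexp": {
                field: {
                    "value": value,
                    "flags": flags,
                    "max_determinized_states": max_determinized_states,
                    "rewrite": rewrite,
                }
            }
        }
    }


if __name__ == "__main__":
    # Same request as the example in the docs: terms in `user`
    # that begin with `k` and end with `y`.
    print(json.dumps(build_regexp_query("user", "k.*y"), indent=2))
```

The printed body can be sent to `GET /_search` with any HTTP client. Raising `max_determinized_states` is the documented way to allow more complex expressions.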
--- a/docs/reference/query-dsl/regexp-syntax.asciidoc
+++ b/docs/reference/query-dsl/regexp-syntax.asciidoc
@@ -1,286 +1,224 @@
 [[regexp-syntax]]
-==== Regular expression syntax
+== Regular expression syntax
-Regular expression queries are supported by the `regexp` and the `query_string`
+A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
-queries.  The Lucene regular expression engine
+match patterns in data using placeholder characters, called operators.
 is not Perl-compatible but supports a smaller range of operators.
-[NOTE]
+{es} supports regular expressions in the following queries:
 =====
 We will not attempt to explain regular expressions, but
 just explain the supported operators.
 =====
-===== Standard operators
+* <<query-dsl-regexp-query, `regexp`>>
 * <<query-dsl-query-string-query, `query_string`>>
-Anchoring::
+{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
-+
+engine to parse these queries.
 --
-Most regular expression engines allow you to match any part of a string.
+[float]
-If you want the regexp pattern to start at the beginning of the string or
+[[regexp-reserved-characters]]
-finish at the end of the string, then you have to _anchor_ it specifically,
+=== Reserved characters
-using `^` to indicate the beginning or `$` to indicate the end.
+Lucene's regular expression engine supports all Unicode characters. However, the
-
+following characters are reserved as operators:
 Lucene's patterns are always anchored.  The pattern provided must match
 the entire string. For string `"abcde"`:
    ab.*     # match
    abcd     # no match
 --
 Allowed characters::
 +
 --
 Any Unicode characters may be used in the pattern, but certain characters
 are reserved and must be escaped.  The standard reserved characters are:
 ....
 . ? + * | { } [ ] ( ) " \
 ....
-If you enable optional features (see below) then these characters may
+Depending on the <<regexp-optional-operators, optional operators>> enabled, the
-also be reserved:
+following characters may also be reserved:
-    # @ & < >  ~
+....
 # @ & < >  ~
 ....
-Any reserved character can be escaped with a backslash `"\*"` including
+To use one of these characters literally, escape it with a preceding
-a literal backslash character: `"\\"`
+backslash or surround it with double quotes. For example:
-Additionally, any characters (except double quotes) are interpreted literally
+....
-when surrounded by double quotes:
+\@                  # renders as a literal '@'
 \\                  # renders as a literal '\'
 "john@smith.com"    # renders as 'john@smith.com'
 ....
-    john"@smith.com"
+[float]
 [[regexp-standard-operators]]
 === Standard operators
 Lucene's regular expression engine does not use the
 https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
 Compatible Regular Expressions (PCRE)] library, but it does support the
 following standard operators.
--
+`.`::
 Match any character::
 +
 --
 Matches any character. For example:
-The period `"."` can be used to represent any character.  For string `"abcde"`:
+....
-
+ab.     # matches 'aba', 'abb', 'abz', etc.
-    ab...   # match
+....
    a.c.e   # match
 --
-One-or-more::
+`?`::
 +
 --
 Repeat the preceding character zero or one times. Often used to make the
 preceding character optional. For example:
-The plus sign `"+"` can be used to repeat the preceding shortest pattern
+....
-once or more times. For string `"aaabbb"`:
+abc?     # matches 'ab' and 'abc'
-
+....
    a+b+        # match
    aa+bb+      # match
    a+.+        # match
    aa+bbb+     # match
 --
-Zero-or-more::
+`+`::
 +
 --
 Repeat the preceding character one or more times. For example:
-The asterisk `"*"` can be used to match the preceding shortest pattern
+....
-zero-or-more times.  For string `"aaabbb`":
+ab+     # matches 'ab', 'abb', 'abbb', etc.
-
+....
    a*b*        # match
    a*b*c*      # match
    .*bbb.*     # match
    aaa*bbb*    # match
 --
-Zero-or-one::
+`*`::
 +
 --
 Repeat the preceding character zero or more times. For example:
-The question mark `"?"` makes the preceding shortest pattern optional. It
+....
-matches zero or one times.  For string `"aaabbb"`:
+ab*     # matches 'a', 'ab', 'abb', 'abbb', etc.
-
+....
    aaa?bbb?    # match
    aaaa?bbbb?  # match
    .....?.?    # match
    aa?bb?      # no match
 --
-Min-to-max::
+`{}`::
 +
 --
 Minimum and maximum number of times the preceding character can repeat. For
 example:
-Curly brackets `"{}"` can be used to specify a minimum and (optionally)
+....
-a maximum number of times the preceding shortest pattern can repeat.  The
+a{2}    # matches 'aa'
-allowed forms are:
+a{2,4}  # matches 'aa', 'aaa', and 'aaaa'
-
+a{2,}   # matches 'a' repeated two or more times
-    {5}     # repeat exactly 5 times
+....
    {2,5}   # repeat at least twice and at most 5 times
    {2,}    # repeat at least twice
 For string `"aaabbb"`:
    a{3}b{3}        # match
    a{2,4}b{2,4}    # match
    a{2,}b{2,}      # match
    .{3}.{3}        # match
    a{4}b{4}        # no match
    a{4,6}b{4,6}    # no match
    a{4,}b{4,}      # no match
 --
-Grouping::
+`|`::
 +
 --
-
+OR operator. The match will succeed if the longest pattern on either the left
-Parentheses `"()"` can be used to form sub-patterns. The quantity operators
+side OR the right side matches. For example:
-listed above operate on the shortest previous pattern, which can be a group.
+....
-For string `"ababab"`:
+abc|xyz  # matches 'abc' and 'xyz'
-
+....
    (ab)+       # match
    ab(ab)+     # match
    (..)+       # match
    (...)+      # no match
    (ab)*       # match
    abab(ab)?   # match
    ab(ab)?     # no match
    (ab){3}     # match
    (ab){1,2}   # no match
 --
-Alternation::
+`( … )`::
 +
 --
 Forms a group. You can use a group to treat part of the expression as a single
 character. For example:
-The pipe symbol `"|"` acts as an OR operator. The match will succeed if
+....
-the pattern on either the left-hand side OR the right-hand side matches.
+abc(def)?  # matches 'abc' and 'abcdef' but not 'abcd'
-The alternation applies to the _longest pattern_, not the shortest.
+....
 For string `"aabb"`:
    aabb|bbaa   # match
    aacc|bb     # no match
    aa(cc|bb)   # match
    a+|b+       # no match
    a+b+|b+a+   # match
    a+(b|c)+    # match
 --
-Character classes::
+`[ … ]`::
 +
 --
 Match one of the characters in the brackets. For example:
-Ranges of potential characters may be represented as character classes
+....
-by enclosing them in square brackets `"[]"`. A leading `^`
+[abc]   # matches 'a', 'b', 'c'
-negates the character class. The allowed forms are:
+....
-    [abc]   # 'a' or 'b' or 'c'
+Inside the brackets, `-` indicates a range unless `-` is the first character or
-    [a-c]   # 'a' or 'b' or 'c'
+escaped. For example:
    [-abc]  # '-' or 'a' or 'b' or 'c'
    [abc\-] # '-' or 'a' or 'b' or 'c'
    [^abc]  # any character except 'a' or 'b' or 'c'
    [^a-c]  # any character except 'a' or 'b' or 'c'
    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'
-Note that the dash `"-"` indicates a range of characters, unless it is
+....
-the first character or if it is escaped with a backslash.
+[a-c]   # matches 'a', 'b', or 'c'
 [-abc]  # '-' is first character. Matches '-', 'a', 'b', or 'c'
 [abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
 ....
-For string `"abcd"`:
+A `^` before a character in the brackets negates the character or range. For
-
+example:
    ab[cd]+     # match
    [a-d]+      # match
    [^a-d]+     # no match
 ....
 [^abc]      # matches any character except 'a', 'b', or 'c'
 [^a-c]      # matches any character except 'a', 'b', or 'c'
 [^-abc]     # matches any character except '-', 'a', 'b', or 'c'
 [^abc\-]    # matches any character except 'a', 'b', 'c', or '-'
 ....
 --
-===== Optional operators
+[float]
 [[regexp-optional-operators]]
 === Optional operators
-These operators are available by default as the `flags` parameter defaults to `ALL`.
+You can use the `flags` parameter to enable more optional operators for
-Different flag combinations (concatenated with `"|"`) can be used to enable/disable
+Lucene's regular expression engine.
 specific operators:
-    {
+To enable multiple operators, use a `|` separator. For example, a `flags` value
-        "regexp": {
+of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.
            "username": {
                "value": "john~athon<1-5>",
                "flags": "COMPLEMENT|INTERVAL"
            }
        }
    }
-Complement::
+[float]
 ==== Valid values 
 `ALL` (Default)::
 Enables all optional operators.
 `COMPLEMENT`::
 +
 --
 Enables the `~` operator. You can use `~` to negate the shortest following
 pattern. For example:
-The complement is probably the most useful option. The shortest pattern that
+....
-follows a tilde `"~"` is negated.  For instance, `"ab~cd" means:
+a~bc   # matches 'adc' and 'aec' but not 'abc'
-
+....
 * Starts with `a`
 * Followed by `b`
 * Followed by a string of any length that is anything but `c`
 * Ends with `d`
 For the string `"abcdef"`:
    ab~df     # match
    ab~cf     # match
    ab~cdef   # no match
    a~(cb)def # match
    a~(bc)def # no match
 Enabled with the `COMPLEMENT` or `ALL` flags.
 --
-Interval::
+`INTERVAL`::
 +
 --
 Enables the `<>` operators. You can use `<>` to match a numeric range. For
 example:
-The interval option enables the use of numeric ranges, enclosed by angle
+....
-brackets `"<>"`. For string: `"foo80"`:
+foo<1-100>      # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
-
+foo<01-100>     # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
-    foo<1-100>     # match
+....
    foo<01-100>    # match
    foo<001-100>   # no match
 Enabled with the `INTERVAL` or `ALL` flags.
 --
-Intersection::
+`INTERSECTION`::
 +
 --
 Enables the `&` operator, which acts as an AND operator. The match will succeed
 if patterns on both the left side AND the right side match. For example:
-The ampersand `"&"` joins two patterns in a way that both of them have to
+....
-match. For string `"aaabbb"`:
+aaa.+&.+bbb  # matches 'aaabbb'
-
+....
    aaa.+&.+bbb     # match
    aaa&bbb         # no match
 Using this feature usually means that you should rewrite your regular
 expression.
 Enabled with the `INTERSECTION` or `ALL` flags.
 --
-Any string::
+`ANYSTRING`::
 +
 --
 Enables the `@` operator. You can use `@` to match any entire
 string.
-The at sign `"@"` matches any string in its entirety.  This could be combined
+You can combine the `@` operator with `&` and `~` operators to create an
-with the intersection and complement above to express ``everything except''.
+"everything except" logic. For example:
 For instance:
-    @&~(foo.+)      # anything except string beginning with "foo"
+....
-
+@&~(abc.+)  # matches everything except terms beginning with 'abc'
-Enabled with the `ANYSTRING` or `ALL` flags.
+....
 --
 [float]
 [[regexp-unsupported-operators]]
 === Unsupported operators
 Lucene's regular expression engine does not support anchor operators, such as
 `^` (beginning of line) or `$` (end of line). To match a term, the regular
 expression must match the entire string.
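
Reviewer sketch (not part of the commit): the whole-term matching rule described above can be approximated with Python's `re.fullmatch`. This is an analogy under stated assumptions, not Lucene's engine. Python's `re` module is a different implementation; only the standard operators shown here (`.`, `?`, `+`, `*`, `{}`, `|`, `( )`) behave alike for these simple cases, and the optional operators (`~`, `<>`, `&`, `@`) have no direct equivalent in `re`. The helper name `matches_whole_term` is hypothetical.

```python
import re


def matches_whole_term(pattern, term):
    """Approximate Lucene's implicit anchoring (hypothetical helper).

    `re.fullmatch` succeeds only if the pattern consumes the entire
    string, which mirrors the rule that a regular expression must
    match the whole term.
    """
    return re.fullmatch(pattern, term) is not None


if __name__ == "__main__":
    # (pattern, term) pairs taken from or modeled on the examples above.
    cases = [
        ("ab.*", "abcde"),        # prefix plus wildcard covers the term
        ("abcd", "abcde"),        # pattern stops short of the term
        ("k.*y", "kimchy"),
        ("abc?", "ab"),
        ("ab+", "a"),             # '+' needs at least one 'b'
        ("ab*", "a"),             # '*' allows zero 'b' characters
        ("a{2,4}", "aaaaa"),      # five repeats exceed the maximum
        ("abc(def)?", "abcd"),    # the group is optional as a unit
        ("abc|xyz", "xyz"),
    ]
    for pattern, term in cases:
        print(f"{pattern!r} vs {term!r}: {matches_whole_term(pattern, term)}")
```

An unanchored search such as `re.search("abcd", "abcde")` would succeed in Python, which is exactly the behaviour the `regexp` query does not have.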
--- a/x-pack/docs/en/rest-api/security/role-mapping-resources.asciidoc
+++ b/x-pack/docs/en/rest-api/security/role-mapping-resources.asciidoc
@ -49,7 +49,7 @@ The value specified in the field rule can be one of the following types:
 | Simple String      | Exactly matches the provided value.                             | "esadmin"
 | Wildcard String    | Matches the provided value using a wildcard.                    | "*,dc=example,dc=com"
 | Regular Expression | Matches the provided value using a
-                       {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp]. | "/.\*-admin[0-9]*/"
+                       {ref}/regexp-syntax.html[Lucene regexp]. | "/.\*-admin[0-9]*/"
 | Number             | Matches an equivalent numerical value.                          | 7
 | Null               | Matches a null or missing value.                                | null
 | Array              | Tests each element in the array in
--- a/x-pack/docs/en/security/auditing/output-logfile.asciidoc
+++ b/x-pack/docs/en/security/auditing/output-logfile.asciidoc
@ -132,7 +132,7 @@ Please take time to review these policies whenever your system architecture chan
 A policy is a named set of filter rules. Each filter rule applies to a single event attribute,
 one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines
-a list of {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp], *any* of which has to match the value of the audit
+a list of {ref}/regexp-syntax.html[Lucene regexp], *any* of which has to match the value of the audit
 event attribute for the rule to match.
 A policy matches an event if *all* the rules comprising it match the event.
 An audit event is ignored, therefore not printed, if it matches *any* policy. All other