OpenSearch/docs/reference/query-dsl/regexp-syntax.asciidoc

[[regexp-syntax]]
=== Regular expression syntax

Regular expressions are supported by the `regexp` and the `query_string`
queries.  The Lucene regular expression engine is not Perl-compatible: it
supports a smaller range of operators.

[NOTE]
====
This section does not explain regular expressions in general. It only
describes the supported operators.
====

==== Standard operators

Anchoring::
+
--

Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.

Lucene's patterns are always anchored.  The pattern provided must match
the entire string. For string `"abcde"`:

    ab.*     # match
    abcd     # no match

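The difference can be approximated outside Lucene. The sketch below uses
Python's `re` module, which is a different engine and serves only as a
stand-in: `re.search` behaves like the unanchored engines described above,
while `re.fullmatch` mimics the requirement that the pattern match the
entire string.

```python
import re

text = "abcde"

# Unanchored search: the pattern may match any part of the string.
unanchored = re.search(r"abcd", text) is not None

# Whole-string match: the closest Python analogue of implicit anchoring.
anchored_partial = re.fullmatch(r"abcd", text) is not None
anchored_full = re.fullmatch(r"ab.*", text) is not None

print(unanchored, anchored_partial, anchored_full)
```

Only the last two results correspond to the Lucene examples above. The first
one shows what an unanchored engine would report for the same pattern.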
--

Allowed characters::
+
--

Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped.  The standard reserved characters are:

....
. ? + * | { } [ ] ( ) " \
....

If you enable optional features (see below) then these characters may
also be reserved:

    # @ & < > ~

Any reserved character can be escaped with a backslash, for example `"\*"`.
This includes a literal backslash character: `"\\"`.

Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:

    john"@smith.com"


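When a pattern is built from user input, escaping can be automated. The
following is a minimal sketch in Python. The helper name
`escape_lucene_regexp` is hypothetical and not part of any client library.
It assumes that every character from both reserved lists above should be
escaped, whether or not the optional features are enabled.

```python
# Reserved characters listed above: the standard set plus the optional set.
RESERVED = set('.?+*|{}[]()"\\' + '#@&<>~')


def escape_lucene_regexp(text: str) -> str:
    """Backslash-escape each reserved character so it is matched literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


print(escape_lucene_regexp("john@smith.com"))
```

If the escaped pattern is then embedded in a JSON request body, each
backslash must additionally be escaped for JSON. A JSON serializer normally
takes care of that step.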
--

Match any character::
+
--

The period `"."` can be used to represent any character.  For string `"abcde"`:

    ab...   # match
    a.c.e   # match

--

One-or-more::
+
--

The plus sign `"+"` can be used to repeat the preceding shortest pattern
one or more times. For string `"aaabbb"`:

    a+b+        # match
    aa+bb+      # match
    a+.+        # match
    aa+bbb+     # match

--

Zero-or-more::
+
--

The asterisk `"*"` can be used to match the preceding shortest pattern
zero or more times.  For string `"aaabbb"`:

    a*b*        # match
    a*b*c*      # match
    .*bbb.*     # match
    aaa*bbb*    # match

--

Zero-or-one::
+
--

The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times.  For string `"aaabbb"`:

    aaa?bbb?    # match
    aaaa?bbbb?  # match
    .....?.?    # match
    aa?bb?      # no match

--

Min-to-max::
+
--

Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat.  The
allowed forms are:

    {5}     # repeat exactly 5 times
    {2,5}   # repeat at least twice and at most 5 times
    {2,}    # repeat at least twice

For string `"aaabbb"`:

    a{3}b{3}        # match
    a{2,4}b{2,4}    # match
    a{2,}b{2,}      # match
    .{3}.{3}        # match
    a{4}b{4}        # no match
    a{4,6}b{4,6}    # no match
    a{4,}b{4,}      # no match

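The curly-bracket examples can be checked with a small table-driven test.
The sketch uses Python's `re.fullmatch` as a stand-in for Lucene's
always-anchored matching. The two engines are not the same, but they are
expected to agree on these simple patterns.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Whole-string match, mirroring the implicit anchoring described above."""
    return re.fullmatch(pattern, text) is not None


# (pattern, expected result) pairs copied from the examples above.
examples = [
    ("a{3}b{3}", True),
    ("a{2,4}b{2,4}", True),
    ("a{2,}b{2,}", True),
    (".{3}.{3}", True),
    ("a{4}b{4}", False),
    ("a{4,6}b{4,6}", False),
    ("a{4,}b{4,}", False),
]

results = {pattern: matches(pattern, "aaabbb") for pattern, _ in examples}
for pattern, expected in examples:
    print(f"{pattern:<14} expected={expected} actual={results[pattern]}")
```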
--

Grouping::
+
--

Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:

    (ab)+       # match
    ab(ab)+     # match
    (..)+       # match
    (...)+      # no match
    (ab)*       # match
    abab(ab)?   # match
    ab(ab)?     # no match
    (ab){3}     # match
    (ab){1,2}   # no match

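The effect of grouping on a quantifier can be seen by comparing a grouped
and an ungrouped pattern. This sketch again relies on Python's
`re.fullmatch` as an approximation of Lucene's whole-string matching.

```python
import re

text = "ababab"

# Without a group, `+` repeats only the single preceding character `b`.
ungrouped = re.fullmatch(r"ab+", text) is not None

# With a group, `+` repeats the whole sub-pattern `ab`.
grouped = re.fullmatch(r"(ab)+", text) is not None

# `(ab){1,2}` allows at most two repetitions, so three are rejected.
bounded = re.fullmatch(r"(ab){1,2}", text) is not None

print(ungrouped, grouped, bounded)
```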
--

Alternation::
+
--

The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:

    aabb|bbaa   # match
    aacc|bb     # no match
    aa(cc|bb)   # match
    a+|b+       # no match
    a+b+|b+a+   # match
    a+(b|c)+    # match

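Because alternation applies to the longest pattern, `aacc|bb` is read as
"either `aacc` or `bb`", not as "`aa` followed by `cc` or `bb`". The sketch
below illustrates this with Python's `re.fullmatch`, used only as a stand-in
for Lucene's anchored matching.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Whole-string match, mirroring the implicit anchoring described above."""
    return re.fullmatch(pattern, text) is not None


text = "aabb"

# Each side of the pipe is a complete alternative, and neither equals "aabb".
whole_alternatives = matches(r"aacc|bb", text)

# Parentheses limit the alternation to the part that follows `aa`.
scoped = matches(r"aa(cc|bb)", text)

print(whole_alternatives, scoped)
```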
--

Character classes::
+
--

Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:

    [abc]   # 'a' or 'b' or 'c'
    [a-c]   # 'a' or 'b' or 'c'
    [-abc]  # '-' or 'a' or 'b' or 'c'
    [abc\-] # '-' or 'a' or 'b' or 'c'
    [^abc]  # any character except 'a' or 'b' or 'c'
    [^a-c]  # any character except 'a' or 'b' or 'c'
    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash `"-"` indicates a range of characters, unless it is
the first character in the class (directly after the opening bracket or
after a leading `^`) or is escaped with a backslash.

For string `"abcd"`:

    ab[cd]+     # match
    [a-d]+      # match
    [^a-d]+     # no match

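The handling of the dash can be tried out with the sketch below. It uses
Python's `re` module as a stand-in. Python happens to treat a leading or
escaped dash the same way, but it remains a different engine from Lucene.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Whole-string match, mirroring the implicit anchoring described above."""
    return re.fullmatch(pattern, text) is not None


# A dash between two characters denotes a range, so a literal '-' is rejected.
range_only = matches(r"[a-c]+", "a-c")

# A leading or escaped dash is a literal member of the class.
leading_dash = matches(r"[-abc]+", "a-c")
escaped_dash = matches(r"[abc\-]+", "a-c")

# A negated class rejects every character it lists.
negated = matches(r"[^a-d]+", "abcd")

print(range_only, leading_dash, escaped_dash, negated)
```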
--

==== Optional operators

These operators are enabled by default because the `flags` parameter defaults to `ALL`.
Different flag combinations (concatenated with `"|"`) can be used to enable or disable
specific operators:

    {
        "regexp": {
            "username": {
                "value": "john~athon<1-5>",
                "flags": "COMPLEMENT|INTERVAL"
            }
        }
    }

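When the request body is assembled in code, the flag string can be built by
joining the flag names with a pipe. The following Python sketch produces the
same body as the example above. It only constructs the JSON and does not
send it anywhere. The variable names are illustrative.

```python
import json

# Flag names from this section; joining with "|" enables only these operators.
enabled_flags = ["COMPLEMENT", "INTERVAL"]

query = {
    "regexp": {
        "username": {
            "value": "john~athon<1-5>",
            "flags": "|".join(enabled_flags),
        }
    }
}

print(json.dumps(query, indent=4))
```
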
Complement::
+
--

The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated.  For the string `"abcdef"`:

    ab~df     # match
    ab~cf     # no match
    a~(cd)f   # match
    a~(bc)f   # no match

Enabled with the `COMPLEMENT` or `ALL` flags.

--

Interval::
+
--

The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string `"foo80"`:

    foo<1-100>     # match
    foo<01-100>    # match
    foo<001-100>   # no match

Enabled with the `INTERVAL` or `ALL` flags.

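The three examples suggest a rule for leading zeros: when both bounds are
written with the same number of digits, the number in the string must use
exactly that many digits. The Python sketch below encodes that reading. The
helper `interval_matches` is hypothetical, and the rule is an assumption
inferred from the examples, not a statement of Lucene's implementation.

```python
def interval_matches(lower: str, upper: str, candidate: str) -> bool:
    """Approximate `<lower-upper>` for a candidate made of ASCII digits."""
    if not (candidate.isascii() and candidate.isdigit()):
        return False
    # Assumption: equally long bounds fix the required number of digits.
    if len(lower) == len(upper) and len(candidate) != len(lower):
        return False
    return int(lower) <= int(candidate) <= int(upper)


# The numeric part of "foo80" checked against the three intervals above.
for lower in ("1", "01", "001"):
    print(f"foo<{lower}-100>", interval_matches(lower, "100", "80"))
```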

--

Intersection::
+
--

The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:

    aaa.+&.+bbb     # match
    aaa&bbb         # no match

Needing this feature is usually a sign that the regular expression should
be rewritten.

Enabled with the `INTERSECTION` or `ALL` flags.

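Intersection can be thought of as running both patterns against the whole
string and requiring both to succeed. Python's `re` module has no `&`
operator, so the sketch below emulates it with two `re.fullmatch` calls.
The helper `intersection_matches` is hypothetical.

```python
import re


def intersection_matches(patterns, text: str) -> bool:
    """True only if every pattern matches the entire string."""
    return all(re.fullmatch(p, text) is not None for p in patterns)


text = "aaabbb"

# Emulates `aaa.+&.+bbb`: both sides match the whole string.
both_sides = intersection_matches([r"aaa.+", r".+bbb"], text)

# Emulates `aaa&bbb`: neither side matches the whole string.
short_sides = intersection_matches([r"aaa", r"bbb"], text)

print(both_sides, short_sides)
```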
--

Any string::
+
--

The at sign `"@"` matches any string in its entirety.  This could be combined
with the intersection and complement above to express ``everything except''.
For instance:

    @&~(foo.+)      # anything except string beginning with "foo"

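The same "everything except" logic can be approximated in Python by negating
a whole-string match. This is a sketch only: Python's `re` module has no `@`
or `~` operators, and the function name is hypothetical. Note that the
string `foo` itself is accepted, because `foo.+` requires at least one
character after the prefix.

```python
import re


def anything_except_foo_prefix(text: str) -> bool:
    """Approximate `@&~(foo.+)`: accept whatever `foo.+` does not fully match."""
    return re.fullmatch(r"foo.+", text, flags=re.DOTALL) is None


for candidate in ("foobar", "foo", "bar"):
    print(candidate, anything_except_foo_prefix(candidate))
```
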
Enabled with the `ANYSTRING` or `ALL` flags.
--