[[regexp-syntax]]
==== Regular expression syntax

Regular expressions are supported by the `regexp` and `query_string`
queries.  The Lucene regular expression engine is not Perl-compatible;
it supports a smaller range of operators.

[NOTE]
====
This section does not attempt to explain regular expressions in general;
it only describes the supported operators.
====

===== Standard operators

Anchoring::
+
--

Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.

Lucene's patterns are always anchored.  The pattern provided must match
the entire string. For string `"abcde"`:

    ab.*     # match
    abcd     # no match

--
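
As a rough illustration of what always-anchored matching means, the sketch
below uses Python's `re` module as a stand-in for Lucene. This is an assumption
made purely for illustration; OpenSearch does not use Python's engine.
`re.search` behaves like an unanchored engine, while `re.fullmatch` requires
the pattern to cover the whole string, as Lucene does.

```python
import re

text = "abcde"

# An unanchored engine accepts a match anywhere in the string.
unanchored = re.search(r"abcd", text) is not None

# Lucene-style matching: the pattern must cover the entire string.
anchored_partial = re.fullmatch(r"abcd", text) is not None
anchored_full = re.fullmatch(r"ab.*", text) is not None

print(unanchored, anchored_partial, anchored_full)
```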

Allowed characters::
+
--

Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped.  The standard reserved characters are:

....
. ? + * | { } [ ] ( ) " \
....

If you enable optional features (see below) then these characters may
also be reserved:

    # @ & < >  ~

Any reserved character can be escaped with a backslash, for example `"\*"`.
This includes a literal backslash character: `"\\"`

Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:

    john"@smith.com"


--
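
When a pattern is built from user input, the reserved characters listed above
have to be escaped before the value is sent. The helper below is a minimal
sketch of that step. `escape_lucene_regexp` and `RESERVED` are hypothetical
names, not part of any OpenSearch client, and the set conservatively includes
the characters reserved by the optional operators as well.

```python
# Standard reserved characters plus those reserved by the optional operators.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regexp(text: str) -> str:
    """Backslash-escape every reserved character so `text` is matched literally.

    Hypothetical helper for illustration only.
    """
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


print(escape_lucene_regexp("john@smith.com"))
```

The backslash has to be escaped once more when the pattern is embedded in a
JSON string; a JSON serializer such as `json.dumps` takes care of that.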

Match any character::
+
--

The period `"."` can be used to represent any character.  For string `"abcde"`:

    ab...   # match
    a.c.e   # match

--

One-or-more::
+
--

The plus sign `"+"` can be used to repeat the preceding shortest pattern
one or more times. For string `"aaabbb"`:

    a+b+        # match
    aa+bb+      # match
    a+.+        # match
    aa+bbb+     # no match

--

Zero-or-more::
+
--

The asterisk `"*"` can be used to match the preceding shortest pattern
zero or more times.  For string `"aaabbb"`:

    a*b*        # match
    a*b*c*      # match
    .*bbb.*     # match
    aaa*bbb*    # match

--

Zero-or-one::
+
--

The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times.  For string `"aaabbb"`:

    aaa?bbb?    # match
    aaaa?bbbb?  # match
    .....?.?    # match
    aa?bb?      # no match

--

Min-to-max::
+
--

Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat.  The
allowed forms are:

    {5}     # repeat exactly 5 times
    {2,5}   # repeat at least twice and at most 5 times
    {2,}    # repeat at least twice

For string `"aaabbb"`:

    a{3}b{3}        # match
    a{2,4}b{2,4}    # match
    a{2,}b{2,}      # match
    .{3}.{3}        # match
    a{4}b{4}        # no match
    a{4,6}b{4,6}    # no match
    a{4,}b{4,}      # no match

--

Grouping::
+
--

Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:

    (ab)+       # match
    ab(ab)+     # match
    (..)+       # match
    (...)+      # match
    (ab)*       # match
    abab(ab)?   # match
    ab(ab)?     # no match
    (ab){3}     # match
    (ab){1,2}   # no match

--

Alternation::
+
--

The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:

    aabb|bbaa   # match
    aacc|bb     # no match
    aa(cc|bb)   # match
    a+|b+       # no match
    a+b+|b+a+   # match
    a+(b|c)+    # match

--
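
Because alternation applies to the longest pattern on each side, `aacc|bb`
reads as `(aacc)|(bb)` rather than `aac(c|b)b`. The sketch below makes the
difference visible. It assumes Python's `re.fullmatch` behaves like Lucene for
these simple patterns, which is an illustration aid and not the engine that
OpenSearch runs.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Anchored match: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None


# Without parentheses the pipe splits the whole pattern in two.
whole_sides = matches("aacc|bb", "aabb")   # either "aacc" or "bb"

# Parentheses limit the alternation to the group.
grouped = matches("aa(cc|bb)", "aabb")     # "aa" followed by "cc" or "bb"

# A quantified group repeats the alternation as a unit.
repeated = matches("a+(b|c)+", "aabb")

print(whole_sides, grouped, repeated)
```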

Character classes::
+
--

Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:

    [abc]   # 'a' or 'b' or 'c'
    [a-c]   # 'a' or 'b' or 'c'
    [-abc]  # '-' or 'a' or 'b' or 'c'
    [abc\-] # '-' or 'a' or 'b' or 'c'
    [^abc]   # any character except 'a' or 'b' or 'c'
    [^a-c]   # any character except 'a' or 'b' or 'c'
    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash `"-"` indicates a range of characters, unless it is
the first character or is escaped with a backslash.

For string `"abcd"`:

    ab[cd]+     # match
    [a-d]+      # match
    [^a-d]+     # no match

--
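
If a character class is assembled from arbitrary characters, escaping the dash
avoids creating a range by accident. The following sketch builds such a class.
`char_class` is a hypothetical helper, and the resulting pattern is exercised
here with Python's `re` module only as a stand-in for Lucene.

```python
import re


def char_class(chars: str, negate: bool = False) -> str:
    """Build a character class that treats every character in `chars` literally."""
    # Escape the characters that can have a special meaning inside brackets.
    body = "".join("\\" + ch if ch in "\\[]^-" else ch for ch in chars)
    return "[" + ("^" if negate else "") + body + "]"


dash_or_letters = char_class("abc-")          # dash is escaped, not a range
not_letters = char_class("abc", negate=True)  # negated class

print(dash_or_letters, not_letters)
```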

===== Optional operators

These operators are only available when they are explicitly enabled, by
passing `flags` to the query.

Multiple flags can be enabled either using the `ALL` flag, or by
concatenating flags with a pipe `"|"`:

    {
        "regexp": {
            "username": {
                "value": "john~athon<1-5>",
                "flags": "COMPLEMENT|INTERVAL"
            }
        }
    }

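A request body like the one above can also be assembled programmatically.
The sketch below only builds the JSON document; `regexp_query` is a
hypothetical helper rather than a client API, and sending the request is left
out.

```python
import json


def regexp_query(field: str, pattern: str, flags=()) -> dict:
    """Build a `regexp` query body; `flags` is an iterable of flag names."""
    clause = {"value": pattern}
    if flags:
        # Multiple flags are concatenated with a pipe, e.g. "COMPLEMENT|INTERVAL".
        clause["flags"] = "|".join(flags)
    return {"regexp": {field: clause}}


body = regexp_query("username", "john~athon<1-5>", ["COMPLEMENT", "INTERVAL"])
print(json.dumps(body, indent=4))
```
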
Complement::
+
--

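The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated: the negated part matches any string, of any
length, that the pattern itself would not match.  For the string `"abcdef"`:

    ab~df     # match
    ab~cf     # match
    ab~cdef   # no match
    a~(cd)f   # match
    a~(bc)def # no match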

Enabled with the `COMPLEMENT` or `ALL` flags.

--
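
One way to reason about a complement placed between a literal prefix and a
literal suffix is to cut both off the string and ask whether the remaining
middle part is anything other than what the negated pattern matches in full.
The sketch below encodes that reading. It is an illustrative model with a
hypothetical `complement_matches` helper and Python's `re` module, not
Lucene's automaton-based implementation.

```python
import re


def complement_matches(prefix: str, negated: str, suffix: str, text: str) -> bool:
    """Emulate `prefix`, then a negated pattern, then `suffix` (both literal)."""
    if len(prefix) + len(suffix) > len(text):
        return False
    if not (text.startswith(prefix) and text.endswith(suffix)):
        return False
    middle = text[len(prefix):len(text) - len(suffix)]
    # The complement accepts every string that `negated` does not fully match.
    return re.fullmatch(negated, middle) is None


print(complement_matches("ab", "d", "f", "abcdef"))    # middle is "cde"
print(complement_matches("ab", "c", "def", "abcdef"))  # middle is "c"
```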

Interval::
+
--

The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string `"foo80"`:

    foo<1-100>     # match
    foo<01-100>    # match
    foo<001-100>   # no match

Enabled with the `INTERVAL` or `ALL` flags.


--
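
The third example suggests that when both bounds are written with the same
number of digits, the number must use exactly that many digits, and that
leading zeros are otherwise not significant. The sketch below encodes that
reading for the numeric part only. The rule is inferred from the examples
above and is an assumption, and `interval_matches` is a hypothetical helper.

```python
import re


def interval_matches(lower: str, upper: str, text: str) -> bool:
    """Emulate an interval with bounds `lower` and `upper` against `text`."""
    if re.fullmatch(r"[0-9]+", text) is None:
        return False
    # Assumption: bounds written with the same number of digits fix the width.
    if len(lower) == len(upper) and len(text) != len(lower):
        return False
    return int(lower) <= int(text) <= int(upper)


for lower, upper in [("1", "100"), ("01", "100"), ("001", "100")]:
    print(lower, upper, interval_matches(lower, upper, "80"))
```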

Intersection::
+
--

The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:

    aaa.+&.+bbb     # match
    aaa&bbb         # no match

Using this feature usually means that you should rewrite your regular
expression.

Enabled with the `INTERSECTION` or `ALL` flags.

--

Any string::
+
--

The at sign `"@"` matches any string in its entirety.  This could be combined
with the intersection and complement above to express ``everything except''.
For instance:

    @&~(foo.+)      # anything except "foo" followed by one or more characters

Enabled with the `ANYSTRING` or `ALL` flags.
--
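
The optional operators compose naturally if each pattern is thought of as a
yes/no test on the whole string. The sketch below models `"@"`, `"&"` and
complement as small combinators over such tests. All names are hypothetical,
and Python's `re.fullmatch` stands in for the standard operators, so this is a
way to reason about the semantics rather than a description of Lucene's
internals.

```python
import re
from typing import Callable

Matcher = Callable[[str], bool]


def regex(pattern: str) -> Matcher:
    """Whole-string matcher for a pattern made of standard operators only."""
    return lambda text: re.fullmatch(pattern, text) is not None


def anystring() -> Matcher:
    """Models the at sign: accepts every string."""
    return lambda text: True


def complement(inner: Matcher) -> Matcher:
    """Models the tilde: accepts exactly the strings `inner` rejects."""
    return lambda text: not inner(text)


def intersection(left: Matcher, right: Matcher) -> Matcher:
    """Models the ampersand: both sides must accept the whole string."""
    return lambda text: left(text) and right(text)


# Equivalent of the "everything except" pattern shown above.
not_foo_prefixed = intersection(anystring(), complement(regex("foo.+")))
# Equivalent of the first intersection example.
starts_and_ends = intersection(regex("aaa.+"), regex(".+bbb"))

print(not_foo_prefixed("bar"), not_foo_prefixed("foobar"), starts_and_ends("aaabbb"))
```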