OpenSearch/docs/reference/query-dsl/regexp-syntax.asciidoc

[[regexp-syntax]]
== Regular expression syntax

A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
match patterns in data using placeholder characters, called operators.

{es} supports regular expressions in the following queries:

* <<query-dsl-regexp-query, `regexp`>>
* <<query-dsl-query-string-query, `query_string`>>

{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
engine to parse these queries.

[discrete]
[[regexp-reserved-characters]]
=== Reserved characters
Lucene's regular expression engine supports all Unicode characters. However, the
following characters are reserved as operators:

....
. ? + * | { } [ ] ( ) " \
....

Depending on the <<regexp-optional-operators, optional operators>> enabled, the
following characters may also be reserved:

....
# @ & < >  ~
....

To use one of these characters literally, escape it with a preceding
backslash or surround it with double quotes. For example:

....
\@                  # renders as a literal '@'
\\                  # renders as a literal '\'
"john@smith.com"    # renders as 'john@smith.com'
....


[discrete]
[[regexp-standard-operators]]
=== Standard operators

Lucene's regular expression engine does not use the
https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
Compatible Regular Expressions (PCRE)] library, but it does support the
following standard operators.

`.`::
+
--
Matches any character. For example:

....
ab.     # matches 'aba', 'abb', 'abz', etc.
....
--

`?`::
+
--
Repeat the preceding character zero or one times. Often used to make the
preceding character optional. For example:

....
abc?     # matches 'ab' and 'abc'
....
--

`+`::
+
--
Repeat the preceding character one or more times. For example:

....
ab+     # matches 'ab', 'abb', 'abbb', etc.
....
--

`*`::
+
--
Repeat the preceding character zero or more times. For example:

....
ab*     # matches 'a', 'ab', 'abb', 'abbb', etc.
....
--

`{}`::
+
--
Minimum and maximum number of times the preceding character can repeat. For
example:

....
a{2}    # matches 'aa'
a{2,4}  # matches 'aa', 'aaa', and 'aaaa'
a{2,}   # matches 'a` repeated two or more times
....
--

`|`::
+
--
OR operator. The match will succeed if the longest pattern on either the left
side OR the right side matches. For example:
....
abc|xyz  # matches 'abc' and 'xyz'
....
--

`( … )`::
+
--
Forms a group. You can use a group to treat part of the expression as a single
character. For example:

....
abc(def)?  # matches 'abc' and 'abcdef' but not 'abcd'
....
--

`[ … ]`::
+
--
Match one of the characters in the brackets. For example:

....
[abc]   # matches 'a', 'b', 'c'
....

Inside the brackets, `-` indicates a range unless `-` is the first character or
escaped. For example:

....
[a-c]   # matches 'a', 'b', or 'c'
[-abc]  # '-' is first character. Matches '-', 'a', 'b', or 'c'
[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
....

A `^` before a character in the brackets negates the character or range. For
example:

....
[^abc]      # matches any character except 'a', 'b', or 'c'
[^a-c]      # matches any character except 'a', 'b', or 'c'
[^-abc]     # matches any character except '-', 'a', 'b', or 'c'
[^abc\-]    # matches any character except 'a', 'b', 'c', or '-'
....
--

[discrete]
[[regexp-optional-operators]]
=== Optional operators

You can use the `flags` parameter to enable more optional operators for
Lucene's regular expression engine.

To enable multiple operators, use a `|` separator. For example, a `flags` value
of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.

[discrete]
==== Valid values

`ALL` (Default)::
Enables all optional operators.

`COMPLEMENT`::
+
--
Enables the `~` operator. You can use `~` to negate the shortest following
pattern. For example:

....
a~bc   # matches 'adc' and 'aec' but not 'abc'
....
--

`INTERVAL`::
+
--
Enables the `<>` operators. You can use `<>` to match a numeric range. For
example:

....
foo<1-100>      # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
foo<01-100>     # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
....
--

`INTERSECTION`::
+
--
Enables the `&` operator, which acts as an AND operator. The match will succeed
if patterns on both the left side AND the right side matches. For example:

....
aaa.+&.+bbb  # matches 'aaabbb'
....
--

`ANYSTRING`::
+
--
Enables the `@` operator. You can use `@` to match any entire
string.

You can combine the `@` operator with `&` and `~` operators to create an
"everything except" logic. For example:

....
@&~(abc.+)  # matches everything except terms beginning with 'abc'
....
--

[discrete]
[[regexp-unsupported-operators]]
=== Unsupported operators
Lucene's regular expression engine does not support anchor operators, such as
`^` (beginning of line) or `$` (end of line). To match a term, the regular
expression must match the entire string.