[[regexp-syntax]]
==== Regular expression syntax

Regular expression queries are supported by the `regexp` and the `query_string`
queries.  The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.

[NOTE]
=====
This section does not explain regular expressions in general. It only
describes the supported operators.
=====

===== Standard operators

Anchoring::
+
--

Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.

Lucene's patterns are always anchored.  The pattern provided must match
the entire string. For string `"abcde"`:

    ab.*     # match
    abcd     # no match
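
The effect of implicit anchoring can be approximated outside Elasticsearch.
The sketch below uses Python's `re` module, which is a different engine from
Lucene's and is used here only for illustration: `re.fullmatch` behaves like
an always-anchored pattern, while `re.search` behaves like the unanchored
engines described above. The helper name `lucene_like_match` is hypothetical.

```python
import re

def lucene_like_match(pattern: str, text: str) -> bool:
    """Approximate implicit anchoring: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None

print(lucene_like_match(r"ab.*", "abcde"))      # True: the pattern covers the entire string
print(lucene_like_match(r"abcd", "abcde"))      # False: the trailing "e" is left unmatched
print(re.search(r"abcd", "abcde") is not None)  # True: an unanchored search accepts a substring
```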

--

Allowed characters::
+
--

Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped.  The standard reserved characters are:

....
. ? + * | { } [ ] ( ) " \
....

If you enable optional features (see below) then these characters may
also be reserved:

    # @ & < > ~

Any reserved character can be escaped with a backslash, for example `"\*"`.
This includes a literal backslash character: `"\\"`.

Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:

    john"@smith.com"
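
When a pattern is assembled from user input, the escaping rule above can be
applied mechanically. The following is a minimal sketch, not part of
Elasticsearch: the helper name `escape_lucene_regexp` is hypothetical, and it
assumes that escaping the optional-operator characters is harmless even when
those operators are disabled.

```python
# Standard reserved characters plus those reserved by the optional operators.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')

def escape_lucene_regexp(literal: str) -> str:
    """Backslash-escape every reserved character so the text is matched literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in literal)

print(escape_lucene_regexp("john@smith.com"))  # john\@smith\.com
print(escape_lucene_regexp("a\\b"))            # a\\b
```

Escaping character by character avoids the one case that double quotes cannot
handle, namely text that itself contains a double quote.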

--

Match any character::
+
--

The period `"."` can be used to represent any character.  For string `"abcde"`:

    ab...   # match
    a.c.e   # match

--

One-or-more::
+
--

The plus sign `"+"` can be used to repeat the preceding shortest pattern
one or more times. For string `"aaabbb"`:

    a+b+        # match
    aa+bb+      # match
    a+.+        # match
    aa+bbb+     # match

--

Zero-or-more::
+
--

The asterisk `"*"` can be used to match the preceding shortest pattern
zero or more times.  For string `"aaabbb"`:

    a*b*        # match
    a*b*c*      # match
    .*bbb.*     # match
    aaa*bbb*    # match

--

Zero-or-one::
+
--

The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times.  For string `"aaabbb"`:

    aaa?bbb?    # match
    aaaa?bbbb?  # match
    .....?.?    # match
    aa?bb?      # no match

--

Min-to-max::
+
--

Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat.  The
allowed forms are:

    {5}     # repeat exactly 5 times
    {2,5}   # repeat at least twice and at most 5 times
    {2,}    # repeat at least twice

For string `"aaabbb"`:

    a{3}b{3}        # match
    a{2,4}b{2,4}    # match
    a{2,}b{2,}      # match
    .{3}.{3}        # match
    a{4}b{4}        # no match
    a{4,6}b{4,6}    # no match
    a{4,}b{4,}      # no match
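
The quantifier examples for `"aaabbb"` can be checked with a short script.
This is only a sketch: it relies on Python's `re.fullmatch` to stand in for
Lucene's anchored matching, which works here because the `+`, `*`, `?` and
`{}` operators happen to share the same syntax for these simple patterns.

```python
import re

# (pattern, expected result) pairs taken from the examples for string "aaabbb".
EXAMPLES = [
    ("a+b+", True), ("aa+bbb+", True),              # one-or-more
    ("a*b*c*", True), (".*bbb.*", True),            # zero-or-more
    ("aaaa?bbbb?", True), ("aa?bb?", False),        # zero-or-one
    ("a{2,4}b{2,4}", True), ("a{4,}b{4,}", False),  # min-to-max
]

for pattern, expected in EXAMPLES:
    matched = re.fullmatch(pattern, "aaabbb") is not None
    print(f"{pattern:<14} {'match' if matched else 'no match'}")
```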

--

Grouping::
+
--

Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:

    (ab)+       # match
    ab(ab)+     # match
    (..)+       # match
    (...)+      # match
    (ab)*       # match
    abab(ab)?   # match
    ab(ab)?     # no match
    (ab){3}     # match
    (ab){1,2}   # no match

--

Alternation::
+
--

The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:

    aabb|bbaa   # match
    aacc|bb     # no match
    aa(cc|bb)   # match
    a+|b+       # no match
    a+b+|b+a+   # match
    a+(b|c)+    # match
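
The "longest pattern" rule is easiest to see by comparing an ungrouped
alternation with a grouped one. The sketch below again uses Python's
`re.fullmatch` as a stand-in for Lucene's anchored matching, on the assumption
that alternation precedence is the same for these patterns.

```python
import re

text = "aabb"

# "|" splits the whole pattern: "aacc|bb" means "aacc" OR "bb", and neither equals "aabb".
print(re.fullmatch(r"aacc|bb", text) is not None)    # False

# Parentheses limit the alternation to the suffix: "aa" followed by "cc" or "bb".
print(re.fullmatch(r"aa(cc|bb)", text) is not None)  # True

# Each alternative must match the entire string on its own.
print(re.fullmatch(r"a+|b+", text) is not None)      # False
```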

--

Character classes::
+
--

Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:

    [abc]   # 'a' or 'b' or 'c'
    [a-c]   # 'a' or 'b' or 'c'
    [-abc]  # '-' or 'a' or 'b' or 'c'
    [abc\-] # '-' or 'a' or 'b' or 'c'
    [^abc]  # any character except 'a' or 'b' or 'c'
    [^a-c]  # any character except 'a' or 'b' or 'c'
    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash `"-"` indicates a range of characters, unless it is
the first character or if it is escaped with a backslash.

For string `"abcd"`:

    ab[cd]+     # match
    [a-d]+      # match
    [^a-d]+     # no match
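
The different roles of the dash can be tried out as follows. This is an
illustrative sketch using Python's `re` module, whose character-class syntax
agrees with the forms listed above for these cases. The test string `"a-b"` is
an assumption added for the example.

```python
import re

print(re.fullmatch(r"ab[cd]+", "abcd") is not None)  # True
print(re.fullmatch(r"[^a-d]+", "abcd") is not None)  # False
print(re.fullmatch(r"[-abc]+", "a-b") is not None)   # True: a leading dash is literal
print(re.fullmatch(r"[abc\-]+", "a-b") is not None)  # True: an escaped dash is literal
print(re.fullmatch(r"[a-c]+", "a-b") is not None)    # False: here the dash defines a range
```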

--

===== Optional operators

These operators are available by default because the `flags` parameter defaults
to `ALL`. Different flag combinations (concatenated with `"|"`) can be used to
enable or disable specific operators:

    {
        "regexp": {
            "username": {
                "value": "john~athon<1-5>",
                "flags": "COMPLEMENT|INTERVAL"
            }
        }
    }
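
A client could assemble the same request body programmatically. The following
is a minimal sketch in Python. The helper `regexp_query` is hypothetical and
only builds the JSON structure shown above; it does not send anything to
Elasticsearch.

```python
import json

def regexp_query(field: str, value: str, *flags: str) -> dict:
    """Build a `regexp` query body; flag names are joined with "|"."""
    body = {"value": value}
    if flags:
        # Omitting "flags" leaves the default (ALL) in effect.
        body["flags"] = "|".join(flags)
    return {"regexp": {field: body}}

query = regexp_query("username", "john~athon<1-5>", "COMPLEMENT", "INTERVAL")
print(json.dumps(query, indent=2))
```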

Complement::
+
--

The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated.  For instance, `"ab~cd"` means:

* Starts with `a`
* Followed by `b`
* Followed by a string of any length that is anything but `c`
* Ends with `d`

For the string `"abcdef"`:

    ab~df     # match
    ab~cf     # match
    ab~cdef   # no match
    a~(cb)def # match
    a~(bc)def # no match

Enabled with the `COMPLEMENT` or `ALL` flags.
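
Python's `re` module has no complement operator, but the behaviour can be
emulated for the simple case of a literal prefix, a negated sub-pattern and a
literal suffix. The function below is a hypothetical sketch of that reading,
not Lucene's implementation.

```python
import re

def matches_with_complement(text: str, prefix: str, negated: str, suffix: str) -> bool:
    """Emulate prefix~(negated)suffix, where prefix and suffix are literal strings."""
    if len(text) < len(prefix) + len(suffix):
        return False
    if not (text.startswith(prefix) and text.endswith(suffix)):
        return False
    middle = text[len(prefix):len(text) - len(suffix)]
    # The complement accepts any middle part that the negated pattern does not fully match.
    return re.fullmatch(negated, middle) is None

print(matches_with_complement("abcdef", "ab", "d", "f"))    # True,  like ab~df
print(matches_with_complement("abcdef", "ab", "c", "def"))  # False, like ab~cdef
print(matches_with_complement("abcdef", "a", "cb", "def"))  # True,  like a~(cb)def
```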

--

Interval::
+
--

The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`.  For string: `"foo80"`:

    foo<1-100>     # match
    foo<01-100>    # match
    foo<001-100>   # no match

Enabled with the `INTERVAL` or `ALL` flags.
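
The three examples suggest that zero-padding matters when both bounds are
written with the same number of digits. The sketch below encodes that reading
as an explicit assumption: equal-width bounds require exactly that many
digits, otherwise only the numeric value is compared. The function
`matches_interval` is hypothetical and is not taken from Lucene.

```python
import re

def matches_interval(digits: str, low: str, high: str) -> bool:
    """Emulate <low-high> for a run of digits, under the width assumption stated above."""
    if re.fullmatch(r"[0-9]+", digits) is None:
        return False
    # Assumption: bounds of equal width demand exactly that many digits.
    if len(low) == len(high) and len(digits) != len(low):
        return False
    return int(low) <= int(digits) <= int(high)

print(matches_interval("80", "1", "100"))    # True,  like foo<1-100>
print(matches_interval("80", "01", "100"))   # True,  like foo<01-100>
print(matches_interval("80", "001", "100"))  # False, like foo<001-100>
```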

--

Intersection::
+
--

The ampersand `"&"` joins two patterns in a way that both of them have to
match.  For string `"aaabbb"`:

    aaa.+&.+bbb     # match
    aaa&bbb         # no match

Using this feature usually means that you should rewrite your regular
expression.

Enabled with the `INTERSECTION` or `ALL` flags.
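
An intersection can be emulated by requiring two full matches against the same
string. The helper `matches_intersection` below is a hypothetical Python
sketch of that idea, using `re.fullmatch` in place of Lucene's anchored
matching.

```python
import re

def matches_intersection(text: str, left: str, right: str) -> bool:
    """Emulate left&right: both patterns must match the entire string."""
    return (re.fullmatch(left, text) is not None
            and re.fullmatch(right, text) is not None)

print(matches_intersection("aaabbb", r"aaa.+", r".+bbb"))  # True
print(matches_intersection("aaabbb", r"aaa", r"bbb"))      # False

# For this example a single rewritten pattern accepts the same string.
print(re.fullmatch(r"aaa.*bbb", "aaabbb") is not None)     # True
```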

--

Any string::
+
--

The at sign `"@"` matches any string in its entirety.  This could be combined
with the intersection and complement above to express ``everything except''.
For instance:

    @&~(foo.+)      # anything except "foo" followed by at least one character

Enabled with the `ANYSTRING` or `ALL` flags.
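
In an engine without these operators, the same "everything except" test
reduces to a negated full match. The following Python sketch illustrates this.
The function name is hypothetical, and `re.DOTALL` is used on the assumption
that `"."` should also cover newline characters.

```python
import re

def anything_except_foo_prefix(text: str) -> bool:
    """Emulate @&~(foo.+): accept every string that "foo.+" does not fully match."""
    return re.fullmatch(r"foo.+", text, flags=re.DOTALL) is None

print(anything_except_foo_prefix("bar"))     # True
print(anything_except_foo_prefix("foobar"))  # False
print(anything_except_foo_prefix("foo"))     # True: "foo.+" needs a character after "foo"
```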
--