[DOCS] Added pages explaining lucene query parser syntax and regular expression syntax
parent
e9d9ade10f
commit
264a00a40f
|
@ -6,6 +6,8 @@ The `regexp` filter is similar to the
|
|||
that it is cacheable and can speed up performance in case you are reusing
|
||||
this filter in your queries.
|
||||
|
||||
See <<regexp-syntax>> for details of the supported regular expression language.
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
|
|
|
@ -87,4 +87,3 @@ include::queries/wildcard-query.asciidoc[]
|
|||
include::queries/minimum-should-match.asciidoc[]
|
||||
|
||||
include::queries/multi-term-rewrite.asciidoc[]
|
||||
|
||||
|
|
|
@ -19,7 +19,7 @@ The `query_string` top level parameters include:
|
|||
[cols="<,<",options="header",]
|
||||
|=======================================================================
|
||||
|Parameter |Description
|
||||
|`query` |The actual query to be parsed.
|
||||
|`query` |The actual query to be parsed. See <<query-string-syntax>>.
|
||||
|
||||
|`default_field` |The default field for query terms if no prefix field
|
||||
is specified. Defaults to the `index.query.default_field` index
|
||||
|
@ -158,16 +158,4 @@ introduced fields included). For example:
|
|||
}
|
||||
--------------------------------------------------
|
||||
|
||||
[[Syntax_Extension]]
|
||||
[float]
|
||||
==== Syntax Extension
|
||||
|
||||
There are several syntax extensions to the Lucene query language.
|
||||
|
||||
[float]
|
||||
===== missing / exists
|
||||
|
||||
The `_exists_` and `_missing_` syntax allows you to control docs that have
|
||||
fields that exist within them (have a value) or are missing. The syntax
|
||||
is: `_exists_:field1`, `_missing_:field` and can be used anywhere a
|
||||
query string is used.
|
||||
include::query-string-syntax.asciidoc[]
|
||||
|
|
|
@ -0,0 +1,266 @@
|
|||
[[query-string-syntax]]
|
||||
|
||||
==== Query string syntax
|
||||
|
||||
The query string ``mini-language'' is used by the
|
||||
<<query-dsl-query-string-query>> and <<query-dsl-field-query>>, by the
|
||||
`q` query string parameter in the <<search-search,`search` API>> and
|
||||
by the `percolate` parameter in the <<docs-index_,`index`>> and
|
||||
<<docs-bulk,`bulk`>> APIs.
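Because query strings contain characters such as `:`, `(` and spaces, they need URL encoding when passed through the `q` parameter. The following is a minimal Python sketch of that step; the host `localhost:9200`, the index name `twitter` and the helper name `search_url` are assumptions made for illustration only.

```python
from urllib.parse import urlencode


def search_url(base_url, query_string):
    """Build a search URL that passes `query_string` via the `q` parameter.

    `base_url` is a hypothetical node/index address, for example
    "http://localhost:9200/twitter".
    """
    # urlencode percent-encodes reserved URL characters and turns spaces into "+".
    return base_url + "/_search?" + urlencode({"q": query_string})


if __name__ == "__main__":
    print(search_url("http://localhost:9200/twitter", "title:(quick brown)"))
```

Encoding at this layer is separate from escaping reserved query string characters with a backslash, which is covered under reserved characters.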
|
||||
|
||||
The query string is parsed into a series of _terms_ and _operators_. A
|
||||
term can be a single word -- `quick` or `brown` -- or a phrase, surrounded by
|
||||
double quotes -- `"quick brown"` -- which searches for all the words in the
|
||||
phrase, in the same order.
|
||||
|
||||
Operators allow you to customize the search -- the available options are
|
||||
explained below.
|
||||
|
||||
===== Field names
|
||||
|
||||
As mentioned in <<query-dsl-query-string-query>>, the `default_field` is searched for the
|
||||
search terms, but it is possible to specify other fields in the query syntax:
|
||||
|
||||
* where the `status` field contains `active`
|
||||
|
||||
status:active
|
||||
|
||||
* where the `title` field contains `quick` or `brown`
|
||||
|
||||
title:(quick brown)
|
||||
|
||||
* where the `author` field contains the exact phrase `"john smith"`
|
||||
|
||||
author:"John Smith"
|
||||
|
||||
* where any of the fields `book.title`, `book.content` or `book.date` contains
|
||||
`quick` or `brown` (note how we need to escape the `*` with a backslash):
|
||||
|
||||
book.\*:(quick brown)
|
||||
|
||||
* where the field `title` has no value (or is missing):
|
||||
|
||||
_missing_:title
|
||||
|
||||
* where the field `title` has any non-null value:
|
||||
|
||||
_exists_:title
|
||||
|
||||
===== Wildcards
|
||||
|
||||
Wildcard searches can be run on individual terms, using `?` to replace
|
||||
a single character, and `*` to replace zero or more characters:
|
||||
|
||||
qu?ck bro*
|
||||
|
||||
Be aware that wildcard queries can use an enormous amount of memory and
|
||||
perform very badly -- just think how many terms need to be queried to
|
||||
match the query string `"a* b* c*"`.
|
||||
|
||||
[WARNING]
|
||||
======
|
||||
Allowing a wildcard at the beginning of a word (eg `"*ing"`) is particularly
|
||||
heavy, because all terms in the index need to be examined, just in case
|
||||
they match. Leading wildcards can be disabled by setting
|
||||
`allow_leading_wildcard` to `false`.
|
||||
======
|
||||
|
||||
Wildcarded terms are not analyzed by default -- they are lowercased
|
||||
(`lowercase_expanded_terms` defaults to `true`) but no further analysis
|
||||
is done, mainly because it is impossible to accurately analyze a word that
|
||||
is missing some of its letters. However, by setting `analyze_wildcard` to
|
||||
`true`, an attempt will be made to analyze wildcarded words before searching
|
||||
the term list for matching terms.
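To make the cost of wildcard expansion concrete, here is a small Python sketch that expands a wildcard against an in-memory term list. It is only an illustration of the idea, not how Lucene implements it: the term list and the helper name `expand_wildcard` are hypothetical, and Python's `fnmatch` also treats `[` specially, which the query string syntax does not.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, terms):
    """Return the terms matching a `?` / `*` wildcard pattern.

    Every term is tested here. A leading wildcard such as "*ing" offers
    no fixed prefix that could be used to narrow the scan.
    """
    return [term for term in terms if fnmatchcase(term, pattern)]


# Hypothetical, already lowercased terms from an index.
TERMS = ["brother", "brown", "quack", "quick", "running", "sing"]

if __name__ == "__main__":
    for pattern in ("qu?ck", "bro*", "*ing"):
        print(pattern, expand_wildcard(pattern, TERMS))
```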
|
||||
|
||||
===== Regular expressions
|
||||
|
||||
Regular expression patterns can be embedded in the query string by
|
||||
wrapping them in forward-slashes (`"/"`):
|
||||
|
||||
name:/joh?n(ath[oa]n)/
|
||||
|
||||
The supported regular expression syntax is explained in <<regexp-syntax>>.
|
||||
|
||||
[WARNING]
|
||||
======
|
||||
The `allow_leading_wildcard` parameter does not have any control over
|
||||
regular expressions. A query string such as the following would force
|
||||
Elasticsearch to visit every term in the index:
|
||||
|
||||
/.*n/
|
||||
|
||||
Use with caution!
|
||||
======
|
||||
|
||||
===== Fuzziness
|
||||
|
||||
We can search for terms that are
|
||||
similar to, but not exactly like our search terms, using the ``fuzzy''
|
||||
operator:
|
||||
|
||||
quikc~ brwn~ foks~
|
||||
|
||||
This uses the
|
||||
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance[Damerau-Levenshtein distance]
|
||||
to find all terms with a maximum of
|
||||
two changes, where a change is the insertion, deletion
|
||||
or substitution of a single character, or transposition of two adjacent
|
||||
characters.
|
||||
|
||||
The default _edit distance_ is `2`, but an edit distance of `1` should be
|
||||
sufficient to catch 80% of all human misspellings. It can be specified as:
|
||||
|
||||
quikc~1
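The edit distance described above can be sketched in a few lines of Python. This is the "optimal string alignment" variant of the Damerau-Levenshtein distance, written for illustration; it is an assumption that this variant is close enough to show the idea, and it is not the mechanism Lucene uses internally.

```python
def osa_distance(a, b):
    """Count insertions, deletions, substitutions and adjacent transpositions
    needed to turn string `a` into string `b`."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


if __name__ == "__main__":
    for typo, word in (("quikc", "quick"), ("brwn", "brown"), ("foks", "fox")):
        print(typo, word, osa_distance(typo, word))
```

With this measure, `quikc` and `brwn` are one edit away from `quick` and `brown`, while `foks` is two edits away from `fox`.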
|
||||
|
||||
===== Proximity searches
|
||||
|
||||
While a phrase query (eg `"john smith"`) expects all of the terms in exactly
|
||||
the same order, a proximity query allows the specified words to be further
|
||||
apart or in a different order. In the same way that fuzzy queries can
|
||||
specify a maximum edit distance for characters in a word, a proximity search
|
||||
allows us to specify a maximum edit distance of words in a phrase:
|
||||
|
||||
"fox quick"~5
|
||||
|
||||
The closer the text in a field is to the original order specified in the
|
||||
query string, the more relevant that document is considered to be. When
|
||||
compared to the above example query, the phrase `"quick fox"` would be
|
||||
considered more relevant than `"quick brown fox"`.
|
||||
|
||||
===== Ranges
|
||||
|
||||
Ranges can be specified for date, numeric or string fields. Inclusive ranges
|
||||
are specified with square brackets `[min TO max]` and exclusive ranges with
|
||||
curly brackets `{min TO max}`.
|
||||
|
||||
* All days in 2012:
|
||||
|
||||
date:[2012/01/01 TO 2012/12/31]
|
||||
|
||||
* Numbers 1..5
|
||||
|
||||
count:[1 TO 5]
|
||||
|
||||
* Tags between `alpha` and `omega`, excluding `alpha` and `omega`:
|
||||
|
||||
tag:{alpha TO omega}
|
||||
|
||||
* Numbers from 10 upwards
|
||||
|
||||
count:[10 TO *]
|
||||
|
||||
* Dates before 2012
|
||||
|
||||
date:{* TO 2012/01/01}
|
||||
|
||||
The parsing of ranges in query strings can be complex and error prone. It is
|
||||
much more reliable to use an explicit <<query-dsl-range-filter,`range` filter>>.
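As a rough illustration of the explicit alternative, the Python sketch below builds a `range` filter body from the same bounds used in the examples above. The helper name `range_filter` is hypothetical, and the use of the `gte`, `gt`, `lte` and `lt` keys is an assumption about the filter's parameters that should be checked against the `range` filter documentation.

```python
def range_filter(field, lower, upper, inclusive=True):
    """Build a `range` filter body as a plain dict.

    `lower` or `upper` may be "*" to leave that side unbounded, mirroring
    the `[10 TO *]` form of the query string syntax.
    """
    bounds = {}
    if lower != "*":
        bounds["gte" if inclusive else "gt"] = lower
    if upper != "*":
        bounds["lte" if inclusive else "lt"] = upper
    return {"range": {field: bounds}}


if __name__ == "__main__":
    print(range_filter("count", 1, 5))                            # count:[1 TO 5]
    print(range_filter("tag", "alpha", "omega", inclusive=False))  # tag:{alpha TO omega}
    print(range_filter("count", 10, "*"))                         # count:[10 TO *]
```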
|
||||
|
||||
===== Boosting
|
||||
|
||||
Use the _boost_ operator `^` to make one term more relevant than another.
|
||||
For instance, if we want to find all documents about foxes, but we are
|
||||
especially interested in quick foxes:
|
||||
|
||||
quick^2 fox
|
||||
|
||||
The default `boost` value is 1, but can be any positive floating point number.
|
||||
Boosts between 0 and 1 reduce relevance.
|
||||
|
||||
Boosts can also be applied to phrases or to groups:
|
||||
|
||||
"john smith"^2 (foo bar)^4
|
||||
|
||||
===== Boolean operators
|
||||
|
||||
By default, all terms are optional, as long as one term matches. A search
|
||||
for `foo bar baz` will find any document that contains one or more of
|
||||
`foo` or `bar` or `baz`. We have already discussed the `default_operator`
|
||||
above which allows you to force all terms to be required, but there are
|
||||
also _boolean operators_ which can be used in the query string itself
|
||||
to provide more control.
|
||||
|
||||
The preferred operators are `+` (this term *must* be present) and `-`
|
||||
(this term *must not* be present). All other terms are optional.
|
||||
For example, this query:
|
||||
|
||||
quick brown +fox -news
|
||||
|
||||
states that:
|
||||
|
||||
* `fox` must be present
|
||||
* `news` must not be present
|
||||
* `quick` and `brown` are optional -- their presence increases the relevance
|
||||
|
||||
The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`)
|
||||
are also supported. However, the effects of these operators can be more
|
||||
complicated than is obvious at first glance. `NOT` takes precedence over
|
||||
`AND`, which takes precedence over `OR`. While the `+` and `-` only affect
|
||||
the term to the right of the operator, `AND` and `OR` can affect the terms to
|
||||
the left and right.
|
||||
|
||||
****
|
||||
Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the
|
||||
complexity:
|
||||
|
||||
`quick OR brown AND fox AND NOT news`::
|
||||
|
||||
This is incorrect, because `brown` is now a required term.
|
||||
|
||||
`(quick OR brown) AND fox AND NOT news`::
|
||||
|
||||
This is incorrect because at least one of `quick` or `brown` is now required
|
||||
and the search for those terms would be scored differently from the original
|
||||
query.
|
||||
|
||||
`((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`::
|
||||
|
||||
This form now replicates the logic from the original query correctly, but
|
||||
the relevance scoring bears little resemblance to the original.
|
||||
|
||||
In contrast, the same query rewritten using the <<query-dsl-match-query,`match` query>>
|
||||
would look like this:
|
||||
|
||||
{
|
||||
"bool": {
|
||||
"must": { "match": "fox" },
|
||||
"should": { "match": "quick brown" },
|
||||
"must_not": { "match": "news" }
|
||||
}
|
||||
}
|
||||
|
||||
****
|
||||
|
||||
===== Grouping
|
||||
|
||||
Multiple terms or clauses can be grouped together with parentheses, to form
|
||||
sub-queries:
|
||||
|
||||
(quick OR brown) AND fox
|
||||
|
||||
Groups can be used to target a particular field, or to boost the result
|
||||
of a sub-query:
|
||||
|
||||
status:(active OR pending) title:(full text search)^2
|
||||
|
||||
===== Reserved characters
|
||||
|
||||
If you need to use any of the characters which function as operators in your
|
||||
query itself (and not as operators), then you should escape them with
|
||||
a leading backslash. For instance, to search for `(1+1)=2`, you would
|
||||
need to write your query as `\(1\+1\)=2`.
|
||||
|
||||
The reserved characters are: `+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
|
||||
|
||||
Failing to escape these special characters correctly could lead to a syntax
|
||||
error which prevents your query from running.
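A client-side escaping helper might look like the following Python sketch. The function name `escape_query_string` is hypothetical. As a simplifying assumption it escapes every `&` and `|` individually, rather than only the two-character `&&` and `||` operators from the list above.

```python
# Characters that act as operators in the query string syntax.
RESERVED = set('+-&|!(){}[]^"~*?:\\/')


def escape_query_string(text):
    """Prefix every reserved character in `text` with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


if __name__ == "__main__":
    print(escape_query_string("(1+1)=2"))
```

Apply this only to user-supplied text that should be searched literally. Escaping a whole query would also neutralise operators that were meant to be operators.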
|
||||
|
||||
.Watch this space
|
||||
****
|
||||
A space may also be a reserved character. For instance, if you have a
|
||||
synonym list which converts `"wi fi"` to `"wifi"`, a `query_string` search
|
||||
for `"wi fi"` would fail. The query string parser would interpret your
|
||||
query as a search for `"wi OR fi"`, while the token stored in your
|
||||
index is actually `"wifi"`. Escaping the space will protect it from
|
||||
being touched by the query string parser: `"wi\ fi"`.
|
||||
****
|
|
@ -2,6 +2,7 @@
|
|||
=== Regexp Query
|
||||
|
||||
The `regexp` query allows you to use regular expression term queries.
|
||||
See <<regexp-syntax>> for details of the supported regular expression language.
|
||||
|
||||
*Note*: The performance of a `regexp` query heavily depends on the
|
||||
regular expression chosen. Matching everything like `.*` is very slow as
|
||||
|
@ -49,6 +50,5 @@ Possible flags are `ALL`, `ANYSTRING`, `AUTOMATON`, `COMPLEMENT`,
|
|||
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html[Lucene
|
||||
documentation] for their meaning
|
||||
|
||||
For more information see the
|
||||
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Fsearch%2FRegexpQuery.html[Lucene
|
||||
RegexpQuery documentation].
|
||||
|
||||
include::regexp-syntax.asciidoc[]
|
||||
|
|
|
@ -0,0 +1,280 @@
|
|||
[[regexp-syntax]]
|
||||
==== Regular expression syntax
|
||||
|
||||
Regular expression queries are supported by the `regexp` and the `query_string`
|
||||
queries. The Lucene regular expression engine
|
||||
is not Perl-compatible but supports a smaller range of operators.
|
||||
|
||||
[NOTE]
|
||||
====
|
||||
We will not attempt to explain regular expressions, but
|
||||
just explain the supported operators.
|
||||
====
|
||||
|
||||
===== Standard operators
|
||||
|
||||
Anchoring::
|
||||
+
|
||||
--
|
||||
|
||||
Most regular expression engines allow you to match any part of a string.
|
||||
If you want the regexp pattern to start at the beginning of the string or
|
||||
finish at the end of the string, then you have to _anchor_ it specifically,
|
||||
using `^` to indicate the beginning or `$` to indicate the end.
|
||||
|
||||
Lucene's patterns are always anchored. The pattern provided must match
|
||||
the entire string. For string `"abcde"`:
|
||||
|
||||
ab.* # match
|
||||
abcd # no match
|
||||
|
||||
--
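The always-anchored behaviour can be imitated in Python with `re.fullmatch`, as in the sketch below. Python's `re` module is a different engine from Lucene's, so this is only a comparison aid for operators the two share; the helper name `anchored_match` is hypothetical.

```python
import re


def anchored_match(pattern, text):
    """Return True only if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(anchored_match("ab.*", "abcde"))         # whole string is covered
    print(anchored_match("abcd", "abcde"))         # trailing "e" is left over
    print(re.search("abcd", "abcde") is not None)  # unanchored search, for contrast
```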
|
||||
|
||||
Allowed characters::
|
||||
+
|
||||
--
|
||||
|
||||
Any Unicode characters may be used in the pattern, but certain characters
|
||||
are reserved and must be escaped. The standard reserved characters are:
|
||||
|
||||
....
|
||||
. ? + * | { } [ ] ( ) " \
|
||||
....
|
||||
|
||||
If you enable optional features (see below) then these characters may
|
||||
also be reserved:
|
||||
|
||||
# @ & < > ~
|
||||
|
||||
Any reserved character can be escaped with a backslash `"\*"` including
|
||||
a literal backslash character: `"\\"`
|
||||
|
||||
Additionally, any characters (except double quotes) are interpreted literally
|
||||
when surrounded by double quotes:
|
||||
|
||||
john"@smith.com"
|
||||
|
||||
|
||||
--
|
||||
|
||||
Match any character::
|
||||
+
|
||||
--
|
||||
|
||||
The period `"."` can be used to represent any character. For string `"abcde"`:
|
||||
|
||||
ab... # match
|
||||
a.c.e # match
|
||||
|
||||
--
|
||||
|
||||
One-or-more::
|
||||
+
|
||||
--
|
||||
|
||||
The plus sign `"+"` can be used to repeat the preceding shortest pattern
|
||||
one or more times. For string `"aaabbb"`:
|
||||
|
||||
a+b+ # match
|
||||
aa+bb+ # match
|
||||
a+.+ # match
|
||||
aa+bbb+ # match
|
||||
|
||||
--
|
||||
|
||||
Zero-or-more::
|
||||
+
|
||||
--
|
||||
|
||||
The asterisk `"*"` can be used to match the preceding shortest pattern
|
||||
zero-or-more times. For string `"aaabbb"`:
|
||||
|
||||
a*b* # match
|
||||
a*b*c* # match
|
||||
.*bbb.* # match
|
||||
aaa*bbb* # match
|
||||
|
||||
--
|
||||
|
||||
Zero-or-one::
|
||||
+
|
||||
--
|
||||
|
||||
The question mark `"?"` makes the preceding shortest pattern optional. It
|
||||
matches zero or one times. For string `"aaabbb"`:
|
||||
|
||||
aaa?bbb? # match
|
||||
aaaa?bbbb? # match
|
||||
.....?.? # match
|
||||
aa?bb? # no match
|
||||
|
||||
--
|
||||
|
||||
Min-to-max::
|
||||
+
|
||||
--
|
||||
|
||||
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
|
||||
a maximum number of times the preceding shortest pattern can repeat. The
|
||||
allowed forms are:
|
||||
|
||||
{5} # repeat exactly 5 times
|
||||
{2,5} # repeat at least twice and at most 5 times
|
||||
{2,} # repeat at least twice
|
||||
|
||||
For string `"aaabbb"`:
|
||||
|
||||
a{3}b{3} # match
|
||||
a{2,4}b{2,4} # match
|
||||
a{2,}b{2,} # match
|
||||
.{3}.{3} # match
|
||||
a{4}b{4} # no match
|
||||
a{4,6}b{4,6} # no match
|
||||
a{4,}b{4,} # no match
|
||||
|
||||
--
|
||||
|
||||
Grouping::
|
||||
+
|
||||
--
|
||||
|
||||
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
|
||||
listed above operate on the shortest previous pattern, which can be a group.
|
||||
For string `"ababab"`:
|
||||
|
||||
(ab)+ # match
|
||||
ab(ab)+ # match
|
||||
(..)+ # match
|
||||
(...)+ # match
|
||||
(ab)* # match
|
||||
abab(ab)? # match
|
||||
ab(ab)? # no match
|
||||
(ab){3} # match
|
||||
(ab){1,2} # no match
|
||||
|
||||
--
|
||||
|
||||
Alternation::
|
||||
+
|
||||
--
|
||||
|
||||
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
|
||||
the pattern on either the left-hand side OR the right-hand side matches.
|
||||
The alternation applies to the _longest pattern_, not the shortest.
|
||||
For string `"aabb"`:
|
||||
|
||||
aabb|bbaa # match
|
||||
aacc|bb # no match
|
||||
aa(cc|bb) # match
|
||||
a+|b+ # no match
|
||||
a+b+|b+a+ # match
|
||||
a+(b|c)+ # match
|
||||
|
||||
--
|
||||
|
||||
Character classes::
|
||||
+
|
||||
--
|
||||
|
||||
Ranges of potential characters may be represented as character classes
|
||||
by enclosing them in square brackets `"[]"`. A leading `^`
|
||||
negates the character class. The allowed forms are:
|
||||
|
||||
[abc] # 'a' or 'b' or 'c'
|
||||
[a-c] # 'a' or 'b' or 'c'
|
||||
[-abc] # '-' or 'a' or 'b' or 'c'
|
||||
[abc\-] # '-' or 'a' or 'b' or 'c'
|
||||
[^abc] # any character except 'a' or 'b' or 'c'
|
||||
[^a-c] # any character except 'a' or 'b' or 'c'
|
||||
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
|
||||
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
|
||||
|
||||
Note that the dash `"-"` indicates a range of characters, unless it is
|
||||
the first character or if it is escaped with a backslash.
|
||||
|
||||
For string `"abcd"`:
|
||||
|
||||
ab[cd]+ # match
|
||||
[a-d]+ # match
|
||||
[^a-d]+ # no match
|
||||
|
||||
--
|
||||
|
||||
===== Optional operators
|
||||
|
||||
These operators are only available when they are explicitly enabled, by
|
||||
passing `flags` to the query.
|
||||
|
||||
Multiple flags can be enabled either using the `ALL` flag, or by
|
||||
concatenating flags with a pipe `"|"`:
|
||||
|
||||
{
|
||||
"regexp": {
|
||||
"username": {
|
||||
"value": "john~athon<1-5>",
|
||||
"flags": "COMPLEMENT|INTERVAL"
|
||||
}
|
||||
}
|
||||
}
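When the request body is built programmatically, the flag names can be joined with a pipe before serialising. The Python sketch below mirrors the JSON example above; the helper name `regexp_query` is hypothetical.

```python
import json


def regexp_query(field, value, flags):
    """Build a `regexp` query body, joining the flag names with "|"."""
    return {"regexp": {field: {"value": value, "flags": "|".join(flags)}}}


if __name__ == "__main__":
    body = regexp_query("username", "john~athon<1-5>", ["COMPLEMENT", "INTERVAL"])
    print(json.dumps(body, indent=2))
```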
|
||||
|
||||
Complement::
|
||||
+
|
||||
--
|
||||
|
||||
The complement is probably the most useful option. The shortest pattern that
|
||||
follows a tilde `"~"` is negated. For the string `"abcdef"`:
|
||||
|
||||
ab~df # match
|
||||
ab~cf # match
|
||||
a~(cd)f # match
|
||||
a~(bc)f # match
|
||||
|
||||
Enabled with the `COMPLEMENT` or `ALL` flags.
|
||||
|
||||
--
|
||||
|
||||
Interval::
|
||||
+
|
||||
--
|
||||
|
||||
The interval option enables the use of numeric ranges, enclosed by angle
|
||||
brackets `"<>"`. For string `"foo80"`:
|
||||
|
||||
foo<1-100> # match
|
||||
foo<01-100> # match
|
||||
foo<001-100> # no match
|
||||
|
||||
Enabled with the `INTERVAL` or `ALL` flags.
|
||||
|
||||
|
||||
--
|
||||
|
||||
Intersection::
|
||||
+
|
||||
--
|
||||
|
||||
The ampersand `"&"` joins two patterns in a way that both of them have to
|
||||
match. For string `"aaabbb"`:
|
||||
|
||||
aaa.+&.+bbb # match
|
||||
aaa&bbb # no match
|
||||
|
||||
Using this feature usually means that you should rewrite your regular
|
||||
expression.
|
||||
|
||||
Enabled with the `INTERSECTION` or `ALL` flags.
|
||||
|
||||
--
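Intersection can be thought of as requiring every sub-pattern to match the whole string on its own. The Python sketch below imitates that with `re.fullmatch`; it is only an analogy, since Python's `re` has no `&` operator, and the helper name `matches_all` is hypothetical.

```python
import re


def matches_all(patterns, text):
    """Return True if every pattern matches the whole of `text`."""
    return all(re.fullmatch(p, text) is not None for p in patterns)


if __name__ == "__main__":
    # Corresponds to aaa.+&.+bbb against "aaabbb".
    print(matches_all(["aaa.+", ".+bbb"], "aaabbb"))
    # Corresponds to aaa&bbb: neither side covers the whole string.
    print(matches_all(["aaa", "bbb"], "aaabbb"))
```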
|
||||
|
||||
Any string::
|
||||
+
|
||||
--
|
||||
|
||||
The at sign `"@"` matches any string in its entirety. This could be combined
|
||||
with the intersection and complement above to express ``everything except''.
|
||||
For instance:
|
||||
|
||||
@&~(foo.+) # anything except string beginning with "foo"
|
||||
|
||||
Enabled with the `ANYSTRING` or `ALL` flags.
|
||||
--
|
|
@ -27,7 +27,7 @@ And here is a sample response:
|
|||
{
|
||||
"_index" : "twitter",
|
||||
"_type" : "tweet",
|
||||
"_id" : "1",
|
||||
"_id" : "1",
|
||||
"_source" : {
|
||||
"user" : "kimchy",
|
||||
"postDate" : "2009-11-15T14:12:12",
|
||||
|
|