[DOCS] Added pages explaining lucene query parser syntax and regular expression syntax

Clinton Gormley 2013-10-07 14:42:13 +02:00
parent e9d9ade10f
commit 264a00a40f
7 changed files with 554 additions and 19 deletions


@ -6,6 +6,8 @@ The `regexp` filter is similar to the
that it is cacheable and can speedup performance in case you are reusing
this filter in your queries.
See <<regexp-syntax>> for details of the supported regular expression language.
[source,js]
--------------------------------------------------
{


@ -87,4 +87,3 @@ include::queries/wildcard-query.asciidoc[]
include::queries/minimum-should-match.asciidoc[]
include::queries/multi-term-rewrite.asciidoc[]


@ -19,7 +19,7 @@ The `query_string` top level parameters include:
[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`query` |The actual query to be parsed.
|`query` |The actual query to be parsed. See <<query-string-syntax>>.
|`default_field` |The default field for query terms if no prefix field
is specified. Defaults to the `index.query.default_field` index
@ -158,16 +158,4 @@ introduced fields included). For example:
}
--------------------------------------------------
[[Syntax_Extension]]
[float]
==== Syntax Extension
There are several syntax extensions to the Lucene query language.
[float]
===== missing / exists
The `_exists_` and `_missing_` syntax makes it possible to match docs
according to whether a field has any value (`_exists_:field1`) or has no
value at all (`_missing_:field1`). It can be used anywhere a query string
is used.
include::query-string-syntax.asciidoc[]


@ -0,0 +1,266 @@
[[query-string-syntax]]
==== Query string syntax
The query string ``mini-language'' is used by the
<<query-dsl-query-string-query>> and <<query-dsl-field-query>>, by the
`q` query string parameter in the <<search-search,`search` API>> and
by the `percolate` parameter in the <<docs-index_,`index`>> and
<<docs-bulk,`bulk`>> APIs.
The query string is parsed into a series of _terms_ and _operators_. A
term can be a single word -- `quick` or `brown` -- or a phrase, surrounded by
double quotes -- `"quick brown"` -- which searches for all the words in the
phrase, in the same order.
Operators allow you to customize the search -- the available options are
explained below.
===== Field names
As mentioned in <<query-dsl-query-string-query>>, the `default_field` is searched for the
search terms, but it is possible to specify other fields in the query syntax:
* where the `status` field contains `active`
status:active
* where the `title` field contains `quick` or `brown`
title:(quick brown)
* where the `author` field contains the exact phrase `"john smith"`
author:"John Smith"
* where any of the fields `book.title`, `book.content` or `book.date` contains
`quick` or `brown` (note how we need to escape the `*` with a backslash):
book.\*:(quick brown)
* where the field `title` has no value (or is missing):
_missing_:title
* where the field `title` has any non-null value:
_exists_:title
===== Wildcards
Wildcard searches can be run on individual terms, using `?` to replace
a single character, and `*` to replace zero or more characters:
qu?ck bro*
Be aware that wildcard queries can use an enormous amount of memory and
perform very badly -- just think how many terms need to be queried to
match the query string `"a* b* c*"`.
[WARNING]
======
Allowing a wildcard at the beginning of a word (eg `"*ing"`) is particularly
heavy, because all terms in the index need to be examined, just in case
they match. Leading wildcards can be disabled by setting
`allow_leading_wildcard` to `false`.
======
Wildcarded terms are not analyzed by default -- they are lowercased
(`lowercase_expanded_terms` defaults to `true`) but no further analysis
is done, mainly because it is impossible to accurately analyze a word that
is missing some of its letters. However, by setting `analyze_wildcard` to
`true`, an attempt will be made to analyze wildcarded words before searching
the term list for matching terms.
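The `?`/`*` semantics are the same as shell-style globbing, so Python's standard `fnmatch` module is a convenient way to sanity-check a wildcard pattern locally (a sketch for illustration only, not part of Elasticsearch):

```python
from fnmatch import fnmatchcase

# `?` stands for exactly one character, `*` for zero or more --
# the same semantics the query string parser applies to wildcarded terms.
print(fnmatchcase("quick", "qu?ck"))    # True
print(fnmatchcase("brown", "bro*"))     # True
print(fnmatchcase("quicker", "qu?ck"))  # False: the pattern fixes the length at 5
```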
===== Regular expressions
Regular expression patterns can be embedded in the query string by
wrapping them in forward-slashes (`"/"`):
name:/joh?n(ath[oa]n)/
The supported regular expression syntax is explained in <<regexp-syntax>>.
[WARNING]
======
The `allow_leading_wildcard` parameter does not have any control over
regular expressions. A query string such as the following would force
Elasticsearch to visit every term in the index:
/.*n/
Use with caution!
======
===== Fuzziness
We can search for terms that are
similar to, but not exactly like our search terms, using the ``fuzzy''
operator:
quikc~ brwn~ foks~
This uses the
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance[Damerau-Levenshtein distance]
to find all terms with a maximum of
two changes, where a change is the insertion, deletion
or substitution of a single character, or transposition of two adjacent
characters.
The default _edit distance_ is `2`, but an edit distance of `1` should be
sufficient to catch 80% of all human misspellings. It can be specified as:
quikc~1
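The distance itself is simple to compute. The following is a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance described above, written for illustration; it is not Elasticsearch or Lucene code:

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance: insertions, deletions,
    substitutions, and transpositions of two adjacent characters,
    each counted as one change."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("quikc", "quick"))  # 1: a single transposition
print(edit_distance("brwn", "brown"))   # 1: a single insertion
```

With the default edit distance of `2`, all three misspellings in the example above are within range of their intended terms.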
===== Proximity searches
While a phrase query (eg `"john smith"`) expects all of the terms in exactly
the same order, a proximity query allows the specified words to be further
apart or in a different order. In the same way that fuzzy queries can
specify a maximum edit distance for characters in a word, a proximity search
allows us to specify a maximum edit distance of words in a phrase:
"fox quick"~5
The closer the text in a field is to the original order specified in the
query string, the more relevant that document is considered to be. When
compared to the above example query, the phrase `"quick fox"` would be
considered more relevant than `"quick brown fox"`.
===== Ranges
Ranges can be specified for date, numeric or string fields. Inclusive ranges
are specified with square brackets `[min TO max]` and exclusive ranges with
curly brackets `{min TO max}`.
* All days in 2012:
date:[2012/01/01 TO 2012/12/31]
* Numbers 1..5
count:[1 TO 5]
* Tags between `alpha` and `omega`, excluding `alpha` and `omega`:
tag:{alpha TO omega}
* Numbers from 10 upwards
count:[10 TO *]
* Dates before 2012
date:{* TO 2012/01/01}
The parsing of ranges in query strings can be complex and error prone. It is
much more reliable to use an explicit <<query-dsl-range-filter,`range` filter>>.
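As a sketch of that alternative, the query string range `count:[10 TO *]` corresponds to a filter body like the following, shown here as a Python dict (the field name `count` is just illustrative):

```python
# Illustrative only: the body of an explicit range filter equivalent to
# the query string range count:[10 TO *].
range_filter = {
    "range": {
        "count": {
            "gte": 10  # inclusive lower bound; no upper bound given
        }
    }
}
```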
===== Boosting
Use the _boost_ operator `^` to make one term more relevant than another.
For instance, if we want to find all documents about foxes, but we are
especially interested in quick foxes:
quick^2 fox
The default `boost` value is 1, but can be any positive floating point number.
Boosts between 0 and 1 reduce relevance.
Boosts can also be applied to phrases or to groups:
"john smith"^2 (foo bar)^4
===== Boolean operators
By default, all terms are optional, as long as one term matches. A search
for `foo bar baz` will find any document that contains one or more of
`foo` or `bar` or `baz`. We have already discussed the `default_operator`
above which allows you to force all terms to be required, but there are
also _boolean operators_ which can be used in the query string itself
to provide more control.
The preferred operators are `+` (this term *must* be present) and `-`
(this term *must not* be present). All other terms are optional.
For example, this query:
quick brown +fox -news
states that:
* `fox` must be present
* `news` must not be present
* `quick` and `brown` are optional -- their presence increases the relevance
The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`)
are also supported. However, the effects of these operators can be more
complicated than is obvious at first glance. `NOT` takes precedence over
`AND`, which takes precedence over `OR`. While the `+` and `-` only affect
the term to the right of the operator, `AND` and `OR` can affect the terms to
the left and right.
****
Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the
complexity:
`quick OR brown AND fox AND NOT news`::
This is incorrect, because `brown` is now a required term.
`(quick OR brown) AND fox AND NOT news`::
This is incorrect because at least one of `quick` or `brown` is now required
and the search for those terms would be scored differently from the original
query.
`((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`::
This form now replicates the logic from the original query correctly, but
the relevance scoring bears little resemblance to the original.
In contrast, the same query rewritten using the <<query-dsl-match-query,`match` query>>
would look like this:
{
    "bool": {
        "must":     { "match": { "title": "fox"         }},
        "should":   { "match": { "title": "quick brown" }},
        "must_not": { "match": { "title": "news"        }}
    }
}
****
===== Grouping
Multiple terms or clauses can be grouped together with parentheses, to form
sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result
of a sub-query:
status:(active OR pending) title:(full text search)^2
===== Reserved characters
If you need to use any of the characters which function as operators in your
query itself (and not as operators), then you should escape them with
a leading backslash. For instance, to search for `(1+1)=2`, you would
need to write your query as `\(1\+1\)=2`.
The reserved characters are: `+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
Failing to escape these special characters correctly could lead to a syntax
error which prevents your query from running.
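Escaping can be done mechanically. A minimal sketch (the helper name is ours, not an Elasticsearch API) that prefixes each reserved character with a backslash; `&` and `|` are escaped individually, which also covers the two-character operators `&&` and `||`:

```python
def escape_query_string(text):
    # Hypothetical helper: backslash-escape every query_string
    # reserved character so it is treated literally, not as an operator.
    reserved = '+-!(){}[]^"~*?:\\/&|'
    return "".join("\\" + ch if ch in reserved else ch for ch in text)

print(escape_query_string("(1+1)=2"))  # \(1\+1\)=2
```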
.Watch this space
****
A space may also be a reserved character. For instance, if you have a
synonym list which converts `"wi fi"` to `"wifi"`, a `query_string` search
for `"wi fi"` would fail. The query string parser would interpret your
query as a search for `"wi OR fi"`, while the token stored in your
index is actually `"wifi"`. Escaping the space will protect it from
being touched by the query string parser: `"wi\ fi"`.
****


@ -2,6 +2,7 @@
=== Regexp Query
The `regexp` query allows you to use regular expression term queries.
See <<regexp-syntax>> for details of the supported regular expression language.
*Note*: The performance of a `regexp` query heavily depends on the
regular expression chosen. Matching everything like `.*` is very slow as
@ -49,6 +50,5 @@ Possible flags are `ALL`, `ANYSTRING`, `AUTOMATON`, `COMPLEMENT`,
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html[Lucene
documentation] for their meaning.
For more information see the
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Fsearch%2FRegexpQuery.html[Lucene
RegexpQuery documentation].
include::regexp-syntax.asciidoc[]


@ -0,0 +1,280 @@
[[regexp-syntax]]
==== Regular expression syntax
Regular expression queries are supported by the `regexp` and the `query_string`
queries. The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.
[NOTE]
====
This section does not attempt to teach regular expressions; it only
documents the operators that are supported.
====
===== Standard operators
Anchoring::
+
--
Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.
Lucene's patterns are always anchored. The pattern provided must match
the entire string. For string `"abcde"`:
ab.* # match
abcd # no match
--
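Python's `re.fullmatch` imposes the same whole-string requirement, so it can be used to illustrate the anchoring behaviour (bear in mind that Python's regex dialect differs from Lucene's in other respects):

```python
import re

# Like a Lucene regexp, re.fullmatch only succeeds if the pattern
# covers the entire string -- no explicit ^ or $ is needed.
print(bool(re.fullmatch("ab.*", "abcde")))  # True
print(bool(re.fullmatch("abcd", "abcde")))  # False: the final "e" is left over
```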
Allowed characters::
+
--
Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped. The standard reserved characters are:
....
. ? + * | { } [ ] ( ) " \
....
If you enable optional features (see below) then these characters may
also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash, `"\*"` for
instance, including the backslash character itself: `"\\"`
Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:
john"@smith.com"
--
Match any character::
+
--
The period `"."` can be used to represent any character. For string `"abcde"`:
ab... # match
a.c.e # match
--
One-or-more::
+
--
The plus sign `"+"` can be used to repeat the preceding shortest pattern
once or more times. For string `"aaabbb"`:
a+b+ # match
aa+bb+ # match
a+.+ # match
aaaa+bbbb+  # no match
--
Zero-or-more::
+
--
The asterisk `"*"` can be used to match the preceding shortest pattern
zero-or-more times. For string `"aaabbb"`:
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
--
Zero-or-one::
+
--
The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times. For string `"aaabbb"`:
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
--
Min-to-max::
+
--
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat. The
allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string `"aaabbb"`:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
--
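The `{min,max}` forms behave the same way in most regex dialects; for instance, Python reproduces the `"aaabbb"` results above, provided `re.fullmatch` is used to get Lucene's whole-string anchoring:

```python
import re

s = "aaabbb"
print(bool(re.fullmatch(r"a{3}b{3}", s)))      # True: exactly three of each
print(bool(re.fullmatch(r"a{2,4}b{2,4}", s)))  # True: three is within 2..4
print(bool(re.fullmatch(r"a{4,}b{4,}", s)))    # False: needs at least four of each
```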
Grouping::
+
--
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(....)+     # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
--
Alternation::
+
--
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
--
Character classes::
+
--
Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
Note that the dash `"-"` indicates a range of characters, unless it is
the first character or it is escaped with a backslash.
For string `"abcd"`:
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
--
===== Optional operators
These operators are only available when they are explicitly enabled, by
passing `flags` to the query.
Multiple flags can be enabled either using the `ALL` flag, or by
concatenating flags with a pipe `"|"`:
{
"regexp": {
"username": {
"value": "john~athon<1-5>",
"flags": "COMPLEMENT|INTERVAL"
}
}
}
Complement::
+
--
The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated. For the string `"abcdef"`:
ab~df # match
ab~cf # no match
a~(cd)f # match
a~(bc)f # no match
Enabled with the `COMPLEMENT` or `ALL` flags.
--
Interval::
+
--
The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string: `"foo80"`:
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the `INTERVAL` or `ALL` flags.
--
Intersection::
+
--
The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular
expression.
Enabled with the `INTERSECTION` or `ALL` flags.
--
Any string::
+
--
The at sign `"@"` matches any string in its entirety. This could be combined
with the intersection and complement above to express ``everything except''.
For instance:
@&~(foo.+) # anything except string beginning with "foo"
Enabled with the `ANYSTRING` or `ALL` flags.
--


@ -27,7 +27,7 @@ And here is a sample response:
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",