[DOCS] Added pages explaining lucene query parser syntax and regular expression syntax

Clinton Gormley 2013-10-07 14:42:13 +02:00
parent e9d9ade10f
commit 264a00a40f
7 changed files with 554 additions and 19 deletions


@ -6,6 +6,8 @@ The `regexp` filter is similar to the
that it is cacheable and can speedup performance in case you are reusing
this filter in your queries.
See <<regexp-syntax>> for details of the supported regular expression language.
[source,js]
--------------------------------------------------
{


@ -87,4 +87,3 @@ include::queries/wildcard-query.asciidoc[]
include::queries/minimum-should-match.asciidoc[]
include::queries/multi-term-rewrite.asciidoc[]


@ -19,7 +19,7 @@ The `query_string` top level parameters include:
[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`query` |The actual query to be parsed. See <<query-string-syntax>>.
|`default_field` |The default field for query terms if no prefix field
is specified. Defaults to the `index.query.default_field` index
@ -158,16 +158,4 @@ introduced fields included). For example:
}
--------------------------------------------------
[[Syntax_Extension]]
[float]
==== Syntax Extension
There are several syntax extensions to the Lucene query language.
[float]
===== missing / exists
The `_exists_` and `_missing_` syntax allows you to match docs that have
a value in a field, or that are missing a field. The syntax is
`_exists_:field1` or `_missing_:field`, and can be used anywhere a
query string is used.
include::query-string-syntax.asciidoc[]


@ -0,0 +1,266 @@
[[query-string-syntax]]
==== Query string syntax
The query string ``mini-language'' is used by the
<<query-dsl-query-string-query>> and <<query-dsl-field-query>>, by the
`q` query string parameter in the <<search-search,`search` API>> and
by the `percolate` parameter in the <<docs-index_,`index`>> and
<<docs-bulk,`bulk`>> APIs.
The query string is parsed into a series of _terms_ and _operators_. A
term can be a single word -- `quick` or `brown` -- or a phrase, surrounded by
double quotes -- `"quick brown"` -- which searches for all the words in the
phrase, in the same order.
Operators allow you to customize the search -- the available options are
explained below.
===== Field names
As mentioned in <<query-dsl-query-string-query>>, the `default_field` is searched for the
search terms, but it is possible to specify other fields in the query syntax:
* where the `status` field contains `active`
status:active
* where the `title` field contains `quick` or `brown`
title:(quick brown)
* where the `author` field contains the exact phrase `"john smith"`
author:"John Smith"
* where any of the fields `book.title`, `book.content` or `book.date` contains
`quick` or `brown` (note how we need to escape the `*` with a backslash):
book.\*:(quick brown)
* where the field `title` has no value (or is missing):
_missing_:title
* where the field `title` has any non-null value:
_exists_:title
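For example, a field-scoped search like those above can be sent as a full
`query_string` query in the request body (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "status:active title:(quick brown)"
        }
    }
}
--------------------------------------------------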
===== Wildcards
Wildcard searches can be run on individual terms, using `?` to replace
a single character, and `*` to replace zero or more characters:
qu?ck bro*
Be aware that wildcard queries can use an enormous amount of memory and
perform very badly -- just think how many terms need to be queried to
match the query string `"a* b* c*"`.
[WARNING]
======
Allowing a wildcard at the beginning of a word (eg `"*ing"`) is particularly
heavy, because all terms in the index need to be examined, just in case
they match. Leading wildcards can be disabled by setting
`allow_leading_wildcard` to `false`.
======
Wildcarded terms are not analyzed by default -- they are lowercased
(`lowercase_expanded_terms` defaults to `true`) but no further analysis
is done, mainly because it is impossible to accurately analyze a word that
is missing some of its letters. However, by setting `analyze_wildcard` to
`true`, an attempt will be made to analyze wildcarded words before searching
the term list for matching terms.
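The wildcard-related settings mentioned above are parameters of the
`query_string` query itself, for example (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "qu?ck bro*",
            "analyze_wildcard": true,
            "allow_leading_wildcard": false
        }
    }
}
--------------------------------------------------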
===== Regular expressions
Regular expression patterns can be embedded in the query string by
wrapping them in forward-slashes (`"/"`):
name:/joh?n(ath[oa]n)/
The supported regular expression syntax is explained in <<regexp-syntax>>.
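For example, the pattern above can be embedded in a `query_string` query
like this (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "name:/joh?n(ath[oa]n)/"
        }
    }
}
--------------------------------------------------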
[WARNING]
======
The `allow_leading_wildcard` parameter does not have any control over
regular expressions. A query string such as the following would force
Elasticsearch to visit every term in the index:
/.*n/
Use with caution!
======
===== Fuzziness
We can search for terms that are
similar to, but not exactly like, our search terms, using the ``fuzzy''
operator:
quikc~ brwn~ foks~
This uses the
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance[Damerau-Levenshtein distance]
to find all terms with a maximum of
two changes, where a change is the insertion, deletion
or substitution of a single character, or transposition of two adjacent
characters.
The default _edit distance_ is `2`, but an edit distance of `1` should be
sufficient to catch 80% of all human misspellings. It can be specified as:
quikc~1
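For example, a fuzzy search with an explicit edit distance of `1` for every
term could be sent as follows (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "quikc~1 brwn~1 foks~1"
        }
    }
}
--------------------------------------------------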
===== Proximity searches
While a phrase query (eg `"john smith"`) expects all of the terms in exactly
the same order, a proximity query allows the specified words to be further
apart or in a different order. In the same way that fuzzy queries can
specify a maximum edit distance for characters in a word, a proximity search
allows us to specify a maximum edit distance of words in a phrase:
"fox quick"~5
The closer the text in a field is to the original order specified in the
query string, the more relevant that document is considered to be. When
compared to the above example query, the phrase `"quick fox"` would be
considered more relevant than `"quick brown fox"`.
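If you already know which field you want to search, a similar effect can be
achieved in the query DSL with the standard `match_phrase` query and its
`slop` parameter (a sketch, assuming a `title` field):
[source,js]
--------------------------------------------------
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "fox quick",
                "slop": 5
            }
        }
    }
}
--------------------------------------------------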
===== Ranges
Ranges can be specified for date, numeric or string fields. Inclusive ranges
are specified with square brackets `[min TO max]` and exclusive ranges with
curly brackets `{min TO max}`.
* All days in 2012:
date:[2012/01/01 TO 2012/12/31]
* Numbers 1..5
count:[1 TO 5]
* Tags between `alpha` and `omega`, excluding `alpha` and `omega`:
tag:{alpha TO omega}
* Numbers from 10 upwards
count:[10 TO *]
* Dates before 2012
date:{* TO 2012/01/01}
The parsing of ranges in query strings can be complex and error prone. It is
much more reliable to use an explicit <<query-dsl-range-filter,`range` filter>>.
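For instance, the first date example above might be written as an explicit
`range` filter inside a `filtered` query (a sketch; the accepted date format
depends on the mapping of the `date` field):
[source,js]
--------------------------------------------------
{
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": {
                "range": {
                    "date": {
                        "gte": "2012-01-01",
                        "lte": "2012-12-31"
                    }
                }
            }
        }
    }
}
--------------------------------------------------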
===== Boosting
Use the _boost_ operator `^` to make one term more relevant than another.
For instance, if we want to find all documents about foxes, but we are
especially interested in quick foxes:
quick^2 fox
The default `boost` value is 1, but can be any positive floating point number.
Boosts between 0 and 1 reduce relevance.
Boosts can also be applied to phrases or to groups:
"john smith"^2 (foo bar)^4
===== Boolean operators
By default, all terms are optional, as long as one term matches. A search
for `foo bar baz` will find any document that contains one or more of
`foo` or `bar` or `baz`. We have already discussed the `default_operator`
above which allows you to force all terms to be required, but there are
also _boolean operators_ which can be used in the query string itself
to provide more control.
The preferred operators are `+` (this term *must* be present) and `-`
(this term *must not* be present). All other terms are optional.
For example, this query:
quick brown +fox -news
states that:
* `fox` must be present
* `news` must not be present
* `quick` and `brown` are optional -- their presence increases the relevance
The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`)
are also supported. However, the effects of these operators can be more
complicated than is obvious at first glance. `NOT` takes precedence over
`AND`, which takes precedence over `OR`. While the `+` and `-` only affect
the term to the right of the operator, `AND` and `OR` can affect the terms to
the left and right.
****
Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the
complexity:
`quick OR brown AND fox AND NOT news`::
This is incorrect, because `brown` is now a required term.
`(quick OR brown) AND fox AND NOT news`::
This is incorrect because at least one of `quick` or `brown` is now required
and the search for those terms would be scored differently from the original
query.
`((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`::
This form now replicates the logic from the original query correctly, but
the relevance scoring bears little resemblance to the original.
In contrast, the same query rewritten using the <<query-dsl-match-query,`match` query>>
would look like this:
{
    "bool": {
        "must": { "match": "fox" },
        "should": { "match": "quick brown" },
        "must_not": { "match": "news" }
    }
}
****
===== Grouping
Multiple terms or clauses can be grouped together with parentheses, to form
sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result
of a sub-query:
status:(active OR pending) title:(full text search)^2
===== Reserved characters
If you need to use any of the characters which function as operators in your
query itself (and not as operators), then you should escape them with
a leading backslash. For instance, to search for `(1+1)=2`, you would
need to write your query as `\(1\+1\)=2`.
The reserved characters are: `+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
Failing to escape these special characters correctly could lead to a syntax
error which prevents your query from running.
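When the query is passed in a JSON request body, remember that each
backslash must itself be escaped for JSON, so the `(1+1)=2` example above
becomes (a sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "\\(1\\+1\\)=2"
        }
    }
}
--------------------------------------------------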
.Watch this space
****
A space may also be a reserved character. For instance, if you have a
synonym list which converts `"wi fi"` to `"wifi"`, a `query_string` search
for `"wi fi"` would fail. The query string parser would interpret your
query as a search for `"wi OR fi"`, while the token stored in your
index is actually `"wifi"`. Escaping the space will protect it from
being touched by the query string parser: `"wi\ fi"`.
****


@ -2,6 +2,7 @@
=== Regexp Query
The `regexp` query allows you to use regular expression term queries.
See <<regexp-syntax>> for details of the supported regular expression language.
*Note*: The performance of a `regexp` query heavily depends on the
regular expression chosen. Matching everything like `.*` is very slow as
@ -49,6 +50,5 @@ Possible flags are `ALL`, `ANYSTRING`, `AUTOMATON`, `COMPLEMENT`,
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html[Lucene
documentation] for their meaning
For more information see the
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Fsearch%2FRegexpQuery.html[Lucene
RegexpQuery documentation].
include::regexp-syntax.asciidoc[]


@ -0,0 +1,280 @@
[[regexp-syntax]]
==== Regular expression syntax
Regular expression queries are supported by the `regexp` and the `query_string`
queries. The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.
[NOTE]
====
We will not attempt to explain regular expressions in general,
just the operators that are supported.
====
===== Standard operators
Anchoring::
+
--
Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.
Lucene's patterns are always anchored. The pattern provided must match
the entire string. For string `"abcde"`:
ab.* # match
abcd # no match
--
Allowed characters::
+
--
Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped. The standard reserved characters are:
....
. ? + * | { } [ ] ( ) " \
....
If you enable optional features (see below) then these characters may
also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash `"\*"` including
a literal backslash character: `"\\"`
Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:
john"@smith.com"
--
Match any character::
+
--
The period `"."` can be used to represent any character. For string `"abcde"`:
ab... # match
a.c.e # match
--
One-or-more::
+
--
The plus sign `"+"` can be used to repeat the preceding shortest pattern
once or more times. For string `"aaabbb"`:
a+b+ # match
aa+bb+ # match
a+.+ # match
aaa+bbbb+ # no match
--
Zero-or-more::
+
--
The asterisk `"*"` can be used to match the preceding shortest pattern
zero-or-more times. For string `"aaabbb"`:
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
--
Zero-or-one::
+
--
The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times. For string `"aaabbb"`:
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
--
Min-to-max::
+
--
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat. The
allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string `"aaabbb"`:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
--
Grouping::
+
--
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(....)+ # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
--
Alternation::
+
--
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
--
Character classes::
+
--
Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
Note that the dash `"-"` indicates a range of characters, unless it is
the first character or if it is escaped with a backslash.
For string `"abcd"`:
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
--
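Any of the patterns above can be tried out with a term-level `regexp` query,
for example (a sketch; the `name.first` field is illustrative):
[source,js]
--------------------------------------------------
{
    "query": {
        "regexp": {
            "name.first": "ab[cd]+"
        }
    }
}
--------------------------------------------------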
===== Optional operators
These operators are only available when they are explicitly enabled, by
passing `flags` to the query.
Multiple flags can be enabled either using the `ALL` flag, or by
concatenating flags with a pipe `"|"`:
{
    "regexp": {
        "username": {
            "value": "john~athon<1-5>",
            "flags": "COMPLEMENT|INTERVAL"
        }
    }
}
Complement::
+
--
The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated. For the string `"abcdef"`:
ab~df # match
ab~cf # no match
a~(cd)f # match
a~(bc)f # no match
Enabled with the `COMPLEMENT` or `ALL` flags.
--
Interval::
+
--
The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string `"foo80"`:
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the `INTERVAL` or `ALL` flags.
--
Intersection::
+
--
The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular
expression.
Enabled with the `INTERSECTION` or `ALL` flags.
--
Any string::
+
--
The at sign `"@"` matches any string in its entirety. This could be combined
with the intersection and complement above to express ``everything except''.
For instance:
@&~(foo.+) # anything except string beginning with "foo"
Enabled with the `ANYSTRING` or `ALL` flags.
--


@ -27,7 +27,7 @@ And here is a sample response:
{
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1",
    "_source" : {
        "user" : "kimchy",
        "postDate" : "2009-11-15T14:12:12",