[DOCS] Added pages explaining lucene query parser syntax and regular expression syntax

Clinton Gormley 2013-10-07 14:42:13 +02:00
parent e9d9ade10f
commit 264a00a40f
7 changed files with 554 additions and 19 deletions


@ -6,6 +6,8 @@ The `regexp` filter is similar to the
that it is cacheable and can speedup performance in case you are reusing
this filter in your queries.
See <<regexp-syntax>> for details of the supported regular expression language.
[source,js]
--------------------------------------------------
{


@ -87,4 +87,3 @@ include::queries/wildcard-query.asciidoc[]
include::queries/minimum-should-match.asciidoc[]
include::queries/multi-term-rewrite.asciidoc[]


@ -19,7 +19,7 @@ The `query_string` top level parameters include:
[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`query` |The actual query to be parsed.
|`query` |The actual query to be parsed. See <<query-string-syntax>>.
|`default_field` |The default field for query terms if no prefix field
is specified. Defaults to the `index.query.default_field` index
@ -158,16 +158,4 @@ introduced fields included). For example:
}
--------------------------------------------------
[[Syntax_Extension]]
[float]
==== Syntax Extension
There are several syntax extensions to the Lucene query language.
[float]
===== missing / exists
The `_exists_` and `_missing_` syntax makes it possible to match docs
according to whether a field has any value (`_exists_:field1`) or has no
value at all (`_missing_:field1`). It can be used anywhere a query string
is used.
include::query-string-syntax.asciidoc[]


@ -0,0 +1,266 @@
[[query-string-syntax]]
==== Query string syntax
The query string ``mini-language'' is used by the
<<query-dsl-query-string-query>> and <<query-dsl-field-query>>, by the
`q` query string parameter in the <<search-search,`search` API>> and
by the `percolate` parameter in the <<docs-index_,`index`>> and
<<docs-bulk,`bulk`>> APIs.
The query string is parsed into a series of _terms_ and _operators_. A
term can be a single word -- `quick` or `brown` -- or a phrase, surrounded by
double quotes -- `"quick brown"` -- which searches for all the words in the
phrase, in the same order.
Operators allow you to customize the search -- the available options are
explained below.
===== Field names
As mentioned in <<query-dsl-query-string-query>>, the `default_field` is searched for the
search terms, but it is possible to specify other fields in the query syntax:
* where the `status` field contains `active`
status:active
* where the `title` field contains `quick` or `brown`
title:(quick brown)
* where the `author` field contains the exact phrase `"john smith"`
author:"John Smith"
* where any of the fields `book.title`, `book.content` or `book.date` contains
`quick` or `brown` (note how we need to escape the `*` with a backslash):
book.\*:(quick brown)
* where the field `title` has no value (or is missing):
_missing_:title
* where the field `title` has any non-null value:
_exists_:title
===== Wildcards
Wildcard searches can be run on individual terms, using `?` to replace
a single character, and `*` to replace zero or more characters:
qu?ck bro*
Be aware that wildcard queries can use an enormous amount of memory and
perform very badly -- just think how many terms need to be queried to
match the query string `"a* b* c*"`.
[WARNING]
======
Allowing a wildcard at the beginning of a word (eg `"*ing"`) is particularly
heavy, because all terms in the index need to be examined, just in case
they match. Leading wildcards can be disabled by setting
`allow_leading_wildcard` to `false`.
======
Wildcarded terms are not analyzed by default -- they are lowercased
(`lowercase_expanded_terms` defaults to `true`) but no further analysis
is done, mainly because it is impossible to accurately analyze a word that
is missing some of its letters. However, by setting `analyze_wildcard` to
`true`, an attempt will be made to analyze wildcarded words before searching
the term list for matching terms.
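The `?`/`*` semantics are the same as shell-style globbing, so Python's standard `fnmatch` module is a convenient way to sanity-check a wildcard pattern locally (a sketch for illustration only, not part of Elasticsearch):

```python
from fnmatch import fnmatchcase

# `?` stands for exactly one character, `*` for zero or more --
# the same semantics the query string parser applies to wildcarded terms.
print(fnmatchcase("quick", "qu?ck"))    # True
print(fnmatchcase("brown", "bro*"))     # True
print(fnmatchcase("quicker", "qu?ck"))  # False: the pattern fixes the length at 5
```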
===== Regular expressions
Regular expression patterns can be embedded in the query string by
wrapping them in forward-slashes (`"/"`):
name:/joh?n(ath[oa]n)/
The supported regular expression syntax is explained in <<regexp-syntax>>.
[WARNING]
======
The `allow_leading_wildcard` parameter does not have any control over
regular expressions. A query string such as the following would force
Elasticsearch to visit every term in the index:
/.*n/
Use with caution!
======
===== Fuzziness
We can search for terms that are
similar to, but not exactly like our search terms, using the ``fuzzy''
operator:
quikc~ brwn~ foks~
This uses the
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance[Damerau-Levenshtein distance]
to find all terms with a maximum of
two changes, where a change is the insertion, deletion
or substitution of a single character, or transposition of two adjacent
characters.
The default _edit distance_ is `2`, but an edit distance of `1` should be
sufficient to catch 80% of all human misspellings. It can be specified as:
quikc~1
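The distance itself is simple to compute. The following is a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance described above, written for illustration; it is not Elasticsearch or Lucene code:

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance: insertions, deletions,
    substitutions, and transpositions of two adjacent characters,
    each counted as one change."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("quikc", "quick"))  # 1: a single transposition
print(edit_distance("brwn", "brown"))   # 1: a single insertion
```

With the default edit distance of `2`, all three misspellings in the example above are within range of their intended terms.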
===== Proximity searches
While a phrase query (eg `"john smith"`) expects all of the terms in exactly
the same order, a proximity query allows the specified words to be further
apart or in a different order. In the same way that fuzzy queries can
specify a maximum edit distance for characters in a word, a proximity search
allows us to specify a maximum edit distance of words in a phrase:
"fox quick"~5
The closer the text in a field is to the original order specified in the
query string, the more relevant that document is considered to be. When
compared to the above example query, the phrase `"quick fox"` would be
considered more relevant than `"quick brown fox"`.
===== Ranges
Ranges can be specified for date, numeric or string fields. Inclusive ranges
are specified with square brackets `[min TO max]` and exclusive ranges with
curly brackets `{min TO max}`.
* All days in 2012:
date:[2012/01/01 TO 2012/12/31]
* Numbers 1..5
count:[1 TO 5]
* Tags between `alpha` and `omega`, excluding `alpha` and `omega`:
tag:{alpha TO omega}
* Numbers from 10 upwards
count:[10 TO *]
* Dates before 2012
date:{* TO 2012/01/01}
The parsing of ranges in query strings can be complex and error prone. It is
much more reliable to use an explicit <<query-dsl-range-filter,`range` filter>>.
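As a sketch of that alternative, the query string range `count:[10 TO *]` corresponds to a filter body like the following, shown here as a Python dict (the field name `count` is just illustrative):

```python
# Illustrative only: the body of an explicit range filter equivalent to
# the query string range count:[10 TO *].
range_filter = {
    "range": {
        "count": {
            "gte": 10  # inclusive lower bound; no upper bound given
        }
    }
}
```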
===== Boosting
Use the _boost_ operator `^` to make one term more relevant than another.
For instance, if we want to find all documents about foxes, but we are
especially interested in quick foxes:
quick^2 fox
The default `boost` value is 1, but can be any positive floating point number.
Boosts between 0 and 1 reduce relevance.
Boosts can also be applied to phrases or to groups:
"john smith"^2 (foo bar)^4
===== Boolean operators
By default, all terms are optional, as long as one term matches. A search
for `foo bar baz` will find any document that contains one or more of
`foo` or `bar` or `baz`. We have already discussed the `default_operator`
above which allows you to force all terms to be required, but there are
also _boolean operators_ which can be used in the query string itself
to provide more control.
The preferred operators are `+` (this term *must* be present) and `-`
(this term *must not* be present). All other terms are optional.
For example, this query:
quick brown +fox -news
states that:
* `fox` must be present
* `news` must not be present
* `quick` and `brown` are optional -- their presence increases the relevance
The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`)
are also supported. However, the effects of these operators can be more
complicated than is obvious at first glance. `NOT` takes precedence over
`AND`, which takes precedence over `OR`. While the `+` and `-` only affect
the term to the right of the operator, `AND` and `OR` can affect the terms to
the left and right.
****
Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the
complexity:
`quick OR brown AND fox AND NOT news`::
This is incorrect, because `brown` is now a required term.
`(quick OR brown) AND fox AND NOT news`::
This is incorrect because at least one of `quick` or `brown` is now required
and the search for those terms would be scored differently from the original
query.
`((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`::
This form now replicates the logic from the original query correctly, but
the relevance scoring bears little resemblance to the original.
In contrast, the same query rewritten using the <<query-dsl-match-query,`match` query>>
would look like this:
{
    "bool": {
        "must":     { "match": { "title": "fox"         }},
        "should":   { "match": { "title": "quick brown" }},
        "must_not": { "match": { "title": "news"        }}
    }
}
****
===== Grouping
Multiple terms or clauses can be grouped together with parentheses, to form
sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result
of a sub-query:
status:(active OR pending) title:(full text search)^2
===== Reserved characters
If you need to use any of the characters which function as operators in your
query itself (and not as operators), then you should escape them with
a leading backslash. For instance, to search for `(1+1)=2`, you would
need to write your query as `\(1\+1\)=2`.
The reserved characters are: `+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
Failing to escape these special characters correctly could lead to a syntax
error which prevents your query from running.
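Escaping can be done mechanically. A minimal sketch (the helper name is ours, not an Elasticsearch API) that prefixes each reserved character with a backslash; `&` and `|` are escaped individually, which also covers the two-character operators `&&` and `||`:

```python
def escape_query_string(text):
    # Hypothetical helper: backslash-escape every query_string
    # reserved character so it is treated literally, not as an operator.
    reserved = '+-!(){}[]^"~*?:\\/&|'
    return "".join("\\" + ch if ch in reserved else ch for ch in text)

print(escape_query_string("(1+1)=2"))  # \(1\+1\)=2
```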
.Watch this space
****
A space may also be a reserved character. For instance, if you have a
synonym list which converts `"wi fi"` to `"wifi"`, a `query_string` search
for `"wi fi"` would fail. The query string parser would interpret your
query as a search for `"wi OR fi"`, while the token stored in your
index is actually `"wifi"`. Escaping the space will protect it from
being touched by the query string parser: `"wi\ fi"`.
****


@ -2,6 +2,7 @@
=== Regexp Query
The `regexp` query allows you to use regular expression term queries.
See <<regexp-syntax>> for details of the supported regular expression language.
*Note*: The performance of a `regexp` query heavily depends on the
regular expression chosen. Matching everything like `.*` is very slow as
@ -49,6 +50,5 @@ Possible flags are `ALL`, `ANYSTRING`, `AUTOMATON`, `COMPLEMENT`,
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html[Lucene
documentation] for their meaning.
For more information see the
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Fsearch%2FRegexpQuery.html[Lucene
RegexpQuery documentation].
include::regexp-syntax.asciidoc[]


@ -0,0 +1,280 @@
[[regexp-syntax]]
==== Regular expression syntax
Regular expression queries are supported by the `regexp` and the `query_string`
queries. The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.
[NOTE]
====
This section does not attempt to teach regular expressions; it only
documents the operators that are supported.
====
===== Standard operators
Anchoring::
+
--
Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.
Lucene's patterns are always anchored. The pattern provided must match
the entire string. For string `"abcde"`:
ab.* # match
abcd # no match
--
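Python's `re.fullmatch` imposes the same whole-string requirement, so it can be used to illustrate the anchoring behaviour (bear in mind that Python's regex dialect differs from Lucene's in other respects):

```python
import re

# Like a Lucene regexp, re.fullmatch only succeeds if the pattern
# covers the entire string -- no explicit ^ or $ is needed.
print(bool(re.fullmatch("ab.*", "abcde")))  # True
print(bool(re.fullmatch("abcd", "abcde")))  # False: the final "e" is left over
```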
Allowed characters::
+
--
Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped. The standard reserved characters are:
....
. ? + * | { } [ ] ( ) " \
....
If you enable optional features (see below) then these characters may
also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash, `"\*"` for
instance, including the backslash character itself: `"\\"`
Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:
john"@smith.com"
--
Match any character::
+
--
The period `"."` can be used to represent any character. For string `"abcde"`:
ab... # match
a.c.e # match
--
One-or-more::
+
--
The plus sign `"+"` can be used to repeat the preceding shortest pattern
once or more times. For string `"aaabbb"`:
a+b+ # match
aa+bb+ # match
a+.+ # match
aaaa+bbbb+  # no match
--
Zero-or-more::
+
--
The asterisk `"*"` can be used to match the preceding shortest pattern
zero-or-more times. For string `"aaabbb"`:
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
--
Zero-or-one::
+
--
The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times. For string `"aaabbb"`:
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
--
Min-to-max::
+
--
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat. The
allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string `"aaabbb"`:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
--
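The `{min,max}` forms behave the same way in most regex dialects; for instance, Python reproduces the `"aaabbb"` results above, provided `re.fullmatch` is used to get Lucene's whole-string anchoring:

```python
import re

s = "aaabbb"
print(bool(re.fullmatch(r"a{3}b{3}", s)))      # True: exactly three of each
print(bool(re.fullmatch(r"a{2,4}b{2,4}", s)))  # True: three is within 2..4
print(bool(re.fullmatch(r"a{4,}b{4,}", s)))    # False: needs at least four of each
```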
Grouping::
+
--
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(....)+     # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
--
Alternation::
+
--
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
--
Character classes::
+
--
Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
Note that the dash `"-"` indicates a range of characters, unless it is
the first character or it is escaped with a backslash.
For string `"abcd"`:
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
--
===== Optional operators
These operators are only available when they are explicitly enabled, by
passing `flags` to the query.
Multiple flags can be enabled either using the `ALL` flag, or by
concatenating flags with a pipe `"|"`:
{
"regexp": {
"username": {
"value": "john~athon<1-5>",
"flags": "COMPLEMENT|INTERVAL"
}
}
}
Complement::
+
--
The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated. For the string `"abcdef"`:
ab~df # match
ab~cf # no match
a~(cd)f # match
a~(bc)f # no match
Enabled with the `COMPLEMENT` or `ALL` flags.
--
Interval::
+
--
The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string: `"foo80"`:
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the `INTERVAL` or `ALL` flags.
--
Intersection::
+
--
The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular
expression.
Enabled with the `INTERSECTION` or `ALL` flags.
--
Any string::
+
--
The at sign `"@"` matches any string in its entirety. This could be combined
with the intersection and complement above to express ``everything except''.
For instance:
@&~(foo.+) # anything except string beginning with "foo"
Enabled with the `ANYSTRING` or `ALL` flags.
--


@ -27,7 +27,7 @@ And here is a sample response:
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",