[DOCS] Added pages explaining lucene query parser syntax and regular expression syntax

Clinton Gormley 2013-10-07 14:42:13 +02:00
parent e9d9ade10f
commit 264a00a40f
7 changed files with 554 additions and 19 deletions


@ -6,6 +6,8 @@ The `regexp` filter is similar to the
that it is cacheable and can speedup performance in case you are reusing
this filter in your queries.
See <<regexp-syntax>> for details of the supported regular expression language.
[source,js]
--------------------------------------------------
{


@ -87,4 +87,3 @@ include::queries/wildcard-query.asciidoc[]
include::queries/minimum-should-match.asciidoc[]
include::queries/multi-term-rewrite.asciidoc[]


@ -19,7 +19,7 @@ The `query_string` top level parameters include:
[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`query` |The actual query to be parsed. See <<query-string-syntax>>.
|`default_field` |The default field for query terms if no prefix field
is specified. Defaults to the `index.query.default_field` index
@ -158,16 +158,4 @@ introduced fields included). For example:
}
--------------------------------------------------
[[Syntax_Extension]]
[float]
==== Syntax Extension
There are several syntax extensions to the Lucene query language.
[float]
===== missing / exists
The `_exists_` and `_missing_` syntax allows you to match docs that have
a value in a field, or that are missing a field. The syntax is
`_exists_:field1` or `_missing_:field`, and can be used anywhere a
query string is used.
include::query-string-syntax.asciidoc[]


@ -0,0 +1,266 @@
[[query-string-syntax]]
==== Query string syntax
The query string ``mini-language'' is used by the
<<query-dsl-query-string-query>> and <<query-dsl-field-query>>, by the
`q` query string parameter in the <<search-search,`search` API>> and
by the `percolate` parameter in the <<docs-index_,`index`>> and
<<docs-bulk,`bulk`>> APIs.
The query string is parsed into a series of _terms_ and _operators_. A
term can be a single word -- `quick` or `brown` -- or a phrase, surrounded by
double quotes -- `"quick brown"` -- which searches for all the words in the
phrase, in the same order.
Operators allow you to customize the search -- the available options are
explained below.
===== Field names
As mentioned in <<query-dsl-query-string-query>>, the `default_field` is searched for the
search terms, but it is possible to specify other fields in the query syntax:
* where the `status` field contains `active`
status:active
* where the `title` field contains `quick` or `brown`
title:(quick brown)
* where the `author` field contains the exact phrase `"john smith"`
author:"John Smith"
* where any of the fields `book.title`, `book.content` or `book.date` contains
`quick` or `brown` (note how we need to escape the `*` with a backslash):
book.\*:(quick brown)
* where the field `title` has no value (or is missing):
_missing_:title
* where the field `title` has any non-null value:
_exists_:title
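For example, a field-scoped search like those above can be sent as a full
`query_string` query in the request body (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "status:active title:(quick brown)"
        }
    }
}
--------------------------------------------------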
===== Wildcards
Wildcard searches can be run on individual terms, using `?` to replace
a single character, and `*` to replace zero or more characters:
qu?ck bro*
Be aware that wildcard queries can use an enormous amount of memory and
perform very badly -- just think how many terms need to be queried to
match the query string `"a* b* c*"`.
[WARNING]
======
Allowing a wildcard at the beginning of a word (eg `"*ing"`) is particularly
heavy, because all terms in the index need to be examined, just in case
they match. Leading wildcards can be disabled by setting
`allow_leading_wildcard` to `false`.
======
Wildcarded terms are not analyzed by default -- they are lowercased
(`lowercase_expanded_terms` defaults to `true`) but no further analysis
is done, mainly because it is impossible to accurately analyze a word that
is missing some of its letters. However, by setting `analyze_wildcard` to
`true`, an attempt will be made to analyze wildcarded words before searching
the term list for matching terms.
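The wildcard-related settings mentioned above are parameters of the
`query_string` query itself, for example (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "qu?ck bro*",
            "analyze_wildcard": true,
            "allow_leading_wildcard": false
        }
    }
}
--------------------------------------------------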
===== Regular expressions
Regular expression patterns can be embedded in the query string by
wrapping them in forward-slashes (`"/"`):
name:/joh?n(ath[oa]n)/
The supported regular expression syntax is explained in <<regexp-syntax>>.
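For example, the pattern above can be embedded in a `query_string` query
like this (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "name:/joh?n(ath[oa]n)/"
        }
    }
}
--------------------------------------------------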
[WARNING]
======
The `allow_leading_wildcard` parameter does not have any control over
regular expressions. A query string such as the following would force
Elasticsearch to visit every term in the index:
/.*n/
Use with caution!
======
===== Fuzziness
We can search for terms that are
similar to, but not exactly like, our search terms, using the ``fuzzy''
operator:
quikc~ brwn~ foks~
This uses the
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance[Damerau-Levenshtein distance]
to find all terms with a maximum of
two changes, where a change is the insertion, deletion
or substitution of a single character, or transposition of two adjacent
characters.
The default _edit distance_ is `2`, but an edit distance of `1` should be
sufficient to catch 80% of all human misspellings. It can be specified as:
quikc~1
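For example, a fuzzy search with an explicit edit distance of `1` for every
term could be sent as follows (a minimal sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "quikc~1 brwn~1 foks~1"
        }
    }
}
--------------------------------------------------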
===== Proximity searches
While a phrase query (eg `"john smith"`) expects all of the terms in exactly
the same order, a proximity query allows the specified words to be further
apart or in a different order. In the same way that fuzzy queries can
specify a maximum edit distance for characters in a word, a proximity search
allows us to specify a maximum edit distance of words in a phrase:
"fox quick"~5
The closer the text in a field is to the original order specified in the
query string, the more relevant that document is considered to be. When
compared to the above example query, the phrase `"quick fox"` would be
considered more relevant than `"quick brown fox"`.
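If you already know which field you want to search, a similar effect can be
achieved in the query DSL with the standard `match_phrase` query and its
`slop` parameter (a sketch, assuming a `title` field):
[source,js]
--------------------------------------------------
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "fox quick",
                "slop": 5
            }
        }
    }
}
--------------------------------------------------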
===== Ranges
Ranges can be specified for date, numeric or string fields. Inclusive ranges
are specified with square brackets `[min TO max]` and exclusive ranges with
curly brackets `{min TO max}`.
* All days in 2012:
date:[2012/01/01 TO 2012/12/31]
* Numbers 1..5
count:[1 TO 5]
* Tags between `alpha` and `omega`, excluding `alpha` and `omega`:
tag:{alpha TO omega}
* Numbers from 10 upwards
count:[10 TO *]
* Dates before 2012
date:{* TO 2012/01/01}
The parsing of ranges in query strings can be complex and error prone. It is
much more reliable to use an explicit <<query-dsl-range-filter,`range` filter>>.
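For instance, the first date example above might be written as an explicit
`range` filter inside a `filtered` query (a sketch; the accepted date format
depends on the mapping of the `date` field):
[source,js]
--------------------------------------------------
{
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": {
                "range": {
                    "date": {
                        "gte": "2012-01-01",
                        "lte": "2012-12-31"
                    }
                }
            }
        }
    }
}
--------------------------------------------------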
===== Boosting
Use the _boost_ operator `^` to make one term more relevant than another.
For instance, if we want to find all documents about foxes, but we are
especially interested in quick foxes:
quick^2 fox
The default `boost` value is 1, but can be any positive floating point number.
Boosts between 0 and 1 reduce relevance.
Boosts can also be applied to phrases or to groups:
"john smith"^2 (foo bar)^4
===== Boolean operators
By default, all terms are optional, as long as one term matches. A search
for `foo bar baz` will find any document that contains one or more of
`foo` or `bar` or `baz`. We have already discussed the `default_operator`
above which allows you to force all terms to be required, but there are
also _boolean operators_ which can be used in the query string itself
to provide more control.
The preferred operators are `+` (this term *must* be present) and `-`
(this term *must not* be present). All other terms are optional.
For example, this query:
quick brown +fox -news
states that:
* `fox` must be present
* `news` must not be present
* `quick` and `brown` are optional -- their presence increases the relevance
The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`)
are also supported. However, the effects of these operators can be more
complicated than is obvious at first glance. `NOT` takes precedence over
`AND`, which takes precedence over `OR`. While the `+` and `-` only affect
the term to the right of the operator, `AND` and `OR` can affect the terms to
the left and right.
****
Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the
complexity:
`quick OR brown AND fox AND NOT news`::
This is incorrect, because `brown` is now a required term.
`(quick OR brown) AND fox AND NOT news`::
This is incorrect because at least one of `quick` or `brown` is now required
and the search for those terms would be scored differently from the original
query.
`((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`::
This form now replicates the logic from the original query correctly, but
the relevance scoring bears little resemblance to the original.
In contrast, the same query rewritten using the <<query-dsl-match-query,`match` query>>
would look like this:
{
    "bool": {
        "must": { "match": "fox" },
        "should": { "match": "quick brown" },
        "must_not": { "match": "news" }
    }
}
****
===== Grouping
Multiple terms or clauses can be grouped together with parentheses, to form
sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result
of a sub-query:
status:(active OR pending) title:(full text search)^2
===== Reserved characters
If you need to use any of the characters which function as operators in your
query itself (and not as operators), then you should escape them with
a leading backslash. For instance, to search for `(1+1)=2`, you would
need to write your query as `\(1\+1\)=2`.
The reserved characters are: `+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
Failing to escape these special characters correctly could lead to a syntax
error which prevents your query from running.
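When the query is passed in a JSON request body, remember that each
backslash must itself be escaped for JSON, so the `(1+1)=2` example above
becomes (a sketch):
[source,js]
--------------------------------------------------
{
    "query": {
        "query_string": {
            "query": "\\(1\\+1\\)=2"
        }
    }
}
--------------------------------------------------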
.Watch this space
****
A space may also be a reserved character. For instance, if you have a
synonym list which converts `"wi fi"` to `"wifi"`, a `query_string` search
for `"wi fi"` would fail. The query string parser would interpret your
query as a search for `"wi OR fi"`, while the token stored in your
index is actually `"wifi"`. Escaping the space will protect it from
being touched by the query string parser: `"wi\ fi"`.
****


@ -2,6 +2,7 @@
=== Regexp Query
The `regexp` query allows you to use regular expression term queries.
See <<regexp-syntax>> for details of the supported regular expression language.
*Note*: The performance of a `regexp` query heavily depends on the
regular expression chosen. Matching everything like `.*` is very slow as
@ -49,6 +50,5 @@ Possible flags are `ALL`, `ANYSTRING`, `AUTOMATON`, `COMPLEMENT`,
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html[Lucene
documentation] for their meaning
For more information see the
http://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Fsearch%2FRegexpQuery.html[Lucene
RegexpQuery documentation].
include::regexp-syntax.asciidoc[]


@ -0,0 +1,280 @@
[[regexp-syntax]]
==== Regular expression syntax
Regular expression queries are supported by the `regexp` and the `query_string`
queries. The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.
[NOTE]
====
We will not attempt to explain regular expressions in general,
just the operators that are supported.
====
===== Standard operators
Anchoring::
+
--
Most regular expression engines allow you to match any part of a string.
If you want the regexp pattern to start at the beginning of the string or
finish at the end of the string, then you have to _anchor_ it specifically,
using `^` to indicate the beginning or `$` to indicate the end.
Lucene's patterns are always anchored. The pattern provided must match
the entire string. For string `"abcde"`:
ab.* # match
abcd # no match
--
Allowed characters::
+
--
Any Unicode characters may be used in the pattern, but certain characters
are reserved and must be escaped. The standard reserved characters are:
....
. ? + * | { } [ ] ( ) " \
....
If you enable optional features (see below) then these characters may
also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash `"\*"` including
a literal backslash character: `"\\"`
Additionally, any characters (except double quotes) are interpreted literally
when surrounded by double quotes:
john"@smith.com"
--
Match any character::
+
--
The period `"."` can be used to represent any character. For string `"abcde"`:
ab... # match
a.c.e # match
--
One-or-more::
+
--
The plus sign `"+"` can be used to repeat the preceding shortest pattern
once or more times. For string `"aaabbb"`:
a+b+ # match
aa+bb+ # match
a+.+ # match
aaa+bbbb+ # no match
--
Zero-or-more::
+
--
The asterisk `"*"` can be used to match the preceding shortest pattern
zero-or-more times. For string `"aaabbb"`:
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
--
Zero-or-one::
+
--
The question mark `"?"` makes the preceding shortest pattern optional. It
matches zero or one times. For string `"aaabbb"`:
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
--
Min-to-max::
+
--
Curly brackets `"{}"` can be used to specify a minimum and (optionally)
a maximum number of times the preceding shortest pattern can repeat. The
allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string `"aaabbb"`:
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
--
Grouping::
+
--
Parentheses `"()"` can be used to form sub-patterns. The quantity operators
listed above operate on the shortest previous pattern, which can be a group.
For string `"ababab"`:
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(....)+ # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
--
Alternation::
+
--
The pipe symbol `"|"` acts as an OR operator. The match will succeed if
the pattern on either the left-hand side OR the right-hand side matches.
The alternation applies to the _longest pattern_, not the shortest.
For string `"aabb"`:
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
--
Character classes::
+
--
Ranges of potential characters may be represented as character classes
by enclosing them in square brackets `"[]"`. A leading `^`
negates the character class. The allowed forms are:
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
Note that the dash `"-"` indicates a range of characters, unless it is
the first character or if it is escaped with a backslash.
For string `"abcd"`:
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
--
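Any of the patterns above can be tried out with a term-level `regexp` query,
for example (a sketch; the `name.first` field is illustrative):
[source,js]
--------------------------------------------------
{
    "query": {
        "regexp": {
            "name.first": "ab[cd]+"
        }
    }
}
--------------------------------------------------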
===== Optional operators
These operators are only available when they are explicitly enabled, by
passing `flags` to the query.
Multiple flags can be enabled either using the `ALL` flag, or by
concatenating flags with a pipe `"|"`:
{
    "regexp": {
        "username": {
            "value": "john~athon<1-5>",
            "flags": "COMPLEMENT|INTERVAL"
        }
    }
}
Complement::
+
--
The complement is probably the most useful option. The shortest pattern that
follows a tilde `"~"` is negated. For the string `"abcdef"`:
ab~df # match
ab~cf # no match
a~(cd)f # match
a~(bc)f # no match
Enabled with the `COMPLEMENT` or `ALL` flags.
--
Interval::
+
--
The interval option enables the use of numeric ranges, enclosed by angle
brackets `"<>"`. For string `"foo80"`:
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the `INTERVAL` or `ALL` flags.
--
Intersection::
+
--
The ampersand `"&"` joins two patterns in a way that both of them have to
match. For string `"aaabbb"`:
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular
expression.
Enabled with the `INTERSECTION` or `ALL` flags.
--
Any string::
+
--
The at sign `"@"` matches any string in its entirety. This could be combined
with the intersection and complement above to express ``everything except''.
For instance:
@&~(foo.+) # anything except string beginning with "foo"
Enabled with the `ANYSTRING` or `ALL` flags.
--


@ -27,7 +27,7 @@ And here is a sample response:
{
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1",
    "_source" : {
        "user" : "kimchy",
        "postDate" : "2009-11-15T14:12:12",