[[query-string-syntax]] ==== Query string syntax The query string ``mini-language'' is used by the <> and <>, by the `q` query string parameter in the <>. The query string is parsed into a series of _terms_ and _operators_. A term can be a single word -- `quick` or `brown` -- or a phrase, surrounded by double quotes -- `"quick brown"` -- which searches for all the words in the phrase, in the same order. Operators allow you to customize the search -- the available options are explained below. ===== Field names As mentioned in <>, the `default_field` is searched for the search terms, but it is possible to specify other fields in the query syntax: * where the `status` field contains `active` status:active * where the `title` field contains `quick` or `brown` title:(quick brown) * where the `author` field contains the exact phrase `"john smith"` author:"John Smith" * where any of the fields `book.title`, `book.content` or `book.date` contains `quick` or `brown` (note how we need to escape the `*` with a backslash): book.\*:(quick brown) * where the field `title` has no value (or is missing): _missing_:title * where the field `title` has any non-null value: _exists_:title ===== Wildcards Wildcard searches can be run on individual terms, using `?` to replace a single character, and `*` to replace zero or more characters: qu?ck bro* Be aware that wildcard queries can use an enormous amount of memory and perform very badly -- just think how many terms need to be queried to match the query string `"a* b* c*"`. [WARNING] ====== Allowing a wildcard at the beginning of a word (eg `"*ing"`) is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting `allow_leading_wildcard` to `false`. ====== Wildcarded terms are not analyzed by default -- they are lowercased (`lowercase_expanded_terms` defaults to `true`) but no further analysis is done, mainly because it is impossible to accurately analyze a word that is missing some of its letters. However, by setting `analyze_wildcard` to `true`, an attempt will be made to analyze wildcarded words before searching the term list for matching terms. ===== Regular expressions Regular expression patterns can be embedded in the query string by wrapping them in forward-slashes (`"/"`): name:/joh?n(ath[oa]n)/ The supported regular expression syntax is explained in <>. [WARNING] ====== The `allow_leading_wildcard` parameter does not have any control over regular expressions. A query string such as the following would force Elasticsearch to visit every term in the index: /.*n/ Use with caution! ====== ===== Fuzziness We can search for terms that are similar to, but not exactly like our search terms, using the ``fuzzy'' operator: quikc~ brwn~ foks~ This uses the http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance[Damerau-Levenshtein distance] to find all terms with a maximum of two changes, where a change is the insertion, deletion or substitution of a single character, or transposition of two adjacent characters. The default _edit distance_ is `2`, but an edit distance of `1` should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1 ===== Proximity searches While a phrase query (eg `"john smith"`) expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. In the same way that fuzzy queries can specify a maximum edit distance for characters in a word, a proximity search allows us to specify a maximum edit distance of words in a phrase: "fox quick"~5 The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase `"quick fox"` would be considered more relevant than `"quick brown fox"`. ===== Ranges Ranges can be specified for date, numeric or string fields. Inclusive ranges are specified with square brackets `[min TO max]` and exclusive ranges with curly brackets `{min TO max}`. * All days in 2012: date:[2012/01/01 TO 2012/12/31] * Numbers 1..5 count:[1 TO 5] * Tags between `alpha` and `omega`, excluding `alpha` and `omega`: tag:{alpha TO omega} * Numbers from 10 upwards count:[10 TO *] * Dates before 2012 date:{* TO 2012/01/01} The parsing of ranges in query strings can be complex and error prone. It is much more reliable to use an explicit <>. ===== Boosting Use the _boost_ operator `^` to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes: quick^2 fox The default `boost` value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups: "john smith"^2 (foo bar)^4 ===== Boolean operators By default, all terms are optional, as long as one term matches. A search for `foo bar baz` will find any document that contains one or more of `foo` or `bar` or `baz`. We have already discussed the `default_operator` above which allows you to force all terms to be required, but there are also _boolean operators_ which can be used in the query string itself to provide more control. The preferred operators are `+` (this term *must* be present) and `-` (this term *must not* be present). All other terms are optional. For example, this query: quick brown +fox -news states that: * `fox` must be present * `news` must not be present * `quick` and `brown` are optional -- their presence increases the relevance The familiar operators `AND`, `OR` and `NOT` (also written `&&`, `||` and `!`) are also supported. However, the effects of these operators can be more complicated than is obvious at first glance. `NOT` takes precedence over `AND`, which takes precedence over `OR`. While the `+` and `-` only affect the term to the right of the operator, `AND` and `OR` can affect the terms to the left and right. **** Rewriting the above query using `AND`, `OR` and `NOT` demonstrates the complexity: `quick OR brown AND fox AND NOT news`:: This is incorrect, because `brown` is now a required term. `(quick OR brown) AND fox AND NOT news`:: This is incorrect because at least one of `quick` or `brown` is now required and the search for those terms would be scored differently from the original query. `((quick AND fox) OR (brown AND fox) OR fox) AND NOT news`:: This form now replicates the logic from the original query correctly, but the relevance scoring bares little resemblance to the original. In contrast, the same query rewritten using the <> would look like this: { "bool": { "must": { "match": "fox" }, "should": { "match": "quick brown" }, "must_not": { "match": "news" } } } **** ===== Grouping Multiple terms or clauses can be grouped together with parentheses, to form sub-queries: (quick OR brown) AND fox Groups can be used to target a particular field, or to boost the result of a sub-query: status:(active OR pending) title:(full text search)^2 ===== Reserved characters If you need to use any of the characters which function as operators in your query itself (and not as operators), then you should escape them with a leading backslash. For instance, to search for `(1+1)=2`, you would need to write your query as `\(1\+1\)=2`. The reserved characters are: `+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /` Failing to escape these special characters correctly could lead to a syntax error which prevents your query from running. .Watch this space **** A space may also be a reserved character. For instance, if you have a synonym list which converts `"wi fi"` to `"wifi"`, a `query_string` search for `"wi fi"` would fail. The query string parser would interpret your query as a search for `"wi OR fi"`, while the token stored in your index is actually `"wifi"`. Escaping the space will protect it from being touched by the query string parser: `"wi\ fi"`. ****