From c092af6cfa3041801fdd1737ba443b93721c53c5 Mon Sep 17 00:00:00 2001 From: Peter Carlson Date: Wed, 15 May 2002 13:51:50 +0000 Subject: [PATCH] Adding Query Parser Syntax to give people a better idea of what Lucene can do and does "out of the box". git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149753 13f79535-47bb-0310-9956-ffa450edef68 --- docs/queryparsersyntax.html | 680 ++++++++++++++++++++++++++++++++++++ xdocs/queryparsersyntax.xml | 133 +++++++ 2 files changed, 813 insertions(+) create mode 100644 docs/queryparsersyntax.html create mode 100644 xdocs/queryparsersyntax.xml diff --git a/docs/queryparsersyntax.html b/docs/queryparsersyntax.html new file mode 100644 index 00000000000..92e89541340 --- /dev/null +++ b/docs/queryparsersyntax.html @@ -0,0 +1,680 @@ + + + + + + + + + + + + + + + + + + + Jakarta Lucene - + Query Parser Syntax - Jakarta Lucene + + + + + + + + + + +
+ + +Jakarta Lucene +
+ + + + + + + + + + + + +
+
+
+


+ +
+ + + + +
+ + Overview + +
+
+

Although Lucene provides the ability to create your own queries through its API, it also provides a rich query language through the QueryParser.

+

This page describes the syntax of Lucene's Query Parser, a lexer that interprets a string into a Lucene Query using JavaCC.

+
+

+

+ + + + +
+ + Terms + +
+
+

A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.

+

A Single Term is a single word such as "test" or "hello".

+

A Phrase is a group of words surrounded by double quotes such as "hello dolly".

+

Multiple terms can be combined together with Boolean operators to form a more complex query (see below).

+
+

+

+ + + + +
+ + Fields + +
+
+

Lucene supports fielded data. When performing a search you can either specify a field or use the default field. The field names and the default field are implementation specific.

+

You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.

+

As an example, let's assume a Lucene index contains two fields, title and text, and that text is the default field. + If you want to find the document entitled "The Right Way" which contains the text "don't go this way", you can enter:

+
+ + + + + + + + + + + + + + + + +
title:"The Right Way" AND text:go
+
+

or

+
+ + + + + + + + + + + + + + + + +
title:"The Right Way" AND go
+
+

Since text is the default field, the field indicator is not required.

+

Note: The field is only valid for the term that it directly precedes, so the query

+
+ + + + + + + + + + + + + + + + +
title:Do it right
+
+

will find only "Do" in the title field. It will look for "it" and "right" in the default field (in this case the text field).
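The scoping rule in the note above can be illustrated with a small sketch. This is not Lucene's QueryParser; the class `FieldScopeSketch` and its method `assignFields` are hypothetical names, and the sketch assumes simple whitespace-separated terms with no phrases or operators.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only (hypothetical helper, not part of Lucene).
final class FieldScopeSketch {

    // Pairs each whitespace-separated term with the field it is searched in.
    // A "field:" prefix applies only to the term it is attached to; every
    // other term falls back to the default field.
    static List<String> assignFields(String query, String defaultField) {
        List<String> result = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty query
            }
            boolean hasField = token.indexOf(':') > 0;
            result.add(hasField ? token : defaultField + ":" + token);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [title:Do, text:it, text:right]
        System.out.println(assignFields("title:Do it right", "text"));
    }
}
```

Only the first term carries the title field; the remaining terms are resolved against the assumed default field, text.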

+
+

+

+ + + + +
+ + Term Modifiers + +
+
+

Lucene supports modifying query terms to provide a wide range of searching options.

+ + + + +
+ + Wildcard Searches + +
+
+

Lucene supports single and multiple character wildcard searches.

+

To perform a single character wildcard search use the "?" symbol.

+

To perform a multiple character wildcard search use the "*" symbol.

+

The single character wildcard search looks for terms that match the search term with the "?" replaced by any single character. For example, to search for "text" or "test" you can use the search:

+
+ + + + + + + + + + + + + + + + +
te?t
+
+

Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:

+
+ + + + + + + + + + + + + + + + +
test*
+
+

You can also use the wildcard searches in the middle of a term.

+
+ + + + + + + + + + + + + + + + +
te*t
+
+

Note: You cannot use a * or ? symbol as the first character of a search.
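The wildcard rules above can be approximated with a regular expression. The following is a minimal sketch under that assumption, not the way Lucene evaluates wildcard queries; `WildcardSketch`, `toRegex` and `matches` are hypothetical names.

```java
import java.util.regex.Pattern;

// Illustrative sketch only (hypothetical helper, not part of Lucene).
final class WildcardSketch {

    // Translates a wildcard term into a regular expression:
    // "?" becomes "." (exactly one character), "*" becomes ".*" (zero or more).
    static Pattern toRegex(String wildcard) {
        if (wildcard.startsWith("*") || wildcard.startsWith("?")) {
            throw new IllegalArgumentException(
                "a wildcard cannot be the first character of a search");
        }
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "text"));    // true
        System.out.println(matches("test*", "tester")); // true
        System.out.println(matches("te?t", "tet"));     // false
    }
}
```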

+
+

+ + + + +
+ + Fuzzy Searches + +
+
+

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search, use the tilde symbol, "~", at the end of a term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:

+
+ + + + + + + + + + + + + + + + +
roam~
+
+

This search will find terms like foam and roams.

+

Note: Terms found by the fuzzy search will automatically get a boost factor of 0.2.
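To show why foam and roams count as similar to roam, here is a minimal sketch of the textbook Levenshtein distance. It is a generic dynamic-programming version, not Lucene's fuzzy query code, and `EditDistanceSketch` is a hypothetical name. How Lucene turns a distance into a match decision is not covered here.

```java
// Illustrative sketch only (hypothetical helper, not part of Lucene).
final class EditDistanceSketch {

    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b, computed row by row.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("roam", "foam"));  // 1 (one substitution)
        System.out.println(distance("roam", "roams")); // 1 (one insertion)
    }
}
```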

+
+

+ + + + +
+ + Boosting a Term + +
+
+

Lucene provides the relevance level of matching documents based on the terms found. To boost a term, use the caret symbol, "^", with a boost factor (a number) at the end of the term you are searching for. The higher the boost factor, the more relevant the term will be.

+

Boosting allows you to control the relevance of a document by boosting its terms. For example, if you are searching for

+
+ + + + + + + + + + + + + + + + +
IBM Microsoft
+
+

and you want the term "IBM" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

+
+ + + + + + + + + + + + + + + + +
IBM^4 Microsoft
+
+

This will make documents with the term IBM appear more relevant. You can also boost Phrase Terms as in the example:

+
+ + + + + + + + + + + + + + + + +
"Microsoft Word"^4 "Microsoft Excel"
+
+

By default, the boost factor is 1.
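The boost notation can be read as a term followed by an optional factor. The sketch below only splits a clause into those two parts, with 1 as the default factor. It says nothing about how Lucene uses the factor when scoring, and `BoostSketch` is a hypothetical name.

```java
// Illustrative sketch only (hypothetical helper, not part of Lucene).
final class BoostSketch {

    // Returns the term or phrase without its "^factor" suffix.
    static String term(String clause) {
        int caret = clause.lastIndexOf('^');
        return caret < 0 ? clause : clause.substring(0, caret);
    }

    // Returns the boost factor, or the default of 1 when none is given.
    static double boost(String clause) {
        int caret = clause.lastIndexOf('^');
        return caret < 0 ? 1.0 : Double.parseDouble(clause.substring(caret + 1));
    }

    public static void main(String[] args) {
        System.out.println(term("IBM^4") + " -> " + boost("IBM^4"));         // IBM -> 4.0
        System.out.println(term("Microsoft") + " -> " + boost("Microsoft")); // Microsoft -> 1.0
    }
}
```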

+
+

+
+

+

+ + + + +
+ + Boolean operators + +
+
+

Boolean operators allow terms to be combined through logic operators. + Lucene supports AND, "+", OR, NOT and "-" as Boolean operators. (Note: Boolean operators must be ALL CAPS.)

+ + + + +
+ + OR + +
+
+

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. + The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. + For example, to search for documents that contain either "Microsoft Word" or just "Microsoft":

+
+ + + + + + + + + + + + + + + + +
"Microsoft Word" Microsoft
+
+

or

+
+ + + + + + + + + + + + + + + + +
"Microsoft Word" OR Microsoft
+
+
+

+ + + + +
+ + AND + +
+
+

The AND operator matches documents where both terms exist anywhere in the text of a single document. + This is equivalent to an intersection using sets. + For example, to search for documents that contain "Microsoft Word" and "Microsoft Excel":

+
+ + + + + + + + + + + + + + + + +
"Microsoft Word" AND "Microsoft Excel"
+
+
+

+ + + + +
+ + + + +
+
+

The "+" or required operator requires that the term after the "+" symbol exist somewhere in a field of a single document. For example, to search for documents that must contain jakarta and may contain apache:

+
+ + + + + + + + + + + + + + + + +
+jakarta apache
+
+
+

+ + + + +
+ + NOT + +
+
+

The NOT operator excludes documents that contain the term after NOT. + This is equivalent to a difference using sets. + For example, to search for documents that contain "Microsoft Word" but not "Microsoft Excel":

+
+ + + + + + + + + + + + + + + + +
"Microsoft Word" NOT "Microsoft Excel"
+
+
+

+ + + + +
+ + - + +
+
+

The "-" or prohibit operator excludes documents that contain the term after the "-" symbol. For example, to search for documents that contain "Microsoft Word" but not "Microsoft Excel":

+
+ + + + + + + + + + + + + + + + +
"Microsoft Word" -"Microsoft Excel"
+
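The set analogies used for OR, AND and NOT can be made concrete with plain Java sets. In this sketch the document IDs are invented for illustration and `BooleanSetSketch` is a hypothetical name. It mirrors the union, intersection and difference described above and is not how Lucene executes Boolean queries.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only (hypothetical helper, not part of Lucene).
final class BooleanSetSketch {

    // OR: documents matching either clause (union).
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND: documents matching both clauses (intersection).
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // NOT: documents matching the first clause but not the second (difference).
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up IDs of documents containing each phrase.
        Set<Integer> word = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> excel = new TreeSet<>(Arrays.asList(2, 3, 4));
        System.out.println(or(word, excel));  // [1, 2, 3, 4]
        System.out.println(and(word, excel)); // [2, 3]
        System.out.println(not(word, excel)); // [1]
    }
}
```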
+
+

+
+

+

+ + + + +
+ + Grouping + +
+
+

Lucene supports using parentheses to group clauses to form subqueries. This can be very useful if you want to control the Boolean logic for a query. + For example, to search for either "jakarta" or "apache" and "website":

+
+ + + + + + + + + + + + + + + + +
(jakarta OR apache) AND website
+
+

This eliminates any confusion and makes sure that website must exist and that at least one of the terms jakarta or apache must also exist.
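The effect of the parentheses can be seen by evaluating two groupings against the same document. The sketch below treats a document as a set of terms and writes each grouping as a Java Boolean expression. `GroupingSketch` and the sample document are assumptions made for illustration, and the sketch makes no claim about how Lucene interprets a query written without parentheses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only (hypothetical helper, not part of Lucene).
final class GroupingSketch {

    // (jakarta OR apache) AND website
    static boolean groupedOr(Set<String> terms) {
        return (terms.contains("jakarta") || terms.contains("apache"))
            && terms.contains("website");
    }

    // jakarta OR (apache AND website): a different grouping of the same terms.
    static boolean groupedAnd(Set<String> terms) {
        return terms.contains("jakarta")
            || (terms.contains("apache") && terms.contains("website"));
    }

    public static void main(String[] args) {
        // A made-up document that mentions jakarta but not website.
        Set<String> doc = new HashSet<>(Arrays.asList("jakarta", "lucene"));
        System.out.println(groupedOr(doc));  // false: website is required
        System.out.println(groupedAnd(doc)); // true: jakarta alone is enough
    }
}
```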

+
+

+

+
+
+
+
+ Copyright © 1999-2002, Apache Software Foundation +
+
+ + + + + + + + + + + + + + + + + + + + + + + diff --git a/xdocs/queryparsersyntax.xml b/xdocs/queryparsersyntax.xml new file mode 100644 index 00000000000..208e39ef5c0 --- /dev/null +++ b/xdocs/queryparsersyntax.xml @@ -0,0 +1,133 @@ + + + + Peter Carlson + + Query Parser Syntax - Jakarta Lucene + + + +
+

Although Lucene provides the ability to create your own queries through its API, it also provides a rich query language through the QueryParser.

+

This page describes the syntax of Lucene's Query Parser, a lexer that interprets a string into a Lucene Query using JavaCC.

+
+
+

A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.

+

A Single Term is a single word such as "test" or "hello".

+

A Phrase is a group of words surrounded by double quotes such as "hello dolly".

+

Multiple terms can be combined together with Boolean operators to form a more complex query (see below).

+
+ +
+

Lucene supports fielded data. When performing a search you can either specify a field or use the default field. The field names and the default field are implementation specific.

+

You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.

+

As an example, let's assume a Lucene index contains two fields, title and text, and that text is the default field. + If you want to find the document entitled "The Right Way" which contains the text "don't go this way", you can enter:

+ + title:"The Right Way" AND text:go +

or

+ title:"The Right Way" AND go +

Since text is the default field, the field indicator is not required.

+ +

Note: The field is only valid for the term that it directly precedes, so the query

+ title:Do it right +

will find only "Do" in the title field. It will look for "it" and "right" in the default field (in this case the text field).

+
+ +
+

Lucene supports modifying query terms to provide a wide range of searching options.

+ + +

Lucene supports single and multiple character wildcard searches.

+

To perform a single character wildcard search use the "?" symbol.

+

To perform a multiple character wildcard search use the "*" symbol.

+

The single character wildcard search looks for terms that match the search term with the "?" replaced by any single character. For example, to search for "text" or "test" you can use the search:

+ + te?t + +

Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:

+ test* +

You can also use the wildcard searches in the middle of a term.

+ te*t +

Note: You cannot use a * or ? symbol as the first character of a search.

+
+ + + +

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search, use the tilde symbol, "~", at the end of a term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:

+ + roam~ +

This search will find terms like foam and roams.

+

Note: Terms found by the fuzzy search will automatically get a boost factor of 0.2.

+
+ + + +

Lucene provides the relevance level of matching documents based on the terms found. To boost a term, use the caret symbol, "^", with a boost factor (a number) at the end of the term you are searching for. The higher the boost factor, the more relevant the term will be.

+

Boosting allows you to control the relevance of a document by boosting its terms. For example, if you are searching for

+ + IBM Microsoft +

and you want the term "IBM" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

+ IBM^4 Microsoft +

This will make documents with the term IBM appear more relevant. You can also boost Phrase Terms as in the example:

+ + "Microsoft Word"^4 "Microsoft Excel" +

By default, the boost factor is 1.

+
+
+ +
+

Boolean operators allow terms to be combined through logic operators. + Lucene supports AND, "+", OR, NOT and "-" as Boolean operators. (Note: Boolean operators must be ALL CAPS.)

+ + +

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. + The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. + For example, to search for documents that contain either "Microsoft Word" or just "Microsoft":

+ + "Microsoft Word" Microsoft + +

or

+ + "Microsoft Word" OR Microsoft + +
+ +

The AND operator matches documents where both terms exist anywhere in the text of a single document. + This is equivalent to an intersection using sets. + For example, to search for documents that contain "Microsoft Word" and "Microsoft Excel":

+ + "Microsoft Word" AND "Microsoft Excel" +
+ + +

The "+" or required operator requires that the term after the "+" symbol exist somewhere in a field of a single document. For example, to search for documents that must contain jakarta and may contain apache:

+ + +jakarta apache +
+ + +

The NOT operator excludes documents that contain the term after NOT. + This is equivalent to a difference using sets. + For example, to search for documents that contain "Microsoft Word" but not "Microsoft Excel":

+ + "Microsoft Word" NOT "Microsoft Excel" +
+ + +

The "-" or prohibit operator excludes documents that contain the term after the "-" symbol. For example, to search for documents that contain "Microsoft Word" but not "Microsoft Excel":

+ + "Microsoft Word" -"Microsoft Excel" +
+ +
+ +
+

Lucene supports using parentheses to group clauses to form subqueries. This can be very useful if you want to control the Boolean logic for a query. + For example, to search for either "jakarta" or "apache" and "website":

+ (jakarta OR apache) AND website +

This eliminates any confusion and makes sure that website must exist and that at least one of the terms jakarta or apache must also exist.

+
+ + +