LuceneCoreQuery.dtd

Core Lucene

Background

This DTD describes the XML syntax used to perform advanced searches using the core Lucene search engine. The motivation behind the XML query syntax is:
  1. To open up Lucene functionality to clients other than Java
  2. To offer a form of expressing queries that can easily be
  3. To provide a shorthand way of expressing query logic which echoes the logical tree structure of query objects more closely than reading procedural Java query construction code
  4. To bridge the growing gap between Lucene query/filtering functionality and the set of functionality accessible through the standard Lucene QueryParser syntax
  5. To provide a simply extensible syntax that does not require complex parser skills such as knowledge of JavaCC syntax

Syntax overview

Search syntax consists of two types of elements:

Queries

The root of any XML search must be a Query type element used to select content. Queries typically score matches on documents using a number of different factors in order to provide relevant results first. One common example of a query tag is the UserQuery element which uses the standard Lucene QueryParser to parse Google-style search syntax provided by end users.

Filters

Unlike Queries, Filters are not used to select or score content - they are simply used to filter Query output (see FilteredQuery for an example use of query filtering). Because Filters simply offer a yes/no decision for each document in the index, their output can be efficiently cached in memory as a BitSet for subsequent reuse (see the CachedFilter tag).

Nesting elements

Many of the elements can nest other elements to produce queries/filters of arbitrary depth and complexity. The BooleanQuery element is one such example: it provides a means of combining other queries (including other BooleanQuery elements) using Boolean logic to determine mandatory or optional elements.
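
Example: A sketch of nesting, using only elements defined in this DTD. The field name "contents" and the search terms are illustrative assumptions. The outer BooleanQuery requires "bank" and also requires the nested BooleanQuery, which is intended to match on either of its optional terms:

```xml
<BooleanQuery fieldName="contents">
   <Clause occurs="must">
      <TermQuery>bank</TermQuery>
   </Clause>
   <Clause occurs="must">
      <!-- nested BooleanQuery: either optional term may satisfy this clause -->
      <BooleanQuery>
         <Clause occurs="should">
            <TermQuery>merger</TermQuery>
         </Clause>
         <Clause occurs="should">
            <TermQuery>takeover</TermQuery>
         </Clause>
      </BooleanQuery>
   </Clause>
</BooleanQuery>
```

Neither the inner BooleanQuery nor the TermQuery elements define a fieldName here, so they would be expected to pick it up from the outermost element.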

Advanced topics

Advanced positional testing - span queries

The SpanQuery family of queries allows complex positional tests, which look not only for certain combinations of words but also for particular positions of those words in relation to each other and to the documents containing them.

CoreParser.java is the Java class that encapsulates this parser behaviour.



<BooleanQuery> Child of Query, Clause, CachedFilter

A BooleanQuery implements Boolean logic which controls how multiple Clauses should be interpreted. Some clauses may represent optional Query criteria while others represent mandatory criteria.

Example: Find articles about banks, preferably talking about mergers but nothing to do with "sumitomo"

	          
            <BooleanQuery fieldName="contents">
	             <Clause occurs="should">
		              <TermQuery>merger</TermQuery>
	             </Clause>
	             <Clause occurs="mustnot">
		              <TermQuery>sumitomo</TermQuery>
	             </Clause>
	             <Clause occurs="must">
		              <TermQuery>bank</TermQuery>
	             </Clause>
            </BooleanQuery>

	         

<BooleanQuery>'s children
Name | Cardinality
Clause | At least one

<BooleanQuery>'s attributes
Name | Values | Default
boost | | 1.0
disableCoord | true, false | false
fieldName | |
minimumNumberShouldMatch | | 0
Element's model:

(Clause)+


@boost Attribute of BooleanQuery

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0


@fieldName Attribute of BooleanQuery

fieldName can optionally be defined here as a default attribute used by all child elements


@disableCoord Attribute of BooleanQuery

The "Coordination factor" rewards documents that contain more of the optional clauses in this list. This flag can be used to turn off this factor.

Possible values: true, false - Default value: false


@minimumNumberShouldMatch Attribute of BooleanQuery

The minimum number of optional clauses that should be present in any one document before it is considered to be a match.

Default value: 0
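
Example: A sketch of minimumNumberShouldMatch with three optional clauses. The field name "contents" and the terms are illustrative assumptions. With a value of 2, a document would need to contain at least two of the three optional terms to be considered a match:

```xml
<BooleanQuery fieldName="contents" minimumNumberShouldMatch="2">
   <Clause occurs="should">
      <TermQuery>merger</TermQuery>
   </Clause>
   <Clause occurs="should">
      <TermQuery>acquisition</TermQuery>
   </Clause>
   <Clause occurs="should">
      <TermQuery>takeover</TermQuery>
   </Clause>
</BooleanQuery>
```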


<Clause> Child of BooleanQuery

NOTE: The "Clause" tag has two modes of use. Inside a <BooleanQuery>, only "query" types can be child elements; inside a <BooleanFilter>, only "filter" types can be contained.

<Clause>'s children
Name | Cardinality
BooleanQuery | One or none
CachedFilter | One or none
ConstantScoreQuery | One or none
FilteredQuery | One or none
MatchAllDocsQuery | One or none
RangeFilter | One or none
SpanFirst | One or none
SpanNear | One or none
SpanNot | One or none
SpanOr | One or none
SpanOrTerms | One or none
SpanTerm | One or none
TermQuery | One or none
TermsQuery | One or none
UserQuery | One or none

<Clause>'s attributes
Name | Values | Default
occurs | should, must, mustnot | should
Element's model:

(BooleanQuery | UserQuery | FilteredQuery | TermQuery | TermsQuery | MatchAllDocsQuery | ConstantScoreQuery | SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm | RangeFilter | CachedFilter)


@occurs Attribute of Clause

Controls whether the clause is optional (should), mandatory (must) or unacceptable (mustnot)

Possible values: should, must, mustnot - Default value: should


<CachedFilter> Child of ConstantScoreQuery, Clause, Filter

Caches any nested query or filter in an LRU (least recently used) cache. Cached queries, like filters, are turned into BitSets at a cost of 1 bit per document in the index. The memory cost of a cached query/filter is therefore numberOfDocsInIndex/8 bytes. Queries that are cached as filters retain none of the scoring information associated with results - they retain just a Boolean yes/no record of which documents matched.

Example: Search for documents about banks from the last 10 years - caching the commonly-used "last 10 years" filter as a BitSet in RAM to eliminate the cost of building this filter from disk for every query

	          
            <FilteredQuery>
               <Query>
                  <UserQuery>bank</UserQuery>
               </Query>	
               <Filter>
                  <CachedFilter>
                     <RangeFilter fieldName="date" lowerTerm="19970101" upperTerm="20070101"/>
                  </CachedFilter>
               </Filter>	
            </FilteredQuery>
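
Example: A sketch of caching a query rather than a filter, which the content model of CachedFilter permits. The field name "status" and its value are illustrative assumptions. The cached TermQuery keeps no scores, only a yes/no record per document:

```xml
<FilteredQuery>
   <Query>
      <UserQuery>merger</UserQuery>
   </Query>
   <Filter>
      <!-- the nested query is cached as a BitSet; its scoring is discarded -->
      <CachedFilter>
         <TermQuery fieldName="status">published</TermQuery>
      </CachedFilter>
   </Filter>
</FilteredQuery>
```

By the 1-bit-per-document cost described for this element, an index of 8,000,000 documents would need about 1,000,000 bytes for each cached entry of this kind.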
	         

<CachedFilter>'s children
Name | Cardinality
BooleanQuery | One or none
CachedFilter | One or none
ConstantScoreQuery | One or none
FilteredQuery | One or none
MatchAllDocsQuery | One or none
RangeFilter | One or none
SpanFirst | One or none
SpanNear | One or none
SpanNot | One or none
SpanOr | One or none
SpanOrTerms | One or none
SpanTerm | One or none
TermQuery | One or none
TermsQuery | One or none
UserQuery | One or none
Element's model:

(BooleanQuery | UserQuery | FilteredQuery | TermQuery | TermsQuery | MatchAllDocsQuery | ConstantScoreQuery | SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm | RangeFilter | CachedFilter)


<UserQuery> Child of Query, Clause, CachedFilter

Passes content directly through to the standard Lucene QueryParser; see "Lucene Query Syntax".

Example: Search for documents about John Smith or John Doe using standard Lucene query syntax

	          
               <UserQuery>"John Smith" OR "John Doe"</UserQuery>
	         

<UserQuery>'s attributes
Name | Values | Default
boost | | 1.0

@boost Attribute of UserQuery

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0


<MatchAllDocsQuery/> Child of Query, Clause, CachedFilter

A query which is used to match all documents. This has a couple of uses:

  1. as a Clause in a BooleanQuery whose only other clause is a "mustnot" match (Lucene requires at least one positive clause), and
  2. in a FilteredQuery where a Filter tag is effectively being used to select content rather than its usual role of filtering the results of a query.

Example: Effectively use a Filter as a query

	          
               <FilteredQuery>
                 <Query>
                    <MatchAllDocsQuery/>
                 </Query>
                 <Filter>
                     <RangeFilter fieldName="date" lowerTerm="19870409" upperTerm="19870412"/>
                 </Filter>	
               </FilteredQuery>	         
	       

This element is always empty.
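
Example: A sketch of the first use listed above, where MatchAllDocsQuery supplies the positive clause for an otherwise purely negative BooleanQuery. The field name "contents" and the term are illustrative assumptions:

```xml
<BooleanQuery fieldName="contents">
   <!-- positive clause: every document is a candidate -->
   <Clause occurs="must">
      <MatchAllDocsQuery/>
   </Clause>
   <!-- negative clause: drop any document containing this term -->
   <Clause occurs="mustnot">
      <TermQuery>sumitomo</TermQuery>
   </Clause>
</BooleanQuery>
```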


<TermQuery> Child of Query, Clause, CachedFilter

A single term query - no analysis is done on the child text

Example: Match on a primary key

	          
               <TermQuery fieldName="primaryKey">13424</TermQuery>
	       

<TermQuery>'s attributes
Name | Values | Default
boost | | 1.0
fieldName | |

@boost Attribute of TermQuery

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0


@fieldName Attribute of TermQuery

fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute
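
Example: A sketch of how fieldName resolution is described as working. The field names "contents" and "title" are illustrative assumptions. The first TermQuery defines no fieldName and would take "contents" from its nearest ancestor that defines one, while the second sets its own:

```xml
<BooleanQuery fieldName="contents">
   <Clause occurs="must">
      <!-- no fieldName here: inherited from the enclosing BooleanQuery -->
      <TermQuery>bank</TermQuery>
   </Clause>
   <Clause occurs="should">
      <!-- fieldName defined locally for this term only -->
      <TermQuery fieldName="title">merger</TermQuery>
   </Clause>
</BooleanQuery>
```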


<TermsQuery> Child of Query, Clause, CachedFilter

The equivalent of a BooleanQuery with multiple optional TermQuery clauses. Child text is analyzed using a field-specific choice of Analyzer to produce a set of terms that are ORed together in Boolean logic. Unlike the UserQuery element, this does not parse any special characters to control fuzzy/phrase/boolean logic and as such is incapable of producing a Query parse error given any user input.

Example: Match on text from a database description (which may contain characters that are illegal in the standard Lucene Query syntax used in the UserQuery tag)

	          
               <TermsQuery fieldName="description">Smith &amp; Sons (Ltd) : incorporated 1982</TermsQuery>
	       

<TermsQuery>'s attributes
Name | Values | Default
boost | | 1.0
disableCoord | true, false | false
fieldName | |
minimumNumberShouldMatch | | 0

@boost Attribute of TermsQuery

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0


@fieldName Attribute of TermsQuery

fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute


@disableCoord Attribute of TermsQuery

The "Coordination factor" rewards documents that contain more of the terms in this list. This flag can be used to turn off this factor.

Possible values: true, false - Default value: false


@minimumNumberShouldMatch Attribute of TermsQuery

The minimum number of terms that should be present in any one document before it is considered to be a match.

Default value: 0
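
Example: A sketch of minimumNumberShouldMatch on a TermsQuery. The field name "description" and the text are illustrative assumptions, and the exact terms produced depend on the Analyzer configured for the field. Assuming the analyzer yields four terms from this text, a document would need to contain at least two of them:

```xml
<TermsQuery fieldName="description" minimumNumberShouldMatch="2">incorporated limited company 1982</TermsQuery>
```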


<FilteredQuery> Child of Query, Clause, CachedFilter

Runs a Query and filters results to only those query matches that also match the Filter element.

Example: Find all documents about Lucene that have a status of "published"

	          
               <FilteredQuery>
                 <Query>
                    <UserQuery>Lucene</UserQuery>
                 </Query>
                 <Filter>
                     <TermsFilter fieldName="status">published</TermsFilter>
                 </Filter>	
               </FilteredQuery>	         
	       

<FilteredQuery>'s children
Name | Cardinality
Filter | Only one
Query | Only one

<FilteredQuery>'s attributes
Name | Values | Default
boost | | 1.0
Element's model:

(Query, Filter)


@boost Attribute of FilteredQuery

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0


<Query> Child of FilteredQuery

Used to identify a nested Query element inside another container element. NOT a top-level query tag

<Query>'s children
Name | Cardinality
BooleanQuery | One or none
ConstantScoreQuery | One or none
FilteredQuery | One or none
MatchAllDocsQuery | One or none
SpanFirst | One or none
SpanNear | One or none
SpanNot | One or none
SpanOr | One or none
SpanOrTerms | One or none
SpanTerm | One or none
TermQuery | One or none
TermsQuery | One or none
UserQuery | One or none
Element's model:

(BooleanQuery | UserQuery | FilteredQuery | TermQuery | TermsQuery | MatchAllDocsQuery | ConstantScoreQuery | SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm)


<Filter> Child of FilteredQuery

The choice of Filter that MUST also be matched

<Filter>'s children
Name | Cardinality
CachedFilter | One or none
RangeFilter | One or none
Element's model:

(RangeFilter | CachedFilter)


<RangeFilter/> Child of ConstantScoreQuery, Clause, CachedFilter, Filter

Filter used to limit query results to documents matching a range of field values

Example: Search for documents about banks from the last 10 years

	          
            <FilteredQuery>
               <Query>
                  <UserQuery>bank</UserQuery>
               </Query>	
               <Filter>
                     <RangeFilter fieldName="date" lowerTerm="19970101" upperTerm="20070101"/>
               </Filter>	
            </FilteredQuery>
	         

<RangeFilter>'s attributes
Name | Values | Default
fieldName | |
includeLower | true, false | true
includeUpper | true, false | true
lowerTerm | |
upperTerm | |

This element is always empty.


@fieldName Attribute of RangeFilter

fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute


@lowerTerm Attribute of RangeFilter

The lower-most term value for this field (must be <= upperTerm)

Required


@upperTerm Attribute of RangeFilter

The upper-most term value for this field (must be >= lowerTerm)

Required


@includeLower Attribute of RangeFilter

Controls if the lowerTerm in the range is part of the allowed set of values

Possible values: true, false - Default value: true


@includeUpper Attribute of RangeFilter

Controls if the upperTerm in the range is part of the allowed set of values

Possible values: true, false - Default value: true
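
Example: A sketch of a half-open range built with includeLower and includeUpper. It assumes, as the other examples in this document appear to, that the "date" field holds yyyyMMdd strings. The filter is intended to accept every date in 1997 while rejecting 19980101 itself:

```xml
<RangeFilter fieldName="date" lowerTerm="19970101" upperTerm="19980101" includeLower="true" includeUpper="false"/>
```

Excluding the upper bound lets adjacent ranges (for example one filter per year) share a boundary value without any document falling into both.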


<SpanTerm> Child of SpanNear, Include, Query, Clause, SpanOr, SpanFirst, Exclude, CachedFilter

A single term used in a SpanQuery. These clauses are the building blocks for more complex "span" queries which test word proximity

Example: Find documents using terms close to each other about mining and accidents

	      <SpanNear slop="8" inOrder="false" fieldName="text">		
			<SpanOr>
				<SpanTerm>killed</SpanTerm>
				<SpanTerm>died</SpanTerm>
				<SpanTerm>dead</SpanTerm>
			</SpanOr>
			<SpanOr>
				<SpanTerm>miner</SpanTerm>
				<SpanTerm>mining</SpanTerm>
				<SpanTerm>miners</SpanTerm>
			</SpanOr>
	      </SpanNear>
	      

<SpanTerm>'s attributes
Name | Values | Default
fieldName | |

@fieldName Attribute of SpanTerm

fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute

Required


<SpanOrTerms> Child of SpanNear, Include, Query, Clause, SpanOr, SpanFirst, Exclude, CachedFilter

A field-specific analyzer is used here to parse the child text provided in this tag. The SpanTerms produced are ORed in terms of Boolean logic

Example: Use SpanOrTerms as a more convenient/succinct way of expressing multiple choices of SpanTerms. This example looks for reports using words describing a fatality near to references to miners

	      <SpanNear slop="8" inOrder="false" fieldName="text">		
			<SpanOrTerms>killed died death dead deaths</SpanOrTerms>
			<SpanOrTerms>miner mining miners</SpanOrTerms>
	      </SpanNear>
	      

<SpanOrTerms>'s attributes
Name | Values | Default
fieldName | |

@fieldName Attribute of SpanOrTerms

fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute

Required


<SpanOr> Child of SpanNear, Include, Query, Clause, SpanFirst, Exclude, CachedFilter

Takes any number of child queries from the Span family

Example: Find documents using terms close to each other about mining and accidents

	      <SpanNear slop="8" inOrder="false" fieldName="text">		
			<SpanOr>
				<SpanTerm>killed</SpanTerm>
				<SpanTerm>died</SpanTerm>
				<SpanTerm>dead</SpanTerm>
			</SpanOr>
			<SpanOr>
				<SpanTerm>miner</SpanTerm>
				<SpanTerm>mining</SpanTerm>
				<SpanTerm>miners</SpanTerm>
			</SpanOr>
	      </SpanNear>
	      

<SpanOr>'s children
Name | Cardinality
SpanFirst | Any number
SpanNear | Any number
SpanNot | Any number
SpanOr | Any number
SpanOrTerms | Any number
SpanTerm | Any number
Element's model:

(SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm)*


<SpanNear> Child of Include, Query, Clause, SpanOr, SpanFirst, Exclude, CachedFilter

Takes any number of child queries from the Span family and tests for proximity

<SpanNear>'s children
Name | Cardinality
SpanFirst | Any number
SpanNear | Any number
SpanNot | Any number
SpanOr | Any number
SpanOrTerms | Any number
SpanTerm | Any number

<SpanNear>'s attributes
Name | Values | Default
inOrder | true, false | true
slop | |
Element's model:

(SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm)*


@slop Attribute of SpanNear

Defines the maximum distance between Span elements, where distance is expressed as a number of word positions, not a byte offset

Example: Find documents using terms within 8 words of each other talking about mining and accidents

	      <SpanNear slop="8" inOrder="false" fieldName="text">		
			<SpanOr>
				<SpanTerm>killed</SpanTerm>
				<SpanTerm>died</SpanTerm>
				<SpanTerm>dead</SpanTerm>
			</SpanOr>
			<SpanOr>
				<SpanTerm>miner</SpanTerm>
				<SpanTerm>mining</SpanTerm>
				<SpanTerm>miners</SpanTerm>
			</SpanOr>
	      </SpanNear>
	      

Required


@inOrder Attribute of SpanNear

Controls whether matching terms have to appear in the order listed or can be reversed

Possible values: true, false - Default value: true
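
Example: A sketch of an ordered, zero-slop SpanNear, which is intended to behave like an exact phrase match. The field name "text" and the terms are illustrative assumptions:

```xml
<SpanNear slop="0" inOrder="true" fieldName="text">
   <SpanTerm>social</SpanTerm>
   <SpanTerm>services</SpanTerm>
</SpanNear>
```

With inOrder="false" the same element would also be expected to accept the terms in reverse order ("services social").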


<SpanFirst> Child of SpanNear, Include, Query, Clause, SpanOr, Exclude, CachedFilter

Looks for a SpanQuery match occurring near the beginning of a document

Example: Find letters where the first 50 words talk about a resignation:

	          
	         <SpanFirst end="50">
	               <SpanOrTerms fieldName="text">resigning resign leave</SpanOrTerms>
	         </SpanFirst>
	         

<SpanFirst>'s children
Name | Cardinality
SpanFirst | One or none
SpanNear | One or none
SpanNot | One or none
SpanOr | One or none
SpanOrTerms | One or none
SpanTerm | One or none

<SpanFirst>'s attributes
Name | Values | Default
boost | | 1.0
end | |
Element's model:

(SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm)


@end Attribute of SpanFirst

Controls the end of the region considered in a document's field (expressed as a word position, not a byte offset)

Required


@boost Attribute of SpanFirst

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0


<SpanNot> Child of SpanNear, Include, Query, Clause, SpanOr, SpanFirst, Exclude, CachedFilter

Finds documents that match one SpanQuery (Include) but do not match another SpanQuery (Exclude)

Example: Find documents talking about social services but not containing the word "public"

          <SpanNot fieldName="text">
             <Include>
                <SpanNear slop="2" inOrder="true">		
                     <SpanTerm>social</SpanTerm>
                     <SpanTerm>services</SpanTerm>
                </SpanNear>				
             </Include>
             <Exclude>
                <SpanTerm>public</SpanTerm>
             </Exclude>
          </SpanNot>
	      

<SpanNot>'s children
Name | Cardinality
Exclude | Only one
Include | Only one
Element's model:

(Include, Exclude)


<Include> Child of SpanNot

The SpanQuery to find

<Include>'s children
Name | Cardinality
SpanFirst | One or none
SpanNear | One or none
SpanNot | One or none
SpanOr | One or none
SpanOrTerms | One or none
SpanTerm | One or none
Element's model:

(SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm)


<Exclude> Child of SpanNot

The SpanQuery to be avoided

<Exclude>'s children
Name | Cardinality
SpanFirst | One or none
SpanNear | One or none
SpanNot | One or none
SpanOr | One or none
SpanOrTerms | One or none
SpanTerm | One or none
Element's model:

(SpanOr | SpanNear | SpanOrTerms | SpanFirst | SpanNot | SpanTerm)


<ConstantScoreQuery> Child of Query, Clause, CachedFilter

A utility tag to wrap any filter as a query

Example: Find all documents from the last 10 years

     <ConstantScoreQuery>
           <RangeFilter fieldName="date" lowerTerm="19970101" upperTerm="20070101"/>
     </ConstantScoreQuery>	
	

<ConstantScoreQuery>'s children
Name | Cardinality
CachedFilter | Any number
RangeFilter | Any number

<ConstantScoreQuery>'s attributes
Name | Values | Default
boost | | 1.0
Element's model:

(RangeFilter | CachedFilter)*


@boost Attribute of ConstantScoreQuery

Optional boost for matches on this query. Values > 1 increase the weight of matches relative to other clauses.

Default value: 1.0
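
Example: A sketch of using ConstantScoreQuery to place a filter inside a BooleanQuery as a mandatory clause. The "date" field and its yyyyMMdd values are assumptions carried over from the other examples. Because every document passing the filter receives the same score from this clause, relative ranking would be driven by the UserQuery clause:

```xml
<BooleanQuery>
   <Clause occurs="must">
      <UserQuery>bank</UserQuery>
   </Clause>
   <Clause occurs="must">
      <!-- filter wrapped as a query; cached for reuse across searches -->
      <ConstantScoreQuery>
         <CachedFilter>
            <RangeFilter fieldName="date" lowerTerm="19970101" upperTerm="20070101"/>
         </CachedFilter>
      </ConstantScoreQuery>
   </Clause>
</BooleanQuery>
```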