2009-06-22 18:18:56 -04:00
<!--
This DTD builds on the <a href="LuceneCoreQuery.dtd.html">core Lucene XML syntax</a> and adds support for features found in the "contrib" section of the Lucene project.
CorePlusExtensionsParser.java is the Java class that encapsulates this parser behaviour.
The features added are:
<ul>
<li><a href="#LikeThisQuery">LikeThisQuery</a></li>
Support for querying using large amounts of example text indicative of the users' general area of interest
<li><a href="#FuzzyLikeThisQuery">FuzzyLikeThisQuery</a></li>
A style of fuzzy query which automatically looks for fuzzy variations on only the "interesting" terms
<li><a href="#BooleanFilter">BooleanFilter</a></li>
Is to Filters what core Lucene's BooleanQuery is to Queries - allows mixing of clauses using Boolean logic
<li><a href="#TermsFilter">TermsFilter</a></li>
Constructs a filter from an arbitrary set of terms (unlike <a href="#RangeFilter">RangeFilter</a> which requires a contiguous range of terms)
<li><a href="#DuplicateFilter">DuplicateFilter</a></li>
Removes duplicated documents from results where "duplicate" means documents share a value for a particular field (e.g. a primary key)
<li><a href="#BoostingQuery">BoostingQuery</a></li>
Influence score of a query's matches in a subtle way which can't be achieved using BooleanQuery
</ul>
@title Contrib Lucene
-->
<!-- @hidden include the core DTD -->
<!ENTITY % coreParserDTD SYSTEM "LuceneCoreQuery.dtd" >
<!-- @hidden Allow for extensions -->
<!ENTITY % extendedSpanQueries2 " " >
<!ENTITY % extendedQueries2 " " >
<!ENTITY % extendedFilters2 " " >
<!ENTITY % extendedQueries1 "|LikeThisQuery|BoostingQuery|FuzzyLikeThisQuery%extendedQueries2;%extendedSpanQueries2;" >
<!ENTITY % extendedFilters1 "|TermsFilter|BooleanFilter|DuplicateFilter%extendedFilters2;" >
%coreParserDTD;
<!--
Performs fuzzy matching on "significant" terms in fields. Improves on "LikeThisQuery" by allowing for fuzzy variations of supplied fields.
Improves on FuzzyQuery by rewarding all fuzzy variants of a term with the same IDF rather than default fuzzy behaviour which ranks rarer
variants (typically misspellings) more highly. This can be a useful default search mode for processing user input where the end user
is not expected to know about the standard query operators for fuzzy, boolean or phrase logic found in UserQuery
@example
<em>Search for information about the Sumitomo bank, where the end user has mis-spelt the name</em>
%
<FuzzyLikeThisQuery>
  <Field fieldName="contents">
    Sumitimo bank
  </Field>
</FuzzyLikeThisQuery>
%
-->
<!ELEMENT FuzzyLikeThisQuery (Field)* >
<!-- Optional boost for matches on this query. Values > 1 increase the weight of matches in the score -->
<!ATTLIST FuzzyLikeThisQuery boost CDATA "1.0" >
<!-- Limits the total number of terms selected from the provided text plus the selected "fuzzy" variants -->
<!ATTLIST FuzzyLikeThisQuery maxNumTerms CDATA "50" >
<!-- Ignore "Term Frequency" - a boost factor which rewards multiple occurrences of the same term in a document -->
<!ATTLIST FuzzyLikeThisQuery ignoreTF ( true | false ) "false" >
<!-- A field used in a FuzzyLikeThisQuery -->
<!ELEMENT Field ( #PCDATA ) >
<!-- Controls the level of similarity required for fuzzy variants where 1 is identical and 0.5 is that the variant contains
half of the original's characters in the same order. Lower values produce more results but may take longer to execute due to
additional IO required to read matching document ids -->
<!ATTLIST Field minSimilarity CDATA "0.5" >
<!-- Controls the minimum number of characters at the start of fuzzy variant words that must exactly match the original.
A value of zero will require no minimum and the search software will effectively scan ALL terms from a to z looking for variations.
This can incur high CPU overhead and a prefix length of just "1" will reduce this overhead to 1/26th of the original cost (assuming
an even distribution of letters used from the alphabet).
-->
<!ATTLIST Field prefixLength CDATA "1" >
<!-- fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute -->
<!ATTLIST Field fieldName CDATA #IMPLIED >
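<!--
A further illustration, offered as a sketch rather than a tested query: the Sumitomo search again, with the tuning attributes declared above set explicitly.
The attribute values shown are illustrative assumptions, not recommended settings.
%
<FuzzyLikeThisQuery maxNumTerms="25" ignoreTF="true">
  <Field fieldName="contents" minSimilarity="0.7" prefixLength="2">
    Sumitimo bank
  </Field>
</FuzzyLikeThisQuery>
%
Raising prefixLength and minSimilarity narrows the set of candidate variants, trading recall for lower term-scanning cost.
-->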
<!--
Cherry-picks "significant" terms from the example child text and queries using these words. By only using significant (read: rare) terms the
performance cost of the query is substantially reduced and large bodies of text can be used as example content.
@example
<em>Use a block of text as an example of the type of content to be found, ignoring the "Reuters" word which
appears commonly in the index.</em>
%
<LikeThisQuery percentTermsToMatch="5" stopWords="Reuters">
  IRAQI TROOPS REPORTED PUSHING BACK IRANIANS Iraq said today its troops were pushing Iranian forces out of
  positions they had initially occupied when they launched a new offensive near the southern port of
  Basra early yesterday. A High Command communique said Iraqi troops had won a significant victory
  and were continuing to advance. Iraq said it had foiled a three-pronged thrust some 10 km
  (six miles) from Basra, but admitted the Iranians had occupied ground held by the Mohammed al-Qassem
  unit, one of three divisions attacked. The communique said Iranian Revolutionary Guards were under
  assault from warplanes, helicopter gunships, heavy artillery and tanks. "Our forces are continuing
  their advance until they purge the last foothold" occupied by the Iranians, it said.
  (Iran said its troops had killed or wounded more than 4,000 Iraqis and were stabilising their new positions.)
  The Baghdad communique said Iraqi planes also destroyed oil installations at Iran's southwestern Ahvaz field
  during a raid today. It denied an Iranian report that an Iraqi jet was shot down.
  Iraq also reported a naval battle at the northern tip of the Gulf. Iraqi naval units and forces defending an
  offshore terminal sank six Iranian out of 28 Iranian boats attempting to attack an offshore terminal,
  the communique said. Reuters
</LikeThisQuery>
%
-->
<!ELEMENT LikeThisQuery ( #PCDATA ) >
<!-- Optional boost for matches on this query. Values > 1 increase the weight of matches in the score -->
<!ATTLIST LikeThisQuery boost CDATA "1.0" >
<!-- Comma delimited list of field names -->
<!ATTLIST LikeThisQuery fieldNames CDATA #IMPLIED >
<!-- a list of stop words - analyzed to produce stop terms -->
<!ATTLIST LikeThisQuery stopWords CDATA #IMPLIED >
<!-- Controls the maximum number of words shortlisted for the query. The higher the number, the slower the response, because more disk reads are required -->
<!ATTLIST LikeThisQuery maxQueryTerms CDATA "20" >
<!-- Controls how many times a term must appear in the example text before it is shortlisted for use in the query -->
<!ATTLIST LikeThisQuery minTermFrequency CDATA "1" >
<!-- A quality control that can be used to limit the number of results to those documents matching a certain percentage of the shortlisted query terms.
Values must be between 1 and 100 -->
<!ATTLIST LikeThisQuery percentTermsToMatch CDATA "30" >
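<!--
A shorter sketch showing the remaining LikeThisQuery attributes in use. The field names "title" and "contents" and all attribute values
are assumptions chosen for illustration only.
%
<LikeThisQuery fieldNames="title,contents" maxQueryTerms="10" minTermFrequency="2" percentTermsToMatch="50">
  Iraqi troops reported pushing back Iranians. Iraq said today its troops were pushing Iranian forces
  out of positions near the southern port of Basra.
</LikeThisQuery>
%
With minTermFrequency set to 2, only words that occur at least twice in the example text are candidates for the shortlist.
-->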
<!--
Requires matches on the "Query" element and optionally boosts by any matches on the "BoostQuery".
Unlike a regular BooleanQuery the boost can be less than 1 to produce a subtractive rather than additive result
on the match score.
@example <em>Find documents about banks, preferably related to mergers, and preferably not about "World bank"</em>
%
<BoostingQuery>
  <Query>
    <BooleanQuery fieldName="contents">
      <Clause occurs="should">
        <TermQuery>merger</TermQuery>
      </Clause>
      <Clause occurs="must">
        <TermQuery>bank</TermQuery>
      </Clause>
    </BooleanQuery>
  </Query>
  <BoostQuery boost="0.01">
    <UserQuery>"world bank"</UserQuery>
  </BoostQuery>
</BoostingQuery>
%
-->
<!ELEMENT BoostingQuery ( Query , BoostQuery ) >
<!-- Optional boost for matches on this query. Values > 1 increase the weight of matches in the score -->
<!ATTLIST BoostingQuery boost CDATA "1.0" >
<!--
Child element of BoostingQuery used to contain the choice of Query which is used for boosting purposes
-->
<!ELEMENT BoostQuery ( %queries; ) >
<!-- Optional boost for matches on this query. A boost of >0 but <1
effectively demotes results from Query that match this BoostQuery.
-->
<!ATTLIST BoostQuery boost CDATA "1.0" >
<!-- Removes duplicated documents from results where "duplicate" means documents share a value for a particular field such as a primary key
@example <em>Find the latest version of each web page that mentions "Lucene"</em>
%
<FilteredQuery>
  <Query>
    <TermQuery fieldName="text">lucene</TermQuery>
  </Query>
  <Filter>
    <DuplicateFilter fieldName="url" keepMode="last"/>
  </Filter>
</FilteredQuery>
%
-->
<!ELEMENT DuplicateFilter EMPTY >
<!-- fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute -->
<!ATTLIST DuplicateFilter fieldName CDATA #IMPLIED >
<!-- Determines if the first or last document occurrence is the one to return when presented with duplicated field values -->
<!ATTLIST DuplicateFilter keepMode ( first | last ) "first" >
<!-- Controls the choice of process used to produce the filter - "full" mode identifies only non-duplicate documents with the chosen field
while "fast" mode may perform faster but will also mark documents <em>without</em> the field as valid. The former approach starts by
assuming every document is a duplicate then finds the "master" documents to keep while the latter approach assumes all documents are
unique and unmarks those documents that are a copy.
-->
<!ATTLIST DuplicateFilter processingMode ( full | fast ) "full" >
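<!--
A sketch of the alternative settings, reusing the assumed "text" and "url" fields from the DuplicateFilter example: keep the first document
seen for each "url" value and use the "fast" processing mode. As described above, "fast" mode also marks documents that have no "url"
field as valid, so this variant suits indexes where every document is known to carry the field.
%
<FilteredQuery>
  <Query>
    <TermQuery fieldName="text">lucene</TermQuery>
  </Query>
  <Filter>
    <DuplicateFilter fieldName="url" keepMode="first" processingMode="fast"/>
  </Filter>
</FilteredQuery>
%
-->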
<!-- Processes child text using a field-specific choice of Analyzer to produce a set of terms that are then used as a filter.
@example <em>Find documents talking about Lucene written on a Monday or a Friday</em>
%
<FilteredQuery>
  <Query>
    <TermQuery fieldName="text">lucene</TermQuery>
  </Query>
  <Filter>
    <TermsFilter fieldName="dayOfWeek">monday friday</TermsFilter>
  </Filter>
</FilteredQuery>
%
-->
<!ELEMENT TermsFilter ( #PCDATA ) >
<!-- fieldName must be defined here or is taken from the most immediate parent XML element that defines a "fieldName" attribute -->
<!ATTLIST TermsFilter fieldName CDATA #IMPLIED >
<!--
A Filter equivalent to BooleanQuery that applies Boolean logic to Clauses containing Filters.
Unlike BooleanQuery a BooleanFilter can contain a single "mustNot" clause.
@example <em>Find documents from the first quarter of this year or last year that are not in "draft" status</em>
%
<FilteredQuery>
  <Query>
    <MatchAllDocsQuery/>
  </Query>
  <Filter>
    <BooleanFilter>
      <Clause occurs="should">
        <RangeFilter fieldName="date" lowerTerm="20070101" upperTerm="20070401"/>
      </Clause>
      <Clause occurs="should">
        <RangeFilter fieldName="date" lowerTerm="20060101" upperTerm="20060401"/>
      </Clause>
      <Clause occurs="mustNot">
        <TermsFilter fieldName="status">draft</TermsFilter>
      </Clause>
    </BooleanFilter>
  </Filter>
</FilteredQuery>
%
-->
<!ELEMENT BooleanFilter (Clause)+ >