mirror of https://github.com/apache/lucene.git
- Small documentation mods.
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@730207 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
0afd451f24
commit
72725a0b58
|
@ -25,11 +25,11 @@ import org.apache.lucene.search.SortField;
|
|||
import org.apache.lucene.search.ExtendedFieldCache;
|
||||
|
||||
/**
|
||||
*This is a helper class to construct the trie-based index entries for numerical values.
|
||||
* <p>For more information, how the algorithm works, see the package description {@link org.apache.lucene.search.trie}. The format of how the
|
||||
* numerical values are stored in index is documented here:
|
||||
* This is a helper class to construct the trie-based index entries for numerical values.
|
||||
* <p>For more information on how the algorithm works, see the package description {@link org.apache.lucene.search.trie}.
|
||||
* The format of how the numerical values are stored in index is documented here:
|
||||
* <p>All numerical values are first converted to special <code>unsigned long</code>s by applying some bit-wise transformations. This means:<ul>
|
||||
* <li>{@link Date}s are casted to unix timestamps (milliseconds since 1970-01-01, this is how Java represents date/time
|
||||
* <li>{@link Date}s are casted to UNIX timestamps (milliseconds since 1970-01-01, this is how Java represents date/time
|
||||
* internally): {@link Date#getTime()}. The resulting <code>signed long</code> is transformed to the unsigned form like so:</li>
|
||||
* <li><code>signed long</code>s are shifted, so that {@link Long#MIN_VALUE} is mapped to <code>0x0000000000000000</code>,
|
||||
* {@link Long#MAX_VALUE} is mapped to <code>0xffffffffffffffff</code>.</li>
|
||||
|
@ -42,13 +42,12 @@ import org.apache.lucene.search.ExtendedFieldCache;
|
|||
* The resulting {@link String} is comparable like the corresponding <code>unsigned long</code>.
|
||||
* <p>To store the different precisions of the long values (from one character [only the most significant one] to the full encoded length),
|
||||
* each lower precision is prefixed by the length ({@link #TRIE_CODED_PADDING_START}<code>+precision == 0x20+precision</code>),
|
||||
* in an extra "helper" field with a suffixed field name (i.e. fieldname "numeric" => lower precision's name "numeric#trie").
|
||||
* in an extra "helper" field with a suffixed field name (i.e. fieldname "numeric" => lower precision's name "numeric#trie").
|
||||
* The full long is not prefixed at all and indexed and stored according to the given flags in the original field name.
|
||||
* By this it is possible to get the correct enumeration of terms in correct precision
|
||||
* of the term list by just jumping to the correct fieldname and/or prefix. The full precision value may also be
|
||||
* stored in the document. Having the full precision value as term in a separate field with the original name,
|
||||
* sorting of query results agains such fields is possible using the original field name.
|
||||
* @author Uwe Schindler (panFMP developer)
|
||||
*/
|
||||
public final class TrieUtils {
|
||||
|
||||
|
|
|
@ -16,52 +16,48 @@ We have developed an extension to Apache Lucene that stores
|
|||
the numerical values in a special string-encoded format with variable precision
|
||||
(all numerical values like doubles, longs, and timestamps are converted to lexicographic sortable string representations
|
||||
and stored with different precisions from one byte to the full 8 bytes - depending on the variant used).
|
||||
For a more detailed description, how the values are stored, see {@link org.apache.lucene.search.trie.TrieUtils}.
|
||||
For a more detailed description of how the values are stored, see {@link org.apache.lucene.search.trie.TrieUtils}.
|
||||
A range is then divided recursively into multiple intervals for searching:
|
||||
The center of the range is searched only with the lowest possible precision in the trie, the boundaries are matched
|
||||
The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched
|
||||
more exactly. This reduces the number of terms dramatically.</p>
|
||||
|
||||
<p>For the variant, that uses a lowest precision of 1-byte the index only
|
||||
contains a maximum of 256 distinct values in the lowest precision.
|
||||
<p>For the variant that uses a lowest precision of 1-byte the index
|
||||
contains only a maximum of 256 distinct values in the lowest precision.
|
||||
Overall, a range could consist of a theoretical maximum of
|
||||
<code>7*255*2 + 255 = 3825</code> distinct terms (when there is a term for every distinct value of an
|
||||
8-byte-number in the index and the range covers all of them; a maximum of 255 distinct values is used
|
||||
because it would always be possible to reduce the full 256 values to one term with degraded precision).
|
||||
In practise, we have seen up to 300 terms in most cases (index with 500,000 metadata records
|
||||
and a homogeneous dispersion of values).</p>
|
||||
and a uniform value distribution).</p>
|
||||
|
||||
<p>There are two other variants of encoding: 4bit and 2bit. Each variant stores more different precisions
|
||||
of the longs and so needs more storage space (because it generates more and longer terms -
|
||||
of the longs and thus needs more storage space (because it generates more and longer terms -
|
||||
4bit: two times the length and number of terms; 2bit: four times the length and number of terms).
|
||||
But on the other hand, the maximum number of distinct terms used for range queries is
|
||||
<code>15*15*2 + 15 = 465</code> for the 4bit variant, and
|
||||
<code>31*3*2 + 3 = 189</code> for the 2bit variant.</p>
|
||||
|
||||
<p>This dramatically improves the performance of Apache Lucene with range queries, which
|
||||
is no longer dependent on the index size and number of distinct values because there is
|
||||
an upper limit not related to any of these properties.</p>
|
||||
are no longer dependent on the index size and the number of distinct values because there is
|
||||
an upper limit unrelated to either of these properties.</p>
|
||||
|
||||
<h3>Usage</h3>
|
||||
<p>To use the new query types the numerical values, which may be <code>long</code>, <code>double</code> or <code>Date</code>,
|
||||
during indexing the values must be stored in a special format in index (using {@link org.apache.lucene.search.trie.TrieUtils}).
|
||||
the values must be stored during indexing in a special format in the index (using {@link org.apache.lucene.search.trie.TrieUtils}).
|
||||
This can be done like this:</p>
|
||||
|
||||
<pre>
|
||||
Document doc=new Document();
|
||||
Document doc = new Document();
|
||||
// add some standard fields:
|
||||
String svalue="anything to index";
|
||||
doc.add(new Field("exampleString",
|
||||
svalue, Field.Store.YES, Field.Index.ANALYZED) ;
|
||||
String svalue = "anything to index";
|
||||
doc.add(new Field("exampleString", svalue, Field.Store.YES, Field.Index.ANALYZED) ;
|
||||
// add some numerical fields:
|
||||
double fvalue=1.057E17;
|
||||
TrieUtils.VARIANT_8BIT.addDoubleTrieCodedDocumentField(doc, "exampleDouble",
|
||||
fvalue, true /* index the field */, Field.Store.YES);
|
||||
long lvalue=121345L;
|
||||
TrieUtils.VARIANT_8BIT.addLongTrieCodedDocumentField(doc, "exampleLong",
|
||||
lvalue, true /* index the field */, Field.Store.YES);
|
||||
Date dvalue=new Date(); // actual time
|
||||
TrieUtils.VARIANT_8BIT.addDateTrieCodedDocumentField(doc, "exampleDate",
|
||||
dvalue, true /* index the field */, Field.Store.YES);
|
||||
double fvalue = 1.057E17;
|
||||
TrieUtils.VARIANT_8BIT.addDoubleTrieCodedDocumentField(doc, "exampleDouble", fvalue, true /* index the field */, Field.Store.YES);
|
||||
long lvalue = 121345L;
|
||||
TrieUtils.VARIANT_8BIT.addLongTrieCodedDocumentField(doc, "exampleLong", lvalue, true /* index the field */, Field.Store.YES);
|
||||
Date dvalue = new Date(); // actual time
|
||||
TrieUtils.VARIANT_8BIT.addDateTrieCodedDocumentField(doc, "exampleDate", dvalue, true /* index the field */, Field.Store.YES);
|
||||
// add document to IndexWriter
|
||||
</pre>
|
||||
|
||||
|
@ -69,21 +65,21 @@ This can be done like this:</p>
|
|||
|
||||
<pre>
|
||||
// Java 1.4, because Double.valueOf(double) is not available:
|
||||
Query q=new TrieRangeQuery("exampleDouble", new Double(1.0E17), new Double(2.0E17), TrieUtils.VARIANT_8BIT);
|
||||
Query q = new TrieRangeQuery("exampleDouble", new Double(1.0E17), new Double(2.0E17), TrieUtils.VARIANT_8BIT);
|
||||
// OR, Java 1.5, using autoboxing:
|
||||
Query q=new TrieRangeQuery("exampleDouble", 1.0E17, 2.0E17, TrieUtils.VARIANT_8BIT);
|
||||
TopDocs docs=searcher.search(q, 10);
|
||||
for (int i=0; i<docs.scoreDocs.length; i++) {
|
||||
Document doc=searcher.doc(docs.scoreDocs[i].doc);
|
||||
Query q = new TrieRangeQuery("exampleDouble", 1.0E17, 2.0E17, TrieUtils.VARIANT_8BIT);
|
||||
TopDocs docs = searcher.search(q, 10);
|
||||
for (int i = 0; i<docs.scoreDocs.length; i++) {
|
||||
Document doc = searcher.doc(docs.scoreDocs[i].doc);
|
||||
System.out.println(doc.get("exampleString"));
|
||||
// decode the stored numerical value (important!!!):
|
||||
System.out.println( TrieUtils.VARIANT_8BIT.trieCodedToDouble(doc.get("exampleDouble")) );
|
||||
System.out.println(TrieUtils.VARIANT_8BIT.trieCodedToDouble(doc.get("exampleDouble")));
|
||||
}
|
||||
</pre>
|
||||
|
||||
<h3>Performance</h3>
|
||||
|
||||
<p>Comparisions of the different types of RangeQueries on an index with about 500,000 docs showed,
|
||||
<p>Comparisions of the different types of RangeQueries on an index with about 500,000 docs showed
|
||||
that the old {@link org.apache.lucene.search.RangeQuery} (with raised
|
||||
{@link org.apache.lucene.search.BooleanQuery} clause count) took about 30-40 secs to complete,
|
||||
{@link org.apache.lucene.search.ConstantScoreRangeQuery} took 5 secs and
|
||||
|
|
Loading…
Reference in New Issue