mirror of https://github.com/apache/lucene.git
- Small documentation mods.
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@730207 13f79535-47bb-0310-9956-ffa450edef68
parent 0afd451f24
commit 72725a0b58
|
@ -26,10 +26,10 @@ import org.apache.lucene.search.ExtendedFieldCache;
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* This is a helper class to construct the trie-based index entries for numerical values.
|
* This is a helper class to construct the trie-based index entries for numerical values.
|
||||||
* <p>For more information, how the algorithm works, see the package description {@link org.apache.lucene.search.trie}. The format of how the
|
* <p>For more information on how the algorithm works, see the package description {@link org.apache.lucene.search.trie}.
|
||||||
* numerical values are stored in index is documented here:
|
* The format of how the numerical values are stored in index is documented here:
|
||||||
* <p>All numerical values are first converted to special <code>unsigned long</code>s by applying some bit-wise transformations. This means:<ul>
|
* <p>All numerical values are first converted to special <code>unsigned long</code>s by applying some bit-wise transformations. This means:<ul>
|
||||||
* <li>{@link Date}s are casted to unix timestamps (milliseconds since 1970-01-01, this is how Java represents date/time
|
* <li>{@link Date}s are casted to UNIX timestamps (milliseconds since 1970-01-01, this is how Java represents date/time
|
||||||
* internally): {@link Date#getTime()}. The resulting <code>signed long</code> is transformed to the unsigned form like so:</li>
|
* internally): {@link Date#getTime()}. The resulting <code>signed long</code> is transformed to the unsigned form like so:</li>
|
||||||
* <li><code>signed long</code>s are shifted, so that {@link Long#MIN_VALUE} is mapped to <code>0x0000000000000000</code>,
|
* <li><code>signed long</code>s are shifted, so that {@link Long#MIN_VALUE} is mapped to <code>0x0000000000000000</code>,
|
||||||
* {@link Long#MAX_VALUE} is mapped to <code>0xffffffffffffffff</code>.</li>
|
* {@link Long#MAX_VALUE} is mapped to <code>0xffffffffffffffff</code>.</li>
|
||||||
|
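Note on the hunk above: the Javadoc says that signed longs are shifted so that Long.MIN_VALUE maps to 0x0000000000000000 and Long.MAX_VALUE maps to 0xffffffffffffffff. A minimal sketch of such a mapping follows, assuming it amounts to flipping the sign bit (equivalent to subtracting Long.MIN_VALUE in two's complement). The class and method names are hypothetical and are not taken from TrieUtils, whose actual implementation may differ.

```java
import java.util.Date;

// Illustrative sketch only; not the TrieUtils implementation.
final class SignedToUnsignedSketch {

    // Hypothetical helper: flips the sign bit so that the natural order of
    // signed longs matches the unsigned order of the returned bit patterns.
    static long toSortableUnsigned(long signed) {
        return signed ^ Long.MIN_VALUE; // same bits as signed - Long.MIN_VALUE
    }

    public static void main(String[] args) {
        // The two endpoints named in the Javadoc.
        System.out.println(Long.toHexString(toSortableUnsigned(Long.MIN_VALUE))); // 0
        System.out.println(Long.toHexString(toSortableUnsigned(Long.MAX_VALUE))); // ffffffffffffffff

        // Dates go through Date#getTime() first, then the same mapping.
        long before = toSortableUnsigned(new Date(-1L).getTime());
        long after = toSortableUnsigned(new Date(1L).getTime());

        // Unsigned comparison of the mapped values preserves the signed order.
        System.out.println(Long.compareUnsigned(before, after) < 0); // true
    }
}
```

The mapping is its own inverse, so applying the same XOR again recovers the original signed value.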
@ -42,13 +42,12 @@ import org.apache.lucene.search.ExtendedFieldCache;
|
||||||
* The resulting {@link String} is comparable like the corresponding <code>unsigned long</code>.
|
* The resulting {@link String} is comparable like the corresponding <code>unsigned long</code>.
|
||||||
* <p>To store the different precisions of the long values (from one character [only the most significant one] to the full encoded length),
|
* <p>To store the different precisions of the long values (from one character [only the most significant one] to the full encoded length),
|
||||||
* each lower precision is prefixed by the length ({@link #TRIE_CODED_PADDING_START}<code>+precision == 0x20+precision</code>),
|
* each lower precision is prefixed by the length ({@link #TRIE_CODED_PADDING_START}<code>+precision == 0x20+precision</code>),
|
||||||
* in an extra "helper" field with a suffixed field name (i.e. fieldname "numeric" => lower precision's name "numeric#trie").
|
* in an extra "helper" field with a suffixed field name (i.e. fieldname "numeric" => lower precision's name "numeric#trie").
|
||||||
* The full long is not prefixed at all and indexed and stored according to the given flags in the original field name.
|
* The full long is not prefixed at all and indexed and stored according to the given flags in the original field name.
|
||||||
* By this it is possible to get the correct enumeration of terms in correct precision
|
* By this it is possible to get the correct enumeration of terms in correct precision
|
||||||
* of the term list by just jumping to the correct fieldname and/or prefix. The full precision value may also be
|
* of the term list by just jumping to the correct fieldname and/or prefix. The full precision value may also be
|
||||||
* stored in the document. Having the full precision value as term in a separate field with the original name,
|
* stored in the document. Having the full precision value as term in a separate field with the original name,
|
||||||
* sorting of query results agains such fields is possible using the original field name.
|
* sorting of query results agains such fields is possible using the original field name.
|
||||||
* @author Uwe Schindler (panFMP developer)
|
|
||||||
*/
|
*/
|
||||||
public final class TrieUtils {
|
public final class TrieUtils {
|
||||||
|
|
||||||
|
|
|
@ -16,52 +16,48 @@ We have developed an extension to Apache Lucene that stores
|
||||||
the numerical values in a special string-encoded format with variable precision
|
the numerical values in a special string-encoded format with variable precision
|
||||||
(all numerical values like doubles, longs, and timestamps are converted to lexicographic sortable string representations
|
(all numerical values like doubles, longs, and timestamps are converted to lexicographic sortable string representations
|
||||||
and stored with different precisions from one byte to the full 8 bytes - depending on the variant used).
|
and stored with different precisions from one byte to the full 8 bytes - depending on the variant used).
|
||||||
For a more detailed description, how the values are stored, see {@link org.apache.lucene.search.trie.TrieUtils}.
|
For a more detailed description of how the values are stored, see {@link org.apache.lucene.search.trie.TrieUtils}.
|
||||||
A range is then divided recursively into multiple intervals for searching:
|
A range is then divided recursively into multiple intervals for searching:
|
||||||
The center of the range is searched only with the lowest possible precision in the trie, the boundaries are matched
|
The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched
|
||||||
more exactly. This reduces the number of terms dramatically.</p>
|
more exactly. This reduces the number of terms dramatically.</p>
|
||||||
|
|
||||||
<p>For the variant, that uses a lowest precision of 1-byte the index only
|
<p>For the variant that uses a lowest precision of 1-byte the index
|
||||||
contains a maximum of 256 distinct values in the lowest precision.
|
contains only a maximum of 256 distinct values in the lowest precision.
|
||||||
Overall, a range could consist of a theoretical maximum of
|
Overall, a range could consist of a theoretical maximum of
|
||||||
<code>7*255*2 + 255 = 3825</code> distinct terms (when there is a term for every distinct value of an
|
<code>7*255*2 + 255 = 3825</code> distinct terms (when there is a term for every distinct value of an
|
||||||
8-byte-number in the index and the range covers all of them; a maximum of 255 distinct values is used
|
8-byte-number in the index and the range covers all of them; a maximum of 255 distinct values is used
|
||||||
because it would always be possible to reduce the full 256 values to one term with degraded precision).
|
because it would always be possible to reduce the full 256 values to one term with degraded precision).
|
||||||
In practise, we have seen up to 300 terms in most cases (index with 500,000 metadata records
|
In practise, we have seen up to 300 terms in most cases (index with 500,000 metadata records
|
||||||
and a homogeneous dispersion of values).</p>
|
and a uniform value distribution).</p>
|
||||||
|
|
||||||
<p>There are two other variants of encoding: 4bit and 2bit. Each variant stores more different precisions
|
<p>There are two other variants of encoding: 4bit and 2bit. Each variant stores more different precisions
|
||||||
of the longs and so needs more storage space (because it generates more and longer terms -
|
of the longs and thus needs more storage space (because it generates more and longer terms -
|
||||||
4bit: two times the length and number of terms; 2bit: four times the length and number of terms).
|
4bit: two times the length and number of terms; 2bit: four times the length and number of terms).
|
||||||
But on the other hand, the maximum number of distinct terms used for range queries is
|
But on the other hand, the maximum number of distinct terms used for range queries is
|
||||||
<code>15*15*2 + 15 = 465</code> for the 4bit variant, and
|
<code>15*15*2 + 15 = 465</code> for the 4bit variant, and
|
||||||
<code>31*3*2 + 3 = 189</code> for the 2bit variant.</p>
|
<code>31*3*2 + 3 = 189</code> for the 2bit variant.</p>
|
||||||
|
|
||||||
<p>This dramatically improves the performance of Apache Lucene with range queries, which
|
<p>This dramatically improves the performance of Apache Lucene with range queries, which
|
||||||
is no longer dependent on the index size and number of distinct values because there is
|
are no longer dependent on the index size and the number of distinct values because there is
|
||||||
an upper limit not related to any of these properties.</p>
|
an upper limit unrelated to either of these properties.</p>
|
||||||
|
|
||||||
<h3>Usage</h3>
|
<h3>Usage</h3>
|
||||||
<p>To use the new query types the numerical values, which may be <code>long</code>, <code>double</code> or <code>Date</code>,
|
<p>To use the new query types the numerical values, which may be <code>long</code>, <code>double</code> or <code>Date</code>,
|
||||||
during indexing the values must be stored in a special format in index (using {@link org.apache.lucene.search.trie.TrieUtils}).
|
the values must be stored during indexing in a special format in the index (using {@link org.apache.lucene.search.trie.TrieUtils}).
|
||||||
This can be done like this:</p>
|
This can be done like this:</p>
|
||||||
|
|
||||||
<pre>
|
<pre>
|
||||||
Document doc = new Document();
|
Document doc = new Document();
|
||||||
// add some standard fields:
|
// add some standard fields:
|
||||||
String svalue = "anything to index";
|
String svalue = "anything to index";
|
||||||
doc.add(new Field("exampleString",
|
doc.add(new Field("exampleString", svalue, Field.Store.YES, Field.Index.ANALYZED) ;
|
||||||
svalue, Field.Store.YES, Field.Index.ANALYZED) ;
|
|
||||||
// add some numerical fields:
|
// add some numerical fields:
|
||||||
double fvalue = 1.057E17;
|
double fvalue = 1.057E17;
|
||||||
TrieUtils.VARIANT_8BIT.addDoubleTrieCodedDocumentField(doc, "exampleDouble",
|
TrieUtils.VARIANT_8BIT.addDoubleTrieCodedDocumentField(doc, "exampleDouble", fvalue, true /* index the field */, Field.Store.YES);
|
||||||
fvalue, true /* index the field */, Field.Store.YES);
|
|
||||||
long lvalue = 121345L;
|
long lvalue = 121345L;
|
||||||
TrieUtils.VARIANT_8BIT.addLongTrieCodedDocumentField(doc, "exampleLong",
|
TrieUtils.VARIANT_8BIT.addLongTrieCodedDocumentField(doc, "exampleLong", lvalue, true /* index the field */, Field.Store.YES);
|
||||||
lvalue, true /* index the field */, Field.Store.YES);
|
|
||||||
Date dvalue = new Date(); // actual time
|
Date dvalue = new Date(); // actual time
|
||||||
TrieUtils.VARIANT_8BIT.addDateTrieCodedDocumentField(doc, "exampleDate",
|
TrieUtils.VARIANT_8BIT.addDateTrieCodedDocumentField(doc, "exampleDate", dvalue, true /* index the field */, Field.Store.YES);
|
||||||
dvalue, true /* index the field */, Field.Store.YES);
|
|
||||||
// add document to IndexWriter
|
// add document to IndexWriter
|
||||||
</pre>
|
</pre>
|
||||||
|
|
||||||
|
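Note on the hunk above: the package description quotes three upper bounds on the number of distinct terms per range query (3825 for 8bit, 465 for 4bit, 189 for 2bit). The sketch below reproduces that arithmetic. The general formula is inferred from the three quoted expressions and is an assumption, not something stated in the source. The class and method names are hypothetical.

```java
// Illustrative sketch only; reproduces the arithmetic quoted in package.html.
final class TrieTermBoundSketch {

    // Hypothetical helper: bitsPerStep is 8, 4 or 2 for the three variants.
    static int maxRangeTerms(int bitsPerStep) {
        if (bitsPerStep <= 0 || bitsPerStep > 16 || 64 % bitsPerStep != 0) {
            throw new IllegalArgumentException("bitsPerStep must divide 64 and be at most 16");
        }
        int levels = 64 / bitsPerStep;           // number of stored precisions of a 64-bit value
        int perLevel = (1 << bitsPerStep) - 1;   // 255, 15 or 3: a full block collapses into one coarser term
        // All but one level contribute terms on both range boundaries; one level contributes once.
        return (levels - 1) * perLevel * 2 + perLevel;
    }

    public static void main(String[] args) {
        System.out.println(maxRangeTerms(8)); // 7*255*2 + 255 = 3825
        System.out.println(maxRangeTerms(4)); // 15*15*2 + 15 = 465
        System.out.println(maxRangeTerms(2)); // 31*3*2 + 3 = 189
    }
}
```

The bound depends only on the variant, not on the index size, which is the property the package description relies on when it says range query cost has an upper limit.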
@ -83,7 +79,7 @@ This can be done like this:</p>
|
||||||
|
|
||||||
<h3>Performance</h3>
|
<h3>Performance</h3>
|
||||||
|
|
||||||
<p>Comparisions of the different types of RangeQueries on an index with about 500,000 docs showed,
|
<p>Comparisions of the different types of RangeQueries on an index with about 500,000 docs showed
|
||||||
that the old {@link org.apache.lucene.search.RangeQuery} (with raised
|
that the old {@link org.apache.lucene.search.RangeQuery} (with raised
|
||||||
{@link org.apache.lucene.search.BooleanQuery} clause count) took about 30-40 secs to complete,
|
{@link org.apache.lucene.search.BooleanQuery} clause count) took about 30-40 secs to complete,
|
||||||
{@link org.apache.lucene.search.ConstantScoreRangeQuery} took 5 secs and
|
{@link org.apache.lucene.search.ConstantScoreRangeQuery} took 5 secs and
|
||||||
|
|