parent
be9c37fc76
commit
671a209ed9
@@ -30,7 +30,7 @@ occurring in a document is low. At the same time, as
 internally each shingle is hashed into to 128-bit hash, you should choose
 `k` small enough so that all possible
 different k-words shingles can be hashed to 128-bit hash with
-minimal collision. 5-word shingles typically work well.
+minimal collision.
 
 * choosing the right settings for `hash_count`, `bucket_count` and
 `hash_set_size` needs some experimentation.
@@ -39,7 +39,7 @@ minimal collision. 5-word shingles typically work well.
 will provide a higher guarantee that different tokens are
 indexed to different buckets.
 ** to improve the recall,
-you should increase `hash_token` parameter. For example,
+you should increase `hash_count` parameter. For example,
 setting `hash_count=2`, will make each token to be hashed in
 two different ways, thus increasing the number of potential
 candidates for search.
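
The tuning advice in the changed passage can be illustrated with a minimal sketch. It is not the implementation the documented filter uses; it only assumes the behaviour the text describes: `k`-word shingles are hashed to 128 bits, each shingle is hashed `hash_count` different ways, hashes are spread over `bucket_count` buckets, and at most `hash_set_size` of the smallest hashes are kept per bucket. The function names (`shingles`, `hash128`, `min_hash_signature`), the use of BLAKE2b as the hash, and the bucketing by hash range are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the list of k-word shingles of `text` (whitespace tokenised)."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def hash128(token, seed):
    """Hash `token` to a 128-bit integer; `seed` selects the hash function.

    Assumption: BLAKE2b with a per-seed salt stands in for whatever
    128-bit hash the real filter uses.
    """
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=16,                      # 16 bytes = 128 bits
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def min_hash_signature(text, k=5, hash_count=1, bucket_count=512,
                       hash_set_size=1):
    """Map (seed, bucket) to the `hash_set_size` smallest hashes seen there."""
    bucket_width = (1 << 128) // bucket_count
    buckets = {}
    for seed in range(hash_count):           # hash every shingle hash_count ways
        for shingle in shingles(text, k):
            h = hash128(shingle, seed)
            # Clamp so the top of the range falls in the last bucket.
            b = min(h // bucket_width, bucket_count - 1)
            buckets.setdefault((seed, b), set()).add(h)
    return {key: sorted(vals)[:hash_set_size] for key, vals in buckets.items()}


doc = "the quick brown fox jumps over the lazy dog near the river bank"
one_way = min_hash_signature(doc, hash_count=1)
two_ways = min_hash_signature(doc, hash_count=2)
```

In this sketch, `two_ways` contains every entry of `one_way` plus the entries produced by the second hash function, which is the sense in which a larger `hash_count` yields more potential candidates at search time.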