Changes:

* Rewrites description and adds a Lucene link
* Reformats the configurable parameters as a definition list
* Changes the `Theory` heading to `Using the min_hash token filter for similarity search`
* Adds some additional detail to the analyzer example

<titleabbrev>MinHash</titleabbrev>
++++

Uses the https://en.wikipedia.org/wiki/MinHash[MinHash] technique to produce a
signature for a token stream. You can use MinHash signatures to estimate the
similarity of documents. See <<analysis-minhash-tokenfilter-similarity-search>>.

The `min_hash` filter performs the following operations on a token stream in
order:

. Hashes each token in the stream.
. Assigns the hashes to buckets, keeping only the smallest hashes of each
bucket.
. Outputs the smallest hash from each bucket as a token stream.

This filter uses Lucene's
{lucene-analysis-docs}/minhash/MinHashFilter.html[MinHashFilter].

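For intuition, the following Python sketch walks through those three steps. It
is an illustrative assumption, not the Lucene implementation: the function
name, the MD5-based hashing, and the parameter handling are invented for the
example.

[source,python]
----
import hashlib

def min_hash_tokens(tokens, hash_count=1, bucket_count=512, hash_set_size=1):
    # Step 2 target: one list of kept hashes per bucket.
    buckets = [[] for _ in range(bucket_count)]
    for i in range(hash_count):
        for token in tokens:
            # Step 1: hash each token (one pass per hash function).
            digest = hashlib.md5(f"{i}:{token}".encode()).digest()
            h = int.from_bytes(digest[:8], "big")
            buckets[h % bucket_count].append(h)
    # Step 3: emit the smallest hash(es) from each non-empty bucket.
    return [h for bucket in buckets for h in sorted(bucket)[:hash_set_size]]

print(min_hash_tokens(["the quick brown fox", "quick brown fox jumps"]))
----
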
[[analysis-minhash-tokenfilter-configure-parms]]
==== Configurable parameters

`bucket_count`::
(Optional, integer)
Number of buckets to which hashes are assigned. Defaults to `512`.

`hash_count`::
(Optional, integer)
Number of ways to hash each token in the stream. Defaults to `1`.

`hash_set_size`::
(Optional, integer)
Number of hashes to keep from each bucket. Defaults to `1`.
+
Hashes are retained by ascending size, starting with the bucket's smallest hash
first.

`with_rotation`::
(Optional, boolean)
If `true`, the filter fills empty buckets with the value of the first non-empty
bucket to its circular right if the `hash_set_size` is `1`. If the
`bucket_count` argument is greater than `1`, this parameter defaults to `true`.
Otherwise, this parameter defaults to `false`.

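To picture the rotation rule, here is a hypothetical Python sketch. It is not
Lucene's code; it assumes empty buckets are represented as `None` and that at
least one bucket is non-empty.

[source,python]
----
def rotate_fill(buckets):
    # Each empty bucket (None) takes the value of the first non-empty
    # bucket to its circular right.
    n = len(buckets)
    filled = list(buckets)
    for i, value in enumerate(buckets):
        if value is None:
            j = (i + 1) % n
            while buckets[j] is None:
                j = (j + 1) % n
            filled[i] = buckets[j]
    return filled

print(rotate_fill([None, 7, None, None, 3]))  # [7, 7, 3, 3, 3]
----

Filling the empty buckets keeps the signature at a fixed length of
`bucket_count` values, which is one way to see why rotation only applies when
`hash_set_size` is `1`.
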
[[analysis-minhash-tokenfilter-configuration-tips]]
==== Tips for configuring the `min_hash` filter

* `min_hash` filter input tokens should typically be k-word shingles produced
by a <<analysis-shingle-tokenfilter,shingle token filter>>. You should
choose `k` large enough so that the probability of any given shingle
occurring in a document is low. At the same time, because each shingle is
internally hashed into a 128-bit hash, you should choose `k` small enough so
that all possible different k-word shingles can be hashed into 128-bit hashes
with minimal collision.

* We recommend you test different arguments for the `hash_count`, `bucket_count` and
`hash_set_size` parameters:

** To improve precision, increase the `bucket_count` or
`hash_set_size` arguments. Higher `bucket_count` and `hash_set_size` values
increase the likelihood that different tokens are indexed to different
buckets.

** To improve the recall, increase the value of the `hash_count` argument. For
example, setting `hash_count` to `2` hashes each token in two different ways,
increasing the number of potential candidates for search.

* By default, the `min_hash` filter produces 512 tokens for each document. Each
token is 16 bytes in size. This means each document's size will be increased by
around 8KB.

* The `min_hash` filter is used for Jaccard similarity. This means
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not (see the sketch after this list).

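To make the last point concrete, the following Python sketch (illustrative
only, not OpenSearch or Lucene code) computes Jaccard similarity over token
sets, where duplicate tokens collapse:

[source,python]
----
def jaccard(doc_a, doc_b):
    # Documents are compared as token *sets*, so duplicates collapse.
    a, b = set(doc_a), set(doc_b)
    return len(a & b) / len(a | b)

# "brown" appearing three times changes nothing: both sets are equal.
print(jaccard(["quick", "brown", "fox"],
              ["quick", "brown", "brown", "brown", "fox"]))  # 1.0
----
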
[[analysis-minhash-tokenfilter-similarity-search]]
==== Using the `min_hash` token filter for similarity search

The `min_hash` token filter allows you to hash documents for similarity search.

Similarity search, or nearest neighbor search, is a complex problem. A naive
solution requires an exhaustive pairwise comparison between a query document
and every document in an index. This is a prohibitive operation if the index
is large. To make it practical, a document can instead be hashed with a set of
hash functions: each hash function calculates a hash code for each of the
document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.

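As a concrete, non-normative illustration, the following Python sketch builds
such a signature with a set of salted hash functions and estimates Jaccard
similarity from the fraction of matching positions. The helper names and the
SHA-1-based hashing are assumptions for the example, not the filter's actual
internals.

[source,python]
----
import hashlib

def min_hash_signature(tokens, hash_count=128):
    # For each simulated hash function i, keep the minimum hash code
    # over all of the document's tokens.
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{i}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        for i in range(hash_count)
    ]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of positions where the minimums agree approximates
    # the Jaccard similarity of the two token sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = min_hash_signature({"the", "quick", "brown", "fox"})
b = min_hash_signature({"the", "quick", "brown", "dog"})
print(estimated_jaccard(a, b))  # roughly 0.6 with enough hash functions
----
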
[[analysis-minhash-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `min_hash` filter, duplicate it to create the basis for a new
custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses two custom token filters to configure a new
<<analysis-custom-analyzer,custom analyzer>>:

* `my_shingle_filter`, a custom <<analysis-shingle-tokenfilter,`shingle`
filter>>. `my_shingle_filter` only outputs five-word shingles.
* `my_minhash_filter`, a custom `min_hash` filter. `my_minhash_filter` hashes
each five-word shingle once. It then assigns the hashes into 512 buckets,
keeping only the smallest hash from each bucket.

The request also assigns the custom analyzer to the `fingerprint` field mapping.

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": { <1>
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1, <2>
          "bucket_count": 512, <3>
          "hash_set_size": 1, <4>
          "with_rotation": true <5>
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fingerprint": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
----
<1> Configures a custom shingle filter to output only five-word shingles.
<2> Each five-word shingle in the stream is hashed once.
<3> The hashes are assigned to 512 buckets.
<4> Only the smallest hash in each bucket is retained.
<5> The filter fills empty buckets with the values of neighboring buckets.

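To check the analyzer, you could run sample text through the `_analyze` API.
The snippet below is a hedged sketch using Python's standard library: it
assumes a local, unsecured cluster at `localhost:9200` and the `my_analyzer`
name from the request above. The emitted tokens are the raw 16-byte MinHash
values, so they will not be human-readable.

[source,python]
----
import json
import urllib.request

# Assumes the index created above exists on a local, unsecured cluster.
body = json.dumps({
    "analyzer": "my_analyzer",
    "text": "the quick brown fox jumps over the lazy dog"
}).encode()

req = urllib.request.Request(
    "http://localhost:9200/my_index/_analyze",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for token in json.load(resp)["tokens"]:
        print(repr(token["token"]))
----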