mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-02-14 17:05:36 +00:00
Previously the functions accepted a doc values reference, whereas they now accept the name of the vector field. Here's an example of how a vector function was called before and after the change. ``` Before: cosineSimilarity(params.query_vector, doc['field']) After: cosineSimilarity(params.query_vector, 'field') ``` This seems more intuitive, since we don't allow direct access to vector doc values and the the meaning of `doc['field']` is unclear. The PR makes the following changes (broken into distinct commits): * Add new function signatures of the form `function(params.query_vector, 'field')` and deprecates the old ones. Because Painless doesn't allow two methods with the same name and number of arguments, we allow a generic `Object` to be passed in to the function and decide on the behavior through an `instanceof` check. * Refactor the class bindings so that the document field is passed to the constructor instead of the instance method. This allows us to avoid retrieving the vector doc values on every function invocation, which gives a tiny speed-up in benchmarks. Note that this PR adds new signatures for the sparse vector functions too, even though sparse vectors are deprecated. It seemed simplest to understand (for both us and users) to keep everything symmetric between dense and sparse vectors.
372 lines
9.4 KiB
Plaintext
372 lines
9.4 KiB
Plaintext
[role="xpack"]
|
|
[testenv="basic"]
|
|
[[vector-functions]]
|
|
===== Functions for vector fields
|
|
|
|
experimental[]
|
|
|
|
NOTE: During vector functions' calculation, all matched documents are
|
|
linearly scanned. Thus, expect the query time grow linearly
|
|
with the number of matched documents. For this reason, we recommend
|
|
to limit the number of matched documents with a `query` parameter.
|
|
|
|
====== `dense_vector` functions
|
|
|
|
Let's create an index with a `dense_vector` mapping and index a couple
|
|
of documents into it.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT my_index
|
|
{
|
|
"mappings": {
|
|
"properties": {
|
|
"my_dense_vector": {
|
|
"type": "dense_vector",
|
|
"dims": 3
|
|
},
|
|
"status" : {
|
|
"type" : "keyword"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
PUT my_index/_doc/1
|
|
{
|
|
"my_dense_vector": [0.5, 10, 6],
|
|
"status" : "published"
|
|
}
|
|
|
|
PUT my_index/_doc/2
|
|
{
|
|
"my_dense_vector": [-0.5, 10, 10],
|
|
"status" : "published"
|
|
}
|
|
|
|
POST my_index/_refresh
|
|
|
|
--------------------------------------------------
|
|
// TESTSETUP
|
|
|
|
The `cosineSimilarity` function calculates the measure of
|
|
cosine similarity between a given query vector and document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published" <1>
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0", <2>
|
|
"params": {
|
|
"query_vector": [4, 3.4, -0.2] <3>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> To restrict the number of documents on which script score calculation is applied, provide a filter.
|
|
<2> The script adds 1.0 to the cosine similarity to prevent the score from being negative.
|
|
<3> To take advantage of the script optimizations, provide a query vector as a script parameter.
|
|
|
|
NOTE: If a document's dense vector field has a number of dimensions
|
|
different from the query's vector, an error will be thrown.
|
|
|
|
The `dotProduct` function calculates the measure of
|
|
dot product between a given query vector and document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": """
|
|
double value = dotProduct(params.query_vector, 'my_dense_vector');
|
|
return sigmoid(1, Math.E, -value); <1>
|
|
""",
|
|
"params": {
|
|
"query_vector": [4, 3.4, -0.2]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Using the standard sigmoid function prevents scores from being negative.
|
|
|
|
The `l1norm` function calculates L^1^ distance
|
|
(Manhattan distance) between a given query vector and
|
|
document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": "1 / (1 + l1norm(params.queryVector, 'my_dense_vector'))", <1>
|
|
"params": {
|
|
"queryVector": [4, 3.4, -0.2]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Unlike `cosineSimilarity` that represent similarity, `l1norm` and
|
|
`l2norm` shown below represent distances or differences. This means, that
|
|
the more similar the vectors are, the lower the scores will be that are
|
|
produced by the `l1norm` and `l2norm` functions.
|
|
Thus, as we need more similar vectors to score higher,
|
|
we reversed the output from `l1norm` and `l2norm`. Also, to avoid
|
|
division by 0 when a document vector matches the query exactly,
|
|
we added `1` in the denominator.
|
|
|
|
The `l2norm` function calculates L^2^ distance
|
|
(Euclidean distance) between a given query vector and
|
|
document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": "1 / (1 + l2norm(params.queryVector, 'my_dense_vector'))",
|
|
"params": {
|
|
"queryVector": [4, 3.4, -0.2]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
NOTE: If a document doesn't have a value for a vector field on which
|
|
a vector function is executed, an error will be thrown.
|
|
|
|
You can check if a document has a value for the field `my_vector` by
|
|
`doc['my_vector'].size() == 0`. Your overall script can look like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
====== `sparse_vector` functions
|
|
|
|
deprecated[7.6, The `sparse_vector` type is deprecated and will be removed in 8.0.]
|
|
|
|
Let's create an index with a `sparse_vector` mapping and index a couple
|
|
of documents into it.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT my_sparse_index
|
|
{
|
|
"mappings": {
|
|
"properties": {
|
|
"my_sparse_vector": {
|
|
"type": "sparse_vector"
|
|
},
|
|
"status" : {
|
|
"type" : "keyword"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT my_sparse_index/_doc/1
|
|
{
|
|
"my_sparse_vector": {"2": 1.5, "15" : 2, "50": -1.1, "4545": 1.1},
|
|
"status" : "published"
|
|
}
|
|
|
|
PUT my_sparse_index/_doc/2
|
|
{
|
|
"my_sparse_vector": {"2": 2.5, "10" : 1.3, "55": -2.3, "113": 1.6},
|
|
"status" : "published"
|
|
}
|
|
|
|
POST my_sparse_index/_refresh
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
|
|
The `cosineSimilaritySparse` function calculates cosine similarity
|
|
between a given query vector and document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_sparse_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": "cosineSimilaritySparse(params.query_vector, 'my_sparse_vector') + 1.0",
|
|
"params": {
|
|
"query_vector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
|
|
|
|
The `dotProductSparse` function calculates dot product
|
|
between a given query vector and document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_sparse_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": """
|
|
double value = dotProductSparse(params.query_vector, 'my_sparse_vector');
|
|
return sigmoid(1, Math.E, -value);
|
|
""",
|
|
"params": {
|
|
"query_vector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
|
|
|
|
The `l1normSparse` function calculates L^1^ distance
|
|
between a given query vector and document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_sparse_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": "1 / (1 + l1normSparse(params.queryVector, 'my_sparse_vector'))",
|
|
"params": {
|
|
"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
|
|
|
|
The `l2normSparse` function calculates L^2^ distance
|
|
between a given query vector and document vectors.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET my_sparse_index/_search
|
|
{
|
|
"query": {
|
|
"script_score": {
|
|
"query" : {
|
|
"bool" : {
|
|
"filter" : {
|
|
"term" : {
|
|
"status" : "published"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"script": {
|
|
"source": "1 / (1 + l2normSparse(params.queryVector, 'my_sparse_vector'))",
|
|
"params": {
|
|
"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
// TEST[warning:The [sparse_vector] field type is deprecated and will be removed in 8.0.]
|