Minor KNN tweaks

aetter 2021-05-28 13:54:13 -07:00
parent 2f8ef915a7
commit 2c32397202
5 changed files with 28 additions and 19 deletions

@ -7,6 +7,7 @@ has_children: false
---
# JNI library
To integrate [nmslib's](https://github.com/nmslib/nmslib/) approximate k-NN functionality (implemented in C++) into the k-NN plugin (implemented in Java), we created a Java Native Interface library, which lets the k-NN plugin leverage nmslib's functionality. To see how we build the JNI library binary and learn how to get the most out of it in your production environment, see [JNI Library Artifacts](https://github.com/opensearch-project/k-NN#jni-library-artifacts).
For more information about JNI, see [Java Native Interface](https://en.wikipedia.org/wiki/Java_Native_Interface) on Wikipedia.

@ -8,7 +8,8 @@ has_children: false
# k-NN Index
## `knn_vector` datatype
## knn_vector data type
The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors
into an OpenSearch index.

@ -8,6 +8,7 @@ has_math: true
---
# Exact k-NN with scoring script
The k-NN plugin implements the OpenSearch score script plugin that you can use to find the exact k-nearest neighbors to a given query point. Using the k-NN score script, you can apply a filter on an index before executing the nearest neighbor search. This is useful for dynamic search cases where the index body may vary based on other conditions.
Because the score script approach executes a brute force search, it doesn't scale as well as the [approximate approach](../approximate-knn). In some cases, consider refactoring your workflow or index structure to use the approximate approach instead of the score script approach.
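
To illustrate why the score script approach is exact but scales linearly, here is a minimal Python sketch of a filter followed by a brute force search. It is not the plugin's implementation: the `exact_knn` function, the document layout, and the use of Euclidean distance are all assumptions made for the example.

```python
import heapq
import math


def exact_knn(docs, query, k, keep=lambda doc: True):
    """Return the k documents closest to `query` among those passing `keep`.

    Hypothetical helper: every surviving document is compared with the
    query, so the cost grows linearly with the number of filtered documents.
    """
    candidates = [doc for doc in docs if keep(doc)]
    return heapq.nsmallest(
        k, candidates, key=lambda doc: math.dist(doc["vector"], query)
    )


# Made-up documents for illustration only.
docs = [
    {"id": 1, "color": "red", "vector": [0.0, 0.0]},
    {"id": 2, "color": "blue", "vector": [1.0, 1.0]},
    {"id": 3, "color": "red", "vector": [5.0, 5.0]},
    {"id": 4, "color": "red", "vector": [1.0, 0.0]},
]

# Filter first, then search only the documents that remain.
nearest_red = exact_knn(docs, [1.0, 1.0], k=2, keep=lambda d: d["color"] == "red")
```

Because the filter runs before the distance computation, a selective filter keeps the brute force step cheap even when the index is large.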

@ -54,10 +54,15 @@ l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This functi
cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
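
The cosine similarity description above can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's code: the function names are hypothetical, the vectors are assumed to be non-zero, and the sketch keeps the raw similarity and the `1.0` shift as separate steps because the text does not show where the plugin applies the shift.

```python
import math


def cosine_similarity(query, doc_vector, norm_query=None):
    """Raw cosine similarity in [-1, 1]; assumes non-zero vectors.

    `norm_query` mirrors the optional third argument described above:
    pass it when the query magnitude is already known so it is not
    recomputed for every document.
    """
    if norm_query is None:
        norm_query = math.sqrt(sum(q * q for q in query))
    norm_doc = math.sqrt(sum(d * d for d in doc_vector))
    dot = sum(q * d for q, d in zip(query, doc_vector))
    return dot / (norm_query * norm_doc)


def shifted_score(query, doc_vector, norm_query=None):
    """Add 1.0 so that the score is never negative (range [0, 2])."""
    return 1.0 + cosine_similarity(query, doc_vector, norm_query)


query = [3.0, 4.0]
query_norm = 5.0  # computed once, reused for every document
scores = [shifted_score(query, v, query_norm) for v in ([3.0, 4.0], [-3.0, -4.0])]
```

Identical directions score 2.0 and opposite directions score 0.0, so no document receives a negative score.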
## Constraints
1. If a document's `knn_vector` field has different dimensions than the query, the function throws an `IllegalArgumentException`.
2. If a vector field doesn't have a value, the function throws an `IllegalStateException`.
You can avoid this situation by first checking if a document has a value in its field:
```
"source": "doc[params.field].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc[params.field]))",
```
Because scores can only be positive, this script ranks documents with vector fields higher than those without.
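
The guard in the script above can be mirrored in plain Python to show how the two constraints interact. This is a sketch only: `l2_squared` and `guarded_score` are hypothetical stand-ins for the script functions, and Python's `ValueError` stands in for the Java `IllegalArgumentException`.

```python
def l2_squared(query, doc_vector):
    """Squared Euclidean distance; rejects mismatched dimensions."""
    if len(query) != len(doc_vector):
        # Stand-in for the IllegalArgumentException described above.
        raise ValueError("query and document vectors have different dimensions")
    return sum((q - d) ** 2 for q, d in zip(query, doc_vector))


def guarded_score(query, doc, field):
    """Score 0 for a document without a vector, else 1 / (1 + distance)."""
    values = doc.get(field) or []
    if len(values) == 0:
        return 0.0
    return 1.0 / (1.0 + l2_squared(query, values))


# Made-up documents: one with a vector, one without.
with_vector = {"my_vector": [1.0, 1.0]}
without_vector = {"title": "no vector here"}

score_present = guarded_score([0.0, 0.0], with_vector, "my_vector")
score_missing = guarded_score([0.0, 0.0], without_vector, "my_vector")
```

Any document that has a vector scores strictly above 0, so it always outranks a document that falls through to the guard.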

@ -102,7 +102,8 @@ As an example, assume you have a million vectors with a dimension of 256 and M o
1.1 * (4 * 256 + 8 * 16) * 1,000,000 ~= 1.267 GB
```
**Note**: Remember that having a replica doubles the total number of vectors.
Having a replica doubles the total number of vectors.
{: .note }
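
The memory estimate above is easy to reproduce for other index sizes. The following Python sketch uses the same formula; the function name and the `replicas` parameter are assumptions added for the example, with each replica counted as one extra copy of every vector.

```python
def estimate_graph_memory_bytes(num_vectors, dimension, m, replicas=0):
    """Estimate native memory as 1.1 * (4 * dimension + 8 * M) bytes per vector.

    Hypothetical helper. `replicas` is an assumption of this sketch:
    one replica doubles the number of vectors, as the note above states.
    """
    total_vectors = num_vectors * (1 + replicas)
    return 1.1 * (4 * dimension + 8 * m) * total_vectors


# The example from the text: one million 256-dimensional vectors, M = 16.
primary_only = estimate_graph_memory_bytes(1_000_000, 256, 16)
with_replica = estimate_graph_memory_bytes(1_000_000, 256, 16, replicas=1)

print(f"{primary_only / 1e9:.3f} GB without replicas")
print(f"{with_replica / 1e9:.3f} GB with one replica")
```

The estimate is linear in the number of vectors, so adding one replica doubles it.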
## Approximate nearest neighbor versus score script