From 2c323972029720e09b1e1ed8c1e311d5d43a5a2d Mon Sep 17 00:00:00 2001
From: aetter
Date: Fri, 28 May 2021 13:54:13 -0700
Subject: [PATCH] Minor KNN tweaks

---
 _search-plugins/knn/jni-library.md        |  3 ++-
 _search-plugins/knn/knn-index.md          | 15 ++++++++-------
 _search-plugins/knn/knn-score-script.md   |  7 ++++---
 _search-plugins/knn/painless-functions.md | 15 ++++++++++-----
 _search-plugins/knn/performance-tuning.md |  7 ++++---
 5 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/_search-plugins/knn/jni-library.md b/_search-plugins/knn/jni-library.md
index 6ce1ffcf..2a88b906 100644
--- a/_search-plugins/knn/jni-library.md
+++ b/_search-plugins/knn/jni-library.md
@@ -7,6 +7,7 @@ has_children: false
 ---

 # JNI library
+
 To integrate [nmslib's](https://github.com/nmslib/nmslib/) approximate k-NN functionality (implemented in C++) into the k-NN plugin (implemented in Java), we created a Java Native Interface library, which lets the k-NN plugin leverage nmslib's functionality. To see how we build the JNI library binary and learn how to get the most of it in your production environment, see [JNI Library Artifacts](https://github.com/opensearch-project/k-NN#jni-library-artifacts).

-For more information about JNI, see [Java Native Interface](https://en.wikipedia.org/wiki/Java_Native_Interface) on Wikipedia.
+For more information about JNI, see [Java Native Interface](https://en.wikipedia.org/wiki/Java_Native_Interface) on Wikipedia.
diff --git a/_search-plugins/knn/knn-index.md b/_search-plugins/knn/knn-index.md
index 081ed06e..d60ffc8b 100644
--- a/_search-plugins/knn/knn-index.md
+++ b/_search-plugins/knn/knn-index.md
@@ -8,7 +8,8 @@ has_children: false

 # k-NN Index

-## `knn_vector` datatype
+## knn_vector data type
+
 The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors
 into an OpenSearch index.

@@ -23,7 +24,7 @@ into an OpenSearch index.
     "parameters": {
       "ef_construction": 128,
       "m": 24
-    }
+    }
   }
 }
 ```
@@ -44,15 +45,15 @@ Mapping Pararameter | Required | Default | Updateable | Description

 Additionally, the k-NN plugin introduces several index settings that can be used to configure the k-NN structure as well.

-At the moment, several parameters defined in the settings are in the deprecation process. Those parameters should be set
-in the mapping instead of the index settings. Parameters set in the mapping will override the parameters set in the
-index settings. Setting the parameters in the mapping allows an index to have multiple `knn_vector` fields with
+At the moment, several parameters defined in the settings are in the deprecation process. Those parameters should be set
+in the mapping instead of the index settings. Parameters set in the mapping will override the parameters set in the
+index settings. Setting the parameters in the mapping allows an index to have multiple `knn_vector` fields with
 different parameters.

 Setting | Default | Updateable | Description
 :--- | :--- | :--- | :---
-`index.knn` | false | false | Whether the index should build hnsw graphs for the `knn_vector` fields. If set to false, the `knn_vector` fields will be stored in doc values, but Approximate k-NN search functionality will be disabled.
-`index.knn.algo_param.ef_search` | 512 | true | The size of the dynamic list used during k-NN searches. Higher values lead to more accurate but slower searches.
+`index.knn` | false | false | Whether the index should build hnsw graphs for the `knn_vector` fields. If set to false, the `knn_vector` fields will be stored in doc values, but Approximate k-NN search functionality will be disabled.
+`index.knn.algo_param.ef_search` | 512 | true | The size of the dynamic list used during k-NN searches. Higher values lead to more accurate but slower searches.
 `index.knn.algo_param.ef_construction` | 512 | false | (Deprecated in 1.0.0. Use the mapping parameters to set this value instead.)
Refer to mapping definition.
 `index.knn.algo_param.m` | 16 | false | (Deprecated in 1.0.0. Use the mapping parameters to set this value instead.) Refer to mapping definition.
 `index.knn.space_type` | "l2" | false | (Deprecated in 1.0.0. Use the mapping parameters to set this value instead.) Refer to mapping definition.
diff --git a/_search-plugins/knn/knn-score-script.md b/_search-plugins/knn/knn-score-script.md
index d4c1ba55..9dd6542a 100644
--- a/_search-plugins/knn/knn-score-script.md
+++ b/_search-plugins/knn/knn-score-script.md
@@ -8,15 +8,16 @@ has_math: true
 ---

 # Exact k-NN with scoring script
-The k-NN plugin implements the OpenSearch score script plugin that you can use to find the exact k-nearest neighbors to a given query point. Using the k-NN score script, you can apply a filter on an index before executing the nearest neighbor search. This is useful for dynamic search cases where the index body may vary based on other conditions.
+
+The k-NN plugin implements the OpenSearch score script plugin that you can use to find the exact k-nearest neighbors to a given query point. Using the k-NN score script, you can apply a filter on an index before executing the nearest neighbor search. This is useful for dynamic search cases where the index body may vary based on other conditions.

 Because the score script approach executes a brute force search, it doesn't scale as well as the [approximate approach](../approximate-knn). In some cases, it might be better to think about refactoring your workflow or index structure to use the approximate approach instead of the score script approach.

 ## Getting started with the score script for vectors

-Similar to approximate nearest neighbor search, in order to use the score script on a body of vectors, you must first create an index with one or more `knn_vector` fields.
+Similar to approximate nearest neighbor search, in order to use the score script on a body of vectors, you must first create an index with one or more `knn_vector` fields.

-If you intend to just use the score script approach (and not the approximate approach) you can set `index.knn` to `false` and not set `index.knn.space_type`. You can choose the space type during search. See [spaces](#spaces) for the spaces the k-NN score script suppports.
+If you intend to just use the score script approach (and not the approximate approach) you can set `index.knn` to `false` and not set `index.knn.space_type`. You can choose the space type during search. See [spaces](#spaces) for the spaces the k-NN score script suppports.

 This example creates an index with two `knn_vector` fields:

diff --git a/_search-plugins/knn/painless-functions.md b/_search-plugins/knn/painless-functions.md
index 1c1d994a..fb469f64 100644
--- a/_search-plugins/knn/painless-functions.md
+++ b/_search-plugins/knn/painless-functions.md
@@ -48,16 +48,21 @@ GET my-knn-index-2/_search
 The following table describes the available painless functions the k-NN plugin provides:

 Function name | Function signature | Description
-:--- | :---
+:--- | :---
 l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
 l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors.
The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
 cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:
`float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)`
In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.

 ## Constraints
+
 1. If a document’s `knn_vector` field has different dimensions than the query, the function throws an `IllegalArgumentException`.
+
 2. If a vector field doesn't have a value, the function throws an IllegalStateException.
+
    You can avoid this situation by first checking if a document has a value in its field:
-```
-"source": "doc[params.field].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc[params.field]))",
-```
-Because scores can only be positive, this script ranks documents with vector fields higher than those without.
+
+   ```
+   "source": "doc[params.field].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc[params.field]))",
+   ```
+
+   Because scores can only be positive, this script ranks documents with vector fields higher than those without.
diff --git a/_search-plugins/knn/performance-tuning.md b/_search-plugins/knn/performance-tuning.md
index 5d42d641..06c2287d 100644
--- a/_search-plugins/knn/performance-tuning.md
+++ b/_search-plugins/knn/performance-tuning.md
@@ -49,7 +49,7 @@ Take the following steps to improve search performance:

 * **Reduce segment count**

-   To improve search performance, you must keep the number of segments under control. Lucene's IndexSearcher searches over all of the segments in a shard to find the 'size' best results. However, because the complexity of search for the HNSW algorithm is logarithmic with respect to the number of vectors, searching over five graphs with 100 vectors each and then taking the top 'size' results from 5*k results will take longer than searching over one graph with 500 vectors and then taking the top size results from k results.
+   To improve search performance, you must keep the number of segments under control. Lucene's IndexSearcher searches over all of the segments in a shard to find the 'size' best results. However, because the complexity of search for the HNSW algorithm is logarithmic with respect to the number of vectors, searching over five graphs with 100 vectors each and then taking the top 'size' results from 5*k results will take longer than searching over one graph with 500 vectors and then taking the top size results from k results.

    Ideally, having one segment per shard provides the optimal performance with respect to search latency. You can configure an index to have multiple shards to avoid giant shards and achieve more parallelism.

@@ -86,7 +86,7 @@ Take the following steps to improve search performance:

 Recall depends on multiple factors like number of vectors, number of dimensions, segments, and so on. Searching over a large number of small segments and aggregating the results leads to better recall than searching over a small number of large segments and aggregating results. The larger the graph, the more chances of losing recall if you're using smaller algorithm parameters. Choosing larger values for algorithm parameters should help solve this issue but sacrifices search latency and indexing time. That being said, it's important to understand your system's requirements for latency and accuracy, and then choose the number of segments you want your index to have based on experimentation.

-To configure recall, adjust the algorithm parameters of the HNSW algorithm exposed through index settings. Algorithm parameters that control recall are `m`, `ef_construction`, and `ef_search`. For more information about how algorithm parameters influence indexing and search recall, see [HNSW algorithm parameters](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).
Increasing these values can help recall and lead to better search results, but at the cost of higher memory utilization and increased indexing time.
+To configure recall, adjust the algorithm parameters of the HNSW algorithm exposed through index settings. Algorithm parameters that control recall are `m`, `ef_construction`, and `ef_search`. For more information about how algorithm parameters influence indexing and search recall, see [HNSW algorithm parameters](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md). Increasing these values can help recall and lead to better search results, but at the cost of higher memory utilization and increased indexing time.

 The default recall values work on a broader set of use cases, but make sure to run your own experiments on your data sets and choose the appropriate values. For index-level settings, see [Index settings](../knn-index#index-settings).

@@ -102,7 +102,8 @@ As an example, assume you have a million vectors with a dimension of 256 and M o
 1.1 * (4 *256 + 8 * 16) * 1,000,000 ~= 1.26 GB
 ```

-**Note**: Remember that having a replica doubles the total number of vectors.
+Having a replica doubles the total number of vectors.
+{: .note }

 ## Approximate nearest neighbor versus score script
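
The memory estimate quoted in the last `performance-tuning.md` hunk (`1.1 * (4 *256 + 8 * 16) * 1,000,000`) can be reproduced with a short Python sketch. The function and parameter names are hypothetical. The general form `1.1 * (4 * dimension + 8 * M)` bytes per vector is an assumption inferred from that single example line, and extending the "a replica doubles the total number of vectors" note to more than one replica is likewise an assumption.

```python
def estimate_graph_memory_bytes(num_vectors, dimension, m, replicas=0):
    """Rough HNSW graph memory estimate in bytes.

    Assumption: generalizes the example in the patch,
    1.1 * (4 * 256 + 8 * 16) * 1,000,000, to arbitrary inputs.
    """
    # Per-vector cost taken from the example: 4 bytes per dimension,
    # 8 bytes per M, plus a 10% overhead factor.
    bytes_per_vector = 1.1 * (4 * dimension + 8 * m)
    # One replica doubles the vector count (per the patch's note);
    # treating each extra replica as one more full copy is an assumption.
    total_vectors = num_vectors * (1 + replicas)
    return bytes_per_vector * total_vectors


if __name__ == "__main__":
    primary_only = estimate_graph_memory_bytes(1_000_000, 256, 16)
    with_replica = estimate_graph_memory_bytes(1_000_000, 256, 16, replicas=1)
    print(f"primary only: {primary_only / 1e9:.2f} GB")
    print(f"one replica:  {with_replica / 1e9:.2f} GB")
```

This is only a sizing aid under the stated assumptions; actual memory use should be confirmed against the cluster's own stats.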