“fixfor#791_k-NN-ANN” (#837)
Signed-off-by: cwillum <cwmmoore@amazon.com>
# Approximate k-NN search

Standard k-NN search methods compute similarity using a brute-force approach that measures the nearest distance between a query and a number of points, which produces exact results. This works well in many applications. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. Approximate k-NN search methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. Using this approach requires a sacrifice in accuracy but increases search processing speeds appreciably.

The approximate k-NN search methods leveraged by OpenSearch use approximate nearest neighbor (ANN) algorithms from the [nmslib](https://github.com/nmslib/nmslib), [faiss](https://github.com/facebookresearch/faiss), and [Lucene](https://lucene.apache.org/) libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides, this method offers the best search scalability for large datasets. This approach is the preferred method when a dataset reaches hundreds of thousands of vectors.

For details on the algorithms the plugin currently supports, see the [k-NN Index documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions).
{: .note}

The k-NN plugin builds a native library index of the vectors for each knn-vector field/Lucene segment pair during indexing, which can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, see the [Apache Lucene documentation](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description). These native library indexes are loaded into native memory during search and managed by a cache. To learn more about preloading native library indexes into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup-operation). Additionally, you can see which native library indexes are already loaded in memory in the [stats API section]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).

Because the native library indexes are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters are applied to the results produced by the approximate nearest neighbor search.
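Because all filtering happens on the results, a basic approximate k-NN query only needs a query vector and `k`. The following is a hedged sketch of such a query; the index name, field name, and vector values are illustrative:

```json
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}
```

Here `k` controls how many neighbors the ANN search retrieves, while `size` controls how many results the query ultimately returns.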
### Recommendations for engines and cluster node sizing

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. You can follow the general information in this section to help determine which engine will best meet your requirements.

* The faiss engine performs exceptionally well, often by orders of magnitude, on hardware that includes a GPU. When cost is not the first concern, this is the recommended engine.
* When only a CPU is available, nmslib is a good choice. In general, it outperforms both faiss and Lucene.
* For relatively smaller datasets (up to a few million vectors), the Lucene engine demonstrates better latencies and recall. At the same time, its index size is the smallest compared to the other engines, which allows it to use smaller AWS instances for data nodes.<br>Also, the Lucene engine uses a pure Java implementation and does not share any of the limitations that engines using platform-native code experience. However, one exception is that the maximum number of vector dimensions for the Lucene engine is 1,024, compared with 10,000 for the other engines. Refer to the sample mapping parameters in the following section to see where this is configured.

When considering cluster node sizing, a general approach is to first establish an even distribution of the index across the cluster. However, there are other considerations. To help make these choices, see the OpenSearch managed service guidance in the section [Sizing domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sizing-domains.html).

## Get started with approximate k-NN

To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with `index.knn` set to `true`. This setting tells the plugin to create native library indexes for the index.

Next, you must add one or more fields of the `knn_vector` data type. This example creates an index with two `knn_vector` fields, one using `faiss` and the other using `nmslib`:

```json
PUT my-knn-index-1
...
}
```

In the example above, both `knn_vector` fields are configured from method definitions. Additionally, `knn_vector` fields can also be configured from models. You can learn more about this in the [knn_vector data type]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#knn_vector-data-type) section.

The `knn_vector` data type supports a vector of floats that can have a dimension of up to 10,000 for the nmslib and faiss engines, as set by the dimension mapping parameter. The maximum dimension for the Lucene library is 1,024.

In OpenSearch, codecs handle the storage and retrieval of indexes. The k-NN plugin uses a custom codec to write vector data to native library indexes so that the underlying k-NN search library can read it.
{: .tip }

After you create the index, you can add some data to it:
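Ingestion uses the standard Bulk API. The following is a hedged sketch, assuming illustrative field names and dimensions (`my_vector1` with dimension 2 and `my_vector2` with dimension 4):

```json
POST _bulk
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{ "my_vector1": [1.5, 2.5] }
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4] }
```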

## Building a k-NN index from a model

```json
PUT /train-index
...
}
```

Notice that `index.knn` is not set in the index settings. This ensures that you do not create native library indexes for this index.

You can now add some data to the index:

```json
POST _bulk
...
```

You can then check the model's state:

```json
GET /_plugins/_knn/models/my-model?filter_path=state&pretty
```

```json
{
  "state" : "created"
}
```
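Before the model can reach the "created" state, it must be trained against the training index. The following Train API call is a hedged sketch; the model ID, training field, dimension, and method definition are assumptions:

```json
POST /_plugins/_knn/models/my-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 4,
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "nlist": 4
    }
  }
}
```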

Once the model enters the "created" state, you can create an index that will use this model to initialize its native library indexes:

```json
PUT /target-index
...
```

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors.

<td>\[ 1 - {A · B \over \|A\| · \|B\|} = 1 - {\sum_{i=1}^n (A_i · B_i) \over \sqrt{\sum_{i=1}^n A_i^2} · \sqrt{\sum_{i=1}^n B_i^2}}\] where \(\|A\|\) and \(\|B\|\) represent normalized vectors.</td>
<td>nmslib and faiss:<br>1 / (1 + Distance Function)<br>Lucene:<br>(1 + Distance Function) / 2</td>
</tr>
<tr>
<td>innerproduct (not supported for Lucene)</td>
<td>\[ Distance(A, B) = - {A · B} \]</td>
<td>if Distance Function is > or = 0, use 1 / (1 + Distance Function). Otherwise, -Distance Function + 1</td>
</tr>
</table>

This plugin supports three different methods for obtaining the k-nearest neighbors:

1. **Approximate k-NN**

Approximate k-NN is the best choice for searches over large indexes (i.e., hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the script scoring method or painless extensions.

For more details about this method, including recommendations for which engine to use, see [Approximate k-NN search]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/).
2. **Script Score k-NN**

# JNI libraries

To integrate [nmslib](https://github.com/nmslib/nmslib/) and [faiss](https://github.com/facebookresearch/faiss/) approximate k-NN functionality (implemented in C++) into the k-NN plugin (implemented in Java), we created a Java Native Interface, which lets the k-NN plugin make calls to the native libraries. The interface includes three libraries: `libopensearchknn_nmslib`, the JNI library that interfaces with nmslib; `libopensearchknn_faiss`, the JNI library that interfaces with faiss; and `libopensearchknn_common`, a library containing common shared functionality between the native libraries.

The Lucene library is not implemented using a native library.
{: .note}

The `libopensearchknn_faiss` and `libopensearchknn_nmslib` libraries are lazily loaded when they are first called in the plugin. This means that if you plan to use only one of the libraries, the plugin never loads the other.

## Method definitions

A method definition refers to the underlying configuration of the Approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model).

A method definition will always contain the name of the method, the `space_type` the method is built for, the engine (the library) to use, and a map of parameters.

Mapping Parameter | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`name` | true | n/a | false | The identifier for the nearest neighbor method.
`space_type` | false | l2 | false | The vector space used to calculate the distance between vectors.
`engine` | false | nmslib | false | The approximate k-NN library to use for indexing and search. The available libraries are faiss, nmslib, and Lucene.
`parameters` | false | null | false | The parameters used for the nearest neighbor method.
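
Put together, a method definition inside a `knn_vector` field mapping might look like the following sketch; the parameter values are illustrative:

```json
"method": {
  "name": "hnsw",
  "space_type": "l2",
  "engine": "nmslib",
  "parameters": {
    "ef_construction": 128,
    "m": 24
  }
}
```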
### Supported nmslib methods

Method Name | Requires Training? | Supported Spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | l2, innerproduct, cosinesimil, l1, linf | Hierarchical proximity graph approach to Approximate k-NN search. For more details on the algorithm, see this [abstract](https://arxiv.org/abs/1603.09320).

#### HNSW parameters

Parameter Name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`ef_construction` | false | 512 | false | The size of the dynamic list used during k-NN graph creation. Higher values lead to a more accurate graph but slower indexing speed.
`m` | false | 16 | false | The number of bidirectional links that the plugin creates for each new element. Increasing and decreasing this value can have a large impact on memory consumption. Keep this value between 2 and 100.

For nmslib, *ef_search* is set in the [index settings](#index-settings).
{: .note}
### Supported faiss methods

Method Name | Requires Training? | Supported Spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to Approximate k-NN search.
`ivf` | true | l2, innerproduct | Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.

For hnsw, "innerproduct" is not available when PQ is used.
{: .note}

#### HNSW parameters

Parameter Name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`ef_search` | false | 512 | false | The size of the dynamic list used during k-NN searches. Higher values lead to more accurate but slower searches.
`ef_construction` | false | 512 | false | The size of the dynamic list used during k-NN graph creation. Higher values lead to a more accurate graph but slower indexing speed.
`m` | false | 16 | false | The number of bidirectional links that the plugin creates for each new element. Increasing and decreasing this value can have a large impact on memory consumption. Keep this value between 2 and 100.
`encoder` | false | flat | false | Encoder definition for encoding vectors. Encoders can reduce the memory footprint of your index, at the expense of search accuracy.

#### IVF parameters

Parameter Name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`nlist` | false | 4 | false | Number of buckets to partition vectors into. Higher values may lead to more accurate searches at the expense of memory and training latency. For more information about choosing the right value, refer to [Guidelines to choose an index](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index).
`nprobes` | false | 1 | false | Number of buckets to search during query. Higher values lead to more accurate but slower searches.
`encoder` | false | flat | false | Encoder definition for encoding vectors. Encoders can reduce the memory footprint of your index, at the expense of search accuracy.

For more information about setting these parameters, refer to the [faiss documentation](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes).


The IVF algorithm requires a training step. To create an index that uses IVF, you need to train a model with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model), passing the IVF method definition. IVF requires, at a minimum, `nlist` training data points, but it is [recommended that you use more](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset). Training data can be composed of either the same data that is going to be ingested or a separate dataset.

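For example, the IVF method definition passed to the Train API might look like the following sketch; the parameter values are illustrative:

```json
"method": {
  "name": "ivf",
  "engine": "faiss",
  "space_type": "l2",
  "parameters": {
    "nlist": 4,
    "nprobes": 2
  }
}
```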
### Supported Lucene methods

Method Name | Requires Training? | Supported Spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | l2, cosinesimil | Hierarchical proximity graph approach to Approximate k-NN search.

#### HNSW parameters

Parameter Name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`ef_construction` | false | 512 | false | The size of the dynamic list used during k-NN graph creation. Higher values lead to a more accurate graph but slower indexing speed.<br>The Lucene engine uses its own term, "beam_width", for this function, which corresponds directly to "ef_construction". To be consistent throughout the OpenSearch documentation, we retain the term "ef_construction" to label this parameter.
`m` | false | 16 | false | The number of bidirectional links that the plugin creates for each new element. Increasing and decreasing this value can have a large impact on memory consumption. Keep this value between 2 and 100.<br>The Lucene engine uses its own term, "max_connections", for this function, which corresponds directly to "m". To be consistent throughout the OpenSearch documentation, we retain the term "m" to label this parameter.

The Lucene HNSW implementation ignores `ef_search` and dynamically sets it to the value of "k" in the search request. Therefore, there is no need to configure `ef_search` when using the Lucene engine.
{: .note}

```json
{
  "type": "knn_vector",
  "dimension": 100,
  "method": {
    "name": "hnsw",
    "engine": "lucene",
    "space_type": "l2",
    "parameters": {
      "m": 16,
      "ef_construction": 245
    }
  }
}
```

### Supported faiss encoders

You can use encoders to reduce the memory footprint of a k-NN index at the expense of search accuracy. faiss has several encoder types, but the plugin currently only supports *flat* and *pq* encoding.

An example method definition that specifies an encoder may look something like this:
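A hedged sketch of such a definition; the `pq` parameter values are illustrative:

```json
"method": {
  "name": "hnsw",
  "engine": "faiss",
  "space_type": "l2",
  "parameters": {
    "encoder": {
      "name": "pq",
      "parameters": {
        "code_size": 8
      }
    }
  }
}
```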

Encoder Name | Requires Training? | Description
:--- | :--- | :---
`flat` | false | Encode vectors as floating point arrays. This encoding does not reduce memory footprint.
`pq` | true | Short for product quantization, it is a lossy compression technique that encodes a vector into a fixed size of bytes using clustering, with the goal of minimizing the drop in k-NN search accuracy. At a high level, vectors are broken up into `m` subvectors, and then each subvector is represented by a `code_size` code obtained from a code book produced during training. For more information about product quantization, see this [blog post](https://medium.com/dotstar/understanding-faiss-part-2-79d90b1e5388).

#### PQ parameters

Parameter Name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---

If memory is a concern, consider adding a PQ encoder to your HNSW or IVF index.

### Memory estimation

In a typical OpenSearch cluster, a certain portion of RAM is set aside for the JVM heap. The k-NN plugin allocates native library indexes to a portion of the remaining RAM. This portion's size is determined by the `circuit_breaker_limit` cluster setting. By default, the limit is set at 50%.

Having a replica doubles the total number of vectors.
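
As a rough illustration, the estimate for an HNSW index can be sketched as follows; treat the per-vector formula as an assumption based on common HNSW sizing guidance rather than something stated on this page:

\[ \text{memory} \approx 1.1 \times (4 \times d + 8 \times m) \times n \ \text{bytes} \]

where \(d\) is the vector dimension, \(m\) is the HNSW `m` parameter, and \(n\) is the total number of vectors, including replicas. For example, 1 million 256-dimensional vectors with \(m = 16\) would require roughly \(1.1 \times (4 \times 256 + 8 \times 16) \times 10^6 \approx 1.27\) GB of native memory.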

## Index settings

At the moment, several parameters defined in the settings are in the deprecation process.

Setting | Default | Updatable | Description
:--- | :--- | :--- | :---
`index.knn` | false | false | Whether the index should build native library indexes for the `knn_vector` fields. If set to false, the `knn_vector` fields will be stored in doc values, but Approximate k-NN search functionality will be disabled.
`index.knn.algo_param.ef_search` | 512 | true | The size of the dynamic list used during k-NN searches. Higher values lead to more accurate but slower searches. Only available for nmslib.
`index.knn.algo_param.ef_construction` | 512 | false | Deprecated in 1.0.0. Use the [mapping parameters]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#method-definitions) to set this value instead.
`index.knn.algo_param.m` | 16 | false | Deprecated in 1.0.0. Use the [mapping parameters]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#method-definitions) to set this value instead.
`index.knn.space_type` | l2 | false | Deprecated in 1.0.0. Use the [mapping parameters]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#method-definitions) to set this value instead.
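
Because `index.knn.algo_param.ef_search` is the only updatable setting in this table, it can be changed on a live index with the Update Settings API. A hedged sketch; the index name and value are illustrative:

```json
PUT /my-knn-index-1/_settings
{
  "index.knn.algo_param.ef_search": 100
}
```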