Add k-NN vector field type (#4850)

* Add k-NN vector field type

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Rename topic

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
2023-08-22 13:27:31 -04:00
commit 6bece563ea
parent fc14355c1f
4 changed files with 172 additions and 123 deletions
--- a/_field-types/supported-field-types/index.md
+++ b/_field-types/supported-field-types/index.md
@@ -27,6 +27,7 @@ IP | [`ip`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/ip/):
 [Autocomplete]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/autocomplete/) |[`completion`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/completion/): Provides autocomplete functionality through a completion suggester.<br> [`search_as_you_type`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/search-as-you-type/): Provides search-as-you-type functionality using both prefix and infix completion. 
 [Geographic]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geographic/)| [`geo_point`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geo-point/): A geographic point.<br>[`geo_shape`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geo-shape/): A geographic shape.
 [Rank]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/rank/) | Boosts or decreases the relevance score of documents (`rank_feature`, `rank_features`).  
+[k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) | Allows indexing a k-NN vector into OpenSearch and performing different kinds of k-NN search.
 Percolator | [`percolator`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/percolator/): Specifies to treat this field as a query. 

 ## Arrays
--- a/_field-types/supported-field-types/knn-vector.md
+++ b/_field-types/supported-field-types/knn-vector.md
@@ -0,0 +1,166 @@
+---
+layout: default
+title: k-NN vector
+nav_order: 58
+has_children: false
+parent: Supported field types
+---
+
+# k-NN vector 
+
+The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors
+into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or by specifying a model ID.
+
+## Example
+
+For example, to map `my_vector1` as a `knn_vector`, use the following request:
+
+```json
+PUT test-index
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "knn.algo_param.ef_search": 100
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 3,
+        "method": {
+          "name": "hnsw",
+          "space_type": "l2",
+          "engine": "lucene",
+          "parameters": {
+            "ef_construction": 128,
+            "m": 24
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Method definitions
+
+Method definitions are used when the underlying Approximate k-NN algorithm does not require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw* should be used for Approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment files.
+
+```json
+"my_vector": {
+  "type": "knn_vector",
+  "dimension": 4,
+  "method": {
+    "name": "hnsw",
+    "space_type": "l2",
+    "engine": "nmslib",
+    "parameters": {
+      "ef_construction": 128,
+      "m": 24
+    }
+  }
+}
+```
+
+## Model IDs
+
+Model IDs are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the
+model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model). The
+model contains the information needed to initialize the native library segment files.
+
+```json
+"my_vector": {
+  "type": "knn_vector", "model_id": "my-model"
+}
+```
+
+However, if you intend to use only Painless scripting or a k-NN score script, you need to pass only the dimension:
+```json
+"my_vector": {
+  "type": "knn_vector", "dimension": 128
+}
+```
+
+## Lucene byte vector
+
+By default, k-NN vectors are `float` vectors, where each dimension is 4 bytes. If you want to save storage space, you can use `byte` vectors with the `lucene` engine. In a `byte` vector, each dimension is a signed 8-bit integer in the [-128, 127] range. 
+ 
+Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.
+{: .note}
+
+When using `byte` vectors, expect some loss of precision in the recall compared to using `float` vectors. Byte vectors are useful in large-scale applications and use cases that prioritize a reduced memory footprint in exchange for a minimal loss of recall.
+{: .important}
+ 
+Introduced in k-NN plugin version 2.9, the optional `data_type` parameter defines the data type of a vector. The default value of this parameter is `float`.
+
+To use a `byte` vector, set the `data_type` parameter to `byte` when creating mappings for an index:
+
+```json
+PUT test-index
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "knn.algo_param.ef_search": 100
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 3,
+        "data_type": "byte",
+        "method": {
+          "name": "hnsw",
+          "space_type": "l2",
+          "engine": "lucene",
+          "parameters": {
+            "ef_construction": 128,
+            "m": 24
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range:
+
+```json
+PUT test-index/_doc/1
+{
+  "my_vector1": [-126, 28, 127]
+}
+```
+{% include copy-curl.html %}
+
+```json
+PUT test-index/_doc/2
+{
+  "my_vector1": [100, -128, 0]
+}
+```
+{% include copy-curl.html %}
+
+When querying, be sure to use a `byte` vector:
+
+```json
+GET test-index/_search
+{
+  "size": 2,
+  "query": {
+    "knn": {
+      "my_vector1": {
+        "vector": [26, -120, 99],
+        "k": 2
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
--- a/_search-plugins/knn/approximate-knn.md
+++ b/_search-plugins/knn/approximate-knn.md
@@ -1,6 +1,6 @@
 ---
 layout: default
-title: Approximate search
+title: Approximate k-NN search
 nav_order: 15
 parent: k-NN
 has_children: false
@@ -79,7 +79,7 @@ PUT my-knn-index-1
 }
 ```

-In the example above, both `knn_vector` fields are configured from method definitions. Additionally, `knn_vector` fields can also be configured from models. You can learn more about this in the [knn_vector data type]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#knn_vector-data-type) section.
+In the preceding example, both `knn_vector` fields are configured from method definitions. `knn_vector` fields can also be configured from models. For more information, see [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/).

 The `knn_vector` data type supports a vector of floats that can have a dimension count of up to 16,000 for the nmslib and faiss engines, as set by the dimension mapping parameter. The maximum dimension count for the Lucene library is 1,024.

--- a/_search-plugins/knn/knn-index.md
+++ b/_search-plugins/knn/knn-index.md
@@ -1,133 +1,15 @@
 ---
 layout: default
-title: k-NN Index
+title: k-NN index
 nav_order: 5
 parent: k-NN
 has_children: false
 ---

-# k-NN Index
-
-## knn_vector data type
+# k-NN index

 The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors
-into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or specifying a model id.
-
-Method definitions are used when the underlying Approximate k-NN algorithm does not require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw* should be used for Approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment files.
-
-```json
-"my_vector": {
-  "type": "knn_vector",
-  "dimension": 4,
-  "method": {
-    "name": "hnsw",
-    "space_type": "l2",
-    "engine": "nmslib",
-    "parameters": {
-      "ef_construction": 128,
-      "m": 24
-    }
-  }
-}
-```
-
-Model IDs are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the
-model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model). The
-model contains the information needed to initialize the native library segment files.
-
-```json
-  "type": "knn_vector",
-  "model_id": "my-model"
-}
-```
-
-However, if you intend to just use painless scripting or a k-NN score script, you only need to pass the dimension.
- ```json
-   "type": "knn_vector",
-   "dimension": 128
- }
- ```
-
-### Lucene byte vector
-
-By default, k-NN vectors are `float` vectors, where each dimension is 4 bytes. If you want to save storage space, you can use `byte` vectors with the `lucene` engine. In a `byte` vector, each dimension is a signed 8-bit integer in the [-128, 127] range. 
- 
-Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.
-{: .note}
-
-When using `byte` vectors, expect some loss of precision in the recall compared to using `float` vectors. Byte vectors are useful in large-scale applications and use cases that prioritize a reduced memory footprint in exchange for a minimal loss of recall.
-{: .important}
- 
-Introduced in k-NN plugin version 2.9, the optional `data_type` parameter defines the data type of a vector. The default value of this parameter is `float`.
-
-To use a `byte` vector, set the `data_type` parameter to `byte` when creating mappings for an index:
-
- ```json
-PUT test-index
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "knn.algo_param.ef_search": 100
-    }
-  },
-  "mappings": {
-    "properties": {
-      "my_vector1": {
-        "type": "knn_vector",
-        "dimension": 3,
-        "data_type": "byte",
-        "method": {
-          "name": "hnsw",
-          "space_type": "l2",
-          "engine": "lucene",
-          "parameters": {
-            "ef_construction": 128,
-            "m": 24
-          }
-        }
-      }
-    }
-  }
-}
-```
-{% include copy-curl.html %}
-
-Then ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range:
-
-```json
-PUT test-index/_doc/1
-{
-  "my_vector1": [-126, 28, 127]
-}
-```
-{% include copy-curl.html %}
-
-```json
-PUT test-index/_doc/2
-{
-  "my_vector1": [100, -128, 0]
-}
-```
-{% include copy-curl.html %}
-
-When querying, be sure to use a `byte` vector:
-
-```json
-GET test-index/_search
-{
-  "size": 2,
-  "query": {
-    "knn": {
-      "my_vector1": {
-        "vector": [26, -120, 99],
-        "k": 2
-      }
-    }
-  }
-}
-```
-{% include copy-curl.html %}
+into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. For more information, see [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/).

 ## Method definitions