Add k-NN vector field type (#4850)

* Add k-NN vector field type

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Rename topic

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
2023-08-22 13:27:31 -04:00
commit 6bece563ea
parent fc14355c1f
4 changed files with 172 additions and 123 deletions
--- a/_field-types/supported-field-types/index.md
+++ b/_field-types/supported-field-types/index.md
@@ -27,6 +27,7 @@ IP | [`ip`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/ip/):
 [Autocomplete]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/autocomplete/) |[`completion`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/completion/): Provides autocomplete functionality through a completion suggester.<br> [`search_as_you_type`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/search-as-you-type/): Provides search-as-you-type functionality using both prefix and infix completion. 
 [Geographic]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geographic/)| [`geo_point`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geo-point/): A geographic point.<br>[`geo_shape`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geo-shape/): A geographic shape.
 [Rank]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/rank/) | Boosts or decreases the relevance score of documents (`rank_feature`, `rank_features`).  
 [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) | Allows indexing a k-NN vector into OpenSearch and performing different kinds of k-NN search.
 Percolator | [`percolator`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/percolator/): Specifies to treat this field as a query. 
 ## Arrays
--- a/_field-types/supported-field-types/knn-vector.md
+++ b/_field-types/supported-field-types/knn-vector.md
@@ -0,0 +1,166 @@
 ---
 layout: default
 title: k-NN vector
 nav_order: 58
 has_children: false
 parent: Supported field types
 ---
 # k-NN vector 
 The k-NN plugin introduces a custom data type, `knn_vector`, that allows users to ingest k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or by specifying a model ID.
 ## Example
 For example, to map `my_vector1` as a `knn_vector`, use the following request:
 ```json
 PUT test-index
 {
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
 }
 ```
 {% include copy-curl.html %}
 ## Method definitions
 Method definitions are used when the underlying Approximate k-NN algorithm does not require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw* should be used for Approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment files.
 ```json
 "my_vector": {
  "type": "knn_vector",
  "dimension": 4,
  "method": {
    "name": "hnsw",
    "space_type": "l2",
    "engine": "nmslib",
    "parameters": {
      "ef_construction": 128,
      "m": 24
    }
  }
 }
 ```
 ## Model IDs
 Model IDs are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the
 model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model). The
 model contains the information needed to initialize the native library segment files.
 ```json
 "my_vector": {
  "type": "knn_vector",
  "model_id": "my-model"
 }
 ```
 However, if you intend to use only Painless scripting or a k-NN score script, you only need to pass the dimension:
 ```json
 "my_vector": {
   "type": "knn_vector",
   "dimension": 128
 }
 ```
 ## Lucene byte vector
 By default, k-NN vectors are `float` vectors, where each dimension is 4 bytes. If you want to save storage space, you can use `byte` vectors with the `lucene` engine. In a `byte` vector, each dimension is a signed 8-bit integer in the [-128, 127] range. 
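As a rough illustration of the savings, the following Python sketch compares raw vector storage for the two data types. It counts only the vector values themselves (4 bytes per `float` dimension and 1 byte per `byte` dimension, as stated above). The function name is hypothetical, and index structures such as the HNSW graph add overhead that is not modeled here.

```python
# Bytes used per dimension for each data_type, as described in the text above.
BYTES_PER_DIMENSION = {"float": 4, "byte": 1}


def raw_vector_bytes(num_vectors: int, dimension: int, data_type: str = "float") -> int:
    """Estimate storage for vector values only (no graph or metadata overhead)."""
    return num_vectors * dimension * BYTES_PER_DIMENSION[data_type]


# Hypothetical workload: one million 128-dimensional vectors.
float_size = raw_vector_bytes(1_000_000, 128, "float")
byte_size = raw_vector_bytes(1_000_000, 128, "byte")
print(float_size, byte_size, float_size // byte_size)
```

Under these assumptions, the raw values of `byte` vectors take one quarter of the space of `float` vectors, whatever the dimension.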
 Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.
 {: .note}
 When using `byte` vectors, expect some loss of precision in the recall compared to using `float` vectors. Byte vectors are useful in large-scale applications and use cases that prioritize a reduced memory footprint in exchange for a minimal loss of recall.
 {: .important}
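This page does not prescribe how to produce `byte` vectors from existing `float` embeddings. One possible approach, shown here as a minimal sketch rather than as the plugin's own method, is linear scaling followed by rounding and clipping to the [-128, 127] range. The function name and the `max_abs` parameter are assumptions made for illustration.

```python
from typing import List, Sequence


def quantize_to_bytes(vector: Sequence[float], max_abs: float) -> List[int]:
    """Map floats in [-max_abs, max_abs] to signed 8-bit integers.

    Hypothetical helper: values are scaled so that max_abs maps to 127,
    rounded to the nearest integer, and clipped to [-128, 127].
    """
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    scale = 127.0 / max_abs
    return [max(-128, min(127, round(x * scale))) for x in vector]


# The last value exceeds max_abs, so it is clipped to the upper bound.
print(quantize_to_bytes([0.25, -1.0, 0.0, 2.0], max_abs=1.0))
```

The choice of `max_abs` controls the trade-off: a value that is too small clips many dimensions, while a value that is too large wastes part of the 8-bit range. Both increase the precision loss described above.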
 Introduced in k-NN plugin version 2.9, the optional `data_type` parameter defines the data type of a vector. The default value of this parameter is `float`.
 To use a `byte` vector, set the `data_type` parameter to `byte` when creating mappings for an index:
 ```json
 PUT test-index
 {
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
 }
 ```
 {% include copy-curl.html %}
 Then ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range:
 ```json
 PUT test-index/_doc/1
 {
  "my_vector1": [-126, 28, 127]
 }
 ```
 {% include copy-curl.html %}
 ```json
 PUT test-index/_doc/2
 {
  "my_vector1": [100, -128, 0]
 }
 ```
 {% include copy-curl.html %}
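Because each dimension must fall in the [-128, 127] range, it can help to check vectors on the client side before ingesting or querying. The following is a minimal sketch of such a check. The function name is hypothetical, and it is not part of any OpenSearch client library.

```python
from typing import Sequence


def is_valid_byte_vector(vector: Sequence[object], dimension: int) -> bool:
    """Return True if the vector has the mapped dimension and every value
    is an integer in the signed 8-bit range [-128, 127]."""
    if len(vector) != dimension:
        return False
    return all(
        # bool is a subclass of int in Python, so exclude it explicitly.
        isinstance(v, int) and not isinstance(v, bool) and -128 <= v <= 127
        for v in vector
    )


print(is_valid_byte_vector([-126, 28, 127], dimension=3))  # vector from document 1
print(is_valid_byte_vector([100, -128, 0], dimension=3))   # vector from document 2
print(is_valid_byte_vector([26, -120, 128], dimension=3))  # 128 is out of range
```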
 When querying, be sure to use a `byte` vector:
 ```json
 GET test-index/_search
 {
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [26, -120, 99],
        "k": 2
      }
    }
  }
 }
 ```
 {% include copy-curl.html %}
--- a/_search-plugins/knn/approximate-knn.md
+++ b/_search-plugins/knn/approximate-knn.md
@@ -1,6 +1,6 @@
 ---
 layout: default
-title: Approximate search
+title: Approximate k-NN search
 nav_order: 15
 parent: k-NN
 has_children: false
@@ -79,7 +79,7 @@ PUT my-knn-index-1
 }
 ```
-In the example above, both `knn_vector` fields are configured from method definitions. Additionally, `knn_vector` fields can also be configured from models. You can learn more about this in the [knn_vector data type]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#knn_vector-data-type) section.
+In the example above, both `knn_vector` fields are configured from method definitions. Additionally, `knn_vector` fields can also be configured from models. You can learn more about this in the [knn_vector data type]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) section.
 The `knn_vector` data type supports a vector of floats that can have a dimension count of up to 16,000 for the nmslib and faiss engines, as set by the dimension mapping parameter. The maximum dimension count for the Lucene library is 1,024.
--- a/_search-plugins/knn/knn-index.md
+++ b/_search-plugins/knn/knn-index.md
@@ -1,133 +1,15 @@
 ---
 layout: default
-title: k-NN Index
+title: k-NN index
 nav_order: 5
 parent: k-NN
 has_children: false
 ---
-# k-NN Index
+# k-NN index
 ## knn_vector data type
 The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors
-into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or specifying a model id.
+into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. For more information, see [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/).
 Method definitions are used when the underlying Approximate k-NN algorithm does not require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw* should be used for Approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment files.
 ```json
 "my_vector": {
  "type": "knn_vector",
  "dimension": 4,
  "method": {
    "name": "hnsw",
    "space_type": "l2",
    "engine": "nmslib",
    "parameters": {
      "ef_construction": 128,
      "m": 24
    }
  }
 }
 ```
 Model IDs are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the
 model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model). The
 model contains the information needed to initialize the native library segment files.
 ```json
 "my_vector": {
  "type": "knn_vector",
  "model_id": "my-model"
 }
 ```
 However, if you intend to just use painless scripting or a k-NN score script, you only need to pass the dimension.
 ```json
 "my_vector": {
   "type": "knn_vector",
   "dimension": 128
 }
 ```
 ### Lucene byte vector
 By default, k-NN vectors are `float` vectors, where each dimension is 4 bytes. If you want to save storage space, you can use `byte` vectors with the `lucene` engine. In a `byte` vector, each dimension is a signed 8-bit integer in the [-128, 127] range. 
 Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.
 {: .note}
 When using `byte` vectors, expect some loss of precision in the recall compared to using `float` vectors. Byte vectors are useful in large-scale applications and use cases that prioritize a reduced memory footprint in exchange for a minimal loss of recall.
 {: .important}
 Introduced in k-NN plugin version 2.9, the optional `data_type` parameter defines the data type of a vector. The default value of this parameter is `float`.
 To use a `byte` vector, set the `data_type` parameter to `byte` when creating mappings for an index:
 ```json
 PUT test-index
 {
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
 }
 ```
 {% include copy-curl.html %}
 Then ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range:
 ```json
 PUT test-index/_doc/1
 {
  "my_vector1": [-126, 28, 127]
 }
 ```
 {% include copy-curl.html %}
 ```json
 PUT test-index/_doc/2
 {
  "my_vector1": [100, -128, 0]
 }
 ```
 {% include copy-curl.html %}
 When querying, be sure to use a `byte` vector:
 ```json
 GET test-index/_search
 {
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [26, -120, 99],
        "k": 2
      }
    }
  }
 }
 ```
 {% include copy-curl.html %}
 ## Method definitions