The approximate k-NN search method uses nearest neighbor algorithms from *nmslib* and *faiss* to power
k-NN search. To see the algorithms that the plugin currently supports, check out the [k-NN Index documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions).
In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest
neighbors. Of the three search methods the plugin provides, this method offers the best search scalability for large
data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach is
The k-NN plugin builds a native library index of the vectors for each `knn_vector` field/Lucene segment pair during
indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about
Lucene segments, see the [Apache Lucene documentation](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description).
These native library indices are loaded into native memory during search and managed by a cache. To learn more about
pre-loading native library indices into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup-operation).
Additionally, you can see which native library indices are already loaded in memory. To learn more, see the
[stats API section]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).
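
As a rough sketch of how the two APIs mentioned above might be called, the following requests first warm up an index and then check the plugin statistics. The index name `my-knn-index` is a hypothetical placeholder; substitute an index that contains `knn_vector` fields.

```json
# "my-knn-index" is a hypothetical index name used for illustration only
GET /_plugins/_knn/warmup/my-knn-index?pretty

GET /_plugins/_knn/stats?pretty
```

The warmup request loads the native library indices for the listed index into native memory ahead of the first search, and the stats response reports cache-related information for each node.
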
`k` is the number of neighbors that the search of each graph returns. You must also include the `size` option, which
indicates how many results the query actually returns. The plugin returns `k` results for each shard
(and each segment) and `size` results for the entire query. The plugin supports a maximum `k` value of 10,000.
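
To make the relationship between `k` and `size` concrete, here is a minimal example query. The index name `my-knn-index` and field name `my_vector` are hypothetical, and the query vector assumes a `knn_vector` field with a dimension of 4.

```json
# Hypothetical index ("my-knn-index") and field ("my_vector"); the field is assumed to have dimension 4
GET /my-knn-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}
```

In this sketch, each graph search returns up to 2 neighbors (`k`), and the query as a whole returns at most 2 hits (`size`), regardless of how many shards and segments contributed candidates.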
### Building a k-NN index from a model
For some of the algorithms that we support, the native library index needs to be trained before it can be used. Training
every time a segment is created would be very expensive, so we instead introduce the concept of a *model* that is used
to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model),
passing in the source of training data as well as the method definition of the model. Once training is complete, the
model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to
initialize the segments.
In order to train a model, we first need an OpenSearch index with training data in it. Training data can come from
any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be
the same data that you are going to index or a separate set. Let's create a training index:
```json
PUT /train-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "train-field": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}
```
Notice that `index.knn` is not set in the index settings. This ensures that we do not create native library indices for
A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest
neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how
OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores,
we take 1 / (1 + distance). The spaces the k-NN plugin supports are listed below. Not every method supports each of
these spaces. Be sure to check out [the method documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) to make sure the space you are