Merge pull request #380 from opensearch-project/previous-versions

Added in previous versions
Keith Chan 2022-01-25 14:57:51 -08:00 committed by GitHub
commit bf28e83f64
GPG Key ID: 4AEE18F83AFDEB23
9 changed files with 158 additions and 115 deletions

View File

@ -180,7 +180,7 @@ services:
volumes:
- opensearch-data2:/usr/share/opensearch/data
- ./custom-opensearch.yml:/usr/share/opensearch/config/opensearch.yml
opensearch-dashboards
opensearch-dashboards:
volumes:
- ./custom-opensearch_dashboards.yml:/usr/share/opensearch-dashboards/config/opensearch_dashboards.yml
```

View File

@ -51,6 +51,78 @@ bin/opensearch-plugin list
</tr>
</thead>
<tbody>
<tr>
<td>1.2.4</td>
<td>
<pre>opensearch-alerting 1.2.4.0
opensearch-anomaly-detection 1.2.4.0
opensearch-asynchronous-search 1.2.4.0
opensearch-cross-cluster-replication 1.2.4.0
opensearch-index-management 1.2.4.0
opensearch-job-scheduler 1.2.4.0
opensearch-knn 1.2.4.0
opensearch-observability 1.2.4.0
opensearch-performance-analyzer 1.2.4.0
opensearch-reports-scheduler 1.2.4.0
opensearch-security 1.2.4.0
opensearch-sql 1.2.4.0
</pre>
</td>
</tr>
<tr>
<td>1.2.3</td>
<td>
<pre>opensearch-alerting 1.2.3.0
opensearch-anomaly-detection 1.2.3.0
opensearch-asynchronous-search 1.2.3.0
opensearch-cross-cluster-replication 1.2.3.0
opensearch-index-management 1.2.3.0
opensearch-job-scheduler 1.2.3.0
opensearch-knn 1.2.3.0
opensearch-observability 1.2.3.0
opensearch-performance-analyzer 1.2.3.0
opensearch-reports-scheduler 1.2.3.0
opensearch-security 1.2.3.0
opensearch-sql 1.2.3.0
</pre>
</td>
</tr>
<tr>
<td>1.2.2</td>
<td>
<pre>opensearch-alerting 1.2.2.0
opensearch-anomaly-detection 1.2.2.0
opensearch-asynchronous-search 1.2.2.0
opensearch-cross-cluster-replication 1.2.2.0
opensearch-index-management 1.2.2.0
opensearch-job-scheduler 1.2.2.0
opensearch-knn 1.2.2.0
opensearch-observability 1.2.2.0
opensearch-performance-analyzer 1.2.2.0
opensearch-reports-scheduler 1.2.2.0
opensearch-security 1.2.2.0
opensearch-sql 1.2.2.0
</pre>
</td>
</tr>
<tr>
<td>1.2.1</td>
<td>
<pre>opensearch-alerting 1.2.1.0
opensearch-anomaly-detection 1.2.1.0
opensearch-asynchronous-search 1.2.1.0
opensearch-cross-cluster-replication 1.2.1.0
opensearch-index-management 1.2.1.0
opensearch-job-scheduler 1.2.1.0
opensearch-knn 1.2.1.0
opensearch-observability 1.2.1.0
opensearch-performance-analyzer 1.2.1.0
opensearch-reports-scheduler 1.2.1.0
opensearch-security 1.2.1.0
opensearch-sql 1.2.1.0
</pre>
</td>
</tr>
<tr>
<td>1.2.0</td>
<td>
@ -67,6 +139,8 @@ opensearch-reports-scheduler 1.2.0.0
opensearch-security 1.2.0.0
opensearch-sql 1.2.0.0
</pre>
</td>
</tr>
<tr>
<td>1.1.0</td>
<td>
@ -83,8 +157,6 @@ opensearch-reports-scheduler 1.1.0.0
opensearch-security 1.1.0.0
opensearch-sql 1.1.0.0
</pre>
</td>
</tr>
</td>
</tr>
<tr>

View File

@ -21,7 +21,7 @@ This page lists all full-text query types and common options. Given the sheer nu
## Match
Creates a [boolean query](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/search/BooleanQuery.html) that returns results if the search term is present in the field.
Creates a [boolean query](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/BooleanQuery.html) that returns results if the search term is present in the field.
The most basic form of the query provides only a field (`title`) and a term (`wind`):
@ -126,7 +126,7 @@ GET _search
## Match boolean prefix
Similar to [match](#match), but creates a [prefix query](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/search/PrefixQuery.html) out of the last term in the query string.
Similar to [match](#match), but creates a [prefix query](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PrefixQuery.html) out of the last term in the query string.
```json
GET _search
@ -164,7 +164,7 @@ GET _search
## Match phrase
Creates a [phrase query](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/search/PhraseQuery.html) that matches a sequence of terms.
Creates a [phrase query](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PhraseQuery.html) that matches a sequence of terms.
```json
GET _search
@ -198,7 +198,7 @@ GET _search
## Match phrase prefix
Similar to [match phrase](#match-phrase), but creates a [prefix query](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/search/PrefixQuery.html) out of the last term in the query string.
Similar to [match phrase](#match-phrase), but creates a [prefix query](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PrefixQuery.html) out of the last term in the query string.
```json
GET _search
@ -410,7 +410,7 @@ Option | Valid values | Description
`allow_leading_wildcard` | Boolean | Whether `*` and `?` are allowed as the first character of a search term. The default is true.
`analyze_wildcard` | Boolean | Whether OpenSearch should attempt to analyze wildcard terms. Some analyzers do a poor job at this task, so the default is false.
`analyzer` | `standard, simple, whitespace, stop, keyword, pattern, <language>, fingerprint` | The analyzer you want to use for the query. Different analyzers have different character filters, tokenizers, and token filters. The `stop` analyzer, for example, removes stop words (e.g. "an," "but," "this") from the query string.
`auto_generate_synonyms_phrase_query` | Boolean | A value of true (default) automatically generates [phrase queries](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/search/PhraseQuery.html) for multi-term synonyms. For example, if you have the synonym `"ba, batting average"` and search for "ba," OpenSearch searches for `ba OR "batting average"` (if this option is true) or `ba OR (batting AND average)` (if this option is false).
`auto_generate_synonyms_phrase_query` | Boolean | A value of true (default) automatically generates [phrase queries](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PhraseQuery.html) for multi-term synonyms. For example, if you have the synonym `"ba, batting average"` and search for "ba," OpenSearch searches for `ba OR "batting average"` (if this option is true) or `ba OR (batting AND average)` (if this option is false).
`boost` | Floating-point | Boosts the clause by the given multiplier. Useful for weighing clauses in compound queries. The default is 1.0.
`cutoff_frequency` | Between `0.0` and `1.0` or a positive integer | This value lets you define high and low frequency terms based on number of occurrences in the index. Numbers between 0 and 1 are treated as a percentage. For example, 0.10 is 10%. This value means that if a word occurs within the search field in more than 10% of the documents on the shard, OpenSearch considers the word "high frequency" and deemphasizes it when calculating search score.<br /><br />Because this setting is *per shard*, testing its impact on search results can be challenging unless a cluster has many documents.
`enable_position_increments` | Boolean | When true, result queries are aware of position increments. This setting is useful when the removal of stop words leaves an unwanted "gap" between terms. The default is true.
@ -420,7 +420,7 @@ Option | Valid values | Description
`fuzzy_transpositions` | Boolean | Setting `fuzzy_transpositions` to true (default) adds swaps of adjacent characters to the insert, delete, and substitute operations of the `fuzziness` option. For example, the distance between `wind` and `wnid` is 1 if `fuzzy_transpositions` is true (swap "n" and "i") and 2 if it is false (delete "n", insert "n"). <br /><br />If `fuzzy_transpositions` is false, `rewind` and `wnid` have the same distance (2) from `wind`, despite the more human-centric opinion that `wnid` is an obvious typo. The default is a good choice for most use cases.
`lenient` | Boolean | Setting `lenient` to true lets you ignore data type mismatches between the query and the document field. For example, a query string of "8.2" could match a field of type `float`. The default is false.
`low_freq_operator` | `and, or` | The operator for low-frequency terms. The default is `or`. See [Common terms](#common-terms) queries and `operator` in this table.
`max_determinized_states` | Positive integer | The maximum number of "[states](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/util/automaton/Operations.html#DEFAULT_MAX_DETERMINIZED_STATES)" (a measure of complexity) that Lucene can create for query strings that contain regular expressions (e.g. `"query": "/wind.+?/"`). Larger numbers allow for queries that use more memory. The default is 10,000.
`max_determinized_states` | Positive integer | The maximum number of "[states](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/util/automaton/Operations.html#DEFAULT_MAX_DETERMINIZED_STATES)" (a measure of complexity) that Lucene can create for query strings that contain regular expressions (e.g. `"query": "/wind.+?/"`). Larger numbers allow for queries that use more memory. The default is 10,000.
`max_expansions` | Positive integer | Fuzzy queries "expand to" a number of matching terms that are within the distance specified in `fuzziness`. Then OpenSearch tries to match those terms against its indices. `max_expansions` specifies the maximum number of terms that the fuzzy query expands to. The default is 50.
`minimum_should_match` | Positive or negative integer, positive or negative percentage, combination | If the query string contains multiple search terms and you used the `or` operator, the number of terms that need to match for the document to be considered a match. For example, if `minimum_should_match` is 2, "wind often rising" does not match "The Wind Rises." If `minimum_should_match` is 1, it matches. This option also has `low_freq` and `high_freq` properties for [Common terms](#common-terms) queries.
`operator` | `or, and` | If the query string contains multiple search terms, whether all terms need to match (`and`) or only one term needs to match (`or`) for a document to be considered a match.
@ -428,7 +428,7 @@ Option | Valid values | Description
`prefix_length` | `0` (default) or a positive integer | The number of leading characters that are not considered in fuzziness.
`quote_field_suffix` | String | This option lets you search different fields depending on whether terms are wrapped in quotes. For example, if `quote_field_suffix` is `".exact"` and you search for `"lightly"` (in quotes) in the `title` field, OpenSearch searches the `title.exact` field. This second field might use a different type (e.g. `keyword` rather than `text`) or a different analyzer. The default is null.
`rewrite` | `constant_score, scoring_boolean, constant_score_boolean, top_terms_N, top_terms_boost_N, top_terms_blended_freqs_N` | Determines how OpenSearch rewrites and scores multi-term queries. The default is `constant_score`.
`slop` | `0` (default) or a positive integer | Controls the degree to which words in a query can be misordered and still be considered a match. From the [Lucene documentation](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/search/PhraseQuery.html#getSlop--): "The number of other words permitted between words in query phrase. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two. A value of zero requires an exact match."
`slop` | `0` (default) or a positive integer | Controls the degree to which words in a query can be misordered and still be considered a match. From the [Lucene documentation](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop--): "The number of other words permitted between words in query phrase. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two. A value of zero requires an exact match."
`tie_breaker` | `0.0` (default) to `1.0` | Changes the way OpenSearch scores searches. For example, a `type` of `best_fields` typically uses the highest score from any one field. If you specify a `tie_breaker` value between 0.0 and 1.0, the score changes to highest score + `tie_breaker` * score for all other matching fields. If you specify a value of 1.0, OpenSearch adds together the scores for all matching fields (effectively defeating the purpose of `best_fields`).
`time_zone` | UTC offset | The time zone to use (e.g. `-08:00`) if the query string contains a date range (e.g. `"query": "wind rises release_date[2012-01-01 TO 2014-01-01]"`). The default is `UTC`.
`type` | `best_fields, most_fields, cross_fields, phrase, phrase_prefix` | Determines how OpenSearch executes the query and scores the results. The default is `best_fields`.

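Two options in the table above describe small calculations: the edit distance behind `fuzzy_transpositions` and the score combination behind `tie_breaker`. The following is a minimal sketch of both, not Lucene's or OpenSearch's implementation. The function names are hypothetical, and the distance uses the "optimal string alignment" variant of edit distance as an assumption.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Edit distance between a and b.

    With transpositions=True, swapping two adjacent characters costs 1
    (optimal string alignment). With False, this is plain Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # substitute (or keep)
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbors
    return d[-1][-1]


def tie_breaker_score(field_scores, tie_breaker=0.0):
    """Highest field score plus tie_breaker times the sum of the other scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others
```

With these definitions, `edit_distance("wind", "wnid")` is 1 and `edit_distance("wind", "wnid", transpositions=False)` is 2, which matches the `wind`/`wnid` example in the `fuzzy_transpositions` row. A `tie_breaker` of 1.0 makes `tie_breaker_score` equal to the sum of all field scores.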
View File

@ -448,5 +448,5 @@ GET shakespeare/_search
A few important notes:
- Regular expressions are applied to the terms in the field (i.e. tokens), not the entire field.
- Regular expressions use the Lucene syntax, which differs from more standardized implementations. Test thoroughly to ensure that you receive the results you expect. To learn more, see [the Lucene documentation](https://lucene.apache.org/core/{{site.lucene_version}}/core/index.html).
- Regular expressions use the Lucene syntax, which differs from more standardized implementations. Test thoroughly to ensure that you receive the results you expect. To learn more, see [the Lucene documentation](https://lucene.apache.org/core/8_9_0/core/index.html).
- `regexp` queries can be expensive operations and require the `search.allow_expensive_queries` setting to be set to `true`. Before making frequent `regexp` queries, test their impact on cluster performance and examine alternative queries for achieving similar results.

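The first note above, that a regular expression is applied to each term rather than to the whole field, can be illustrated with a short sketch. This is not how OpenSearch evaluates `regexp` queries: the tokenizer is a hypothetical stand-in for an analyzer, Python's `re` syntax differs from Lucene's, and the sketch assumes that the pattern must match an entire term.

```python
import re


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase the text and
    # split it on anything that is not a letter or digit.
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def any_token_matches(text, pattern):
    # The pattern is tested against each term separately and must
    # match the whole term (assumption: anchored matching).
    compiled = re.compile(pattern)
    return any(compiled.fullmatch(tok) for tok in tokenize(text))
```

For the field value `To be, or not to be`, the pattern `to.*be` matches the lowercased field as a whole, but `any_token_matches` returns `False` because no single term matches it. The pattern `n[a-z]t` returns `True` because it matches the term `not`.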
View File

@ -9,27 +9,22 @@ has_math: true
# Approximate k-NN search
The approximate k-NN search method uses nearest neighbor algorithms from *nmslib* and *faiss* to power
k-NN search. To see the algorithms that the plugin currently supports, check out the [k-NN Index documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions).
In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest
neighbors. Of the three search methods the plugin provides, this method offers the best search scalability for large
data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach is
preferred.
The approximate k-NN search method uses nearest neighbor algorithms from *nmslib* and *faiss* to power
k-NN search. To see the algorithms that the plugin currently supports, check out the [k-NN Index documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions).
In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest neighbors. Of the three search methods the plugin provides, this method offers the best search scalability for large data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach is preferred.
The k-NN plugin builds a native library index of the vectors for each "knn-vector field"/ "Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, see the [Apache Lucene documentation](https://lucene.apache.org/core/8_11_1/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description).
The k-NN plugin builds a native library index of the vectors for each "knn-vector field"/ "Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, see the [Apache Lucene documentation](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description).
These native library indices are loaded into native memory during search and managed by a cache. To learn more about
pre-loading native library indices into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup-operation). Additionally, you can see what native library indices are already loaded in memory, which you can learn more about in the [stats API section]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).
Because the native library indices are constructed during indexing, it is not possible to apply a filter on an index
and then use this search method. All filters are applied on the results produced by the approximate nearest neighbor
search.
and then use this search method. All filters are applied on the results produced by the approximate nearest neighbor search.
## Get started with approximate k-NN
To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with setting `index.knn`
to `true`. This setting tells the plugin to create native library indices for the index.
To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with setting `index.knn` to `true`. This setting tells the plugin to create native library indices for the index.
Next, you must add one or more fields of the `knn_vector` data type. This example creates an index with two
Next, you must add one or more fields of the `knn_vector` data type. This example creates an index with two `knn_vector` fields, one using *faiss* and the other using *nmslib*:
```json
@ -74,14 +69,12 @@ PUT my-knn-index-1
}
```
In the example above, both `knn_vector`'s are configured from method definitions. Additionally, `knn_vector`'s can also
be configured from models. Learn more about it [here]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#knn_vector-data-type)!
In the example above, both `knn_vector`s are configured from method definitions. `knn_vector`s can also be configured from models. Learn more about it [here]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#knn_vector-data-type)!
The `knn_vector` data type supports a vector of floats that can have a dimension of up to 10,000, as set by the
The `knn_vector` data type supports a vector of floats that can have a dimension of up to 10,000, as set by the
dimension mapping parameter.
In OpenSearch, codecs handle the storage and retrieval of indices. The k-NN plugin uses a custom codec to write vector
data to native library indices so that the underlying k-NN search library can read it.
In OpenSearch, codecs handle the storage and retrieval of indices. The k-NN plugin uses a custom codec to write vector data to native library indices so that the underlying k-NN search library can read it.
{: .tip }
After you create the index, you can add some data to it:
@ -126,22 +119,16 @@ GET my-knn-index-1/_search
}
```
`k` is the number of neighbors the search of each graph will return. You must also include the `size` option, which
indicates how many results the query actually returns. The plugin returns `k` amount of results for each shard
`k` is the number of neighbors the search of each graph will return. You must also include the `size` option, which
indicates how many results the query actually returns. The plugin returns `k` results for each shard
(and each segment) and `size` results for the entire query. The plugin supports a maximum `k` value of 10,000.
### Building a k-NN index from a model
For some of the algorithms that we support, the native library index needs to be trained before it can be used. Training
everytime a segment is created would be very expensive, so, instead, we introduce the concept of a *model* that is used
to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model),
passing in the source of training data as well as the method definition of the model. Once training is complete, the
model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to
initialize the segments.
For some of the algorithms that we support, the native library index needs to be trained before it can be used. It would be expensive to train every newly created segment, so, instead, we introduce the concept of a *model* that is used to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model), passing in the source of training data as well as the method definition of the model. Once training is complete, the model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.
In order to train a model, we first need an OpenSearch index with training data in it. Training data can come from
any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be
the same data that you are going to index or a separate set. Let's create a training index:
To train a model, we first need an OpenSearch index with training data in it. Training data can come from
any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be the same data that you are going to index or a separate set. Let's create a training index:
```json
PUT /train-index
@ -159,12 +146,12 @@ PUT /train-index
}
}
}
```
```
Notice that `index.knn` is not set in the index settings. This ensures that we do not create native library indices for
this index.
Notice that `index.knn` is not set in the index settings. This ensures that we do not create native library indices for this index.
Next, let's add some data to it:
```json
POST _bulk
{ "index": { "_index": "train-index", "_id": "1" } }
@ -178,7 +165,8 @@ POST _bulk
...
```
After indexing into the training index completes, we can call our the Train API:
After indexing into the training index completes, we can call the Train API:
```json
POST /_plugins/_knn/models/_train/my-model
{
@ -204,15 +192,17 @@ POST /_plugins/_knn/models/_train/my-model
```
The Train API will return as soon as the training job is started. To check its status, we can use the Get Model API:
```json
GET /_plugins/_knn/models/my-model?filter_path=state&pretty
{
"state": "training"
}
```
```
Once the model enters the "created" state, we can create an index that will use this model to initialize it's native
Once the model enters the "created" state, we can create an index that will use this model to initialize its native
library indices:
```json
PUT /target-index
{
@ -249,8 +239,7 @@ POST _bulk
After data is ingested, it can be searched just like any other `knn_vector` field!
### Using approximate k-NN with filters
If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer
than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
```json
GET my-knn-index-1/_search
@ -277,12 +266,7 @@ GET my-knn-index-1/_search
## Spaces
A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest
neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how
OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores,
we take 1 / (1 + distance). The k-NN plugin the spaces the plugin supports are below. Not every method supports each of
these spaces. Be sure to check out [the method documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) to make sure the space you are
interested in is supported.
A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The spaces the k-NN plugin supports are listed below. Not every method supports each of these spaces. Be sure to check out [the method documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) to make sure the space you are interested in is supported.
<table>
<thead style="text-align: left">
@ -321,7 +305,7 @@ interested in is supported.
</tr>
</table>
The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equates
smaller scores with closer results, they return `1 - cosineSimilarity` for cosine similarity space---that's why `1 -` is
The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equate
smaller scores with closer results, they return `1 - cosineSimilarity` for cosine similarity space---that's why `1 -` is
included in the distance function.
{: .note }

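The conversion described in the Spaces section, `1 / (1 + distance)`, and the `1 - cosineSimilarity` distance from the note above can be sketched as follows. This is an illustrative calculation under stated assumptions, not the plugin's code: the function names are hypothetical, and the exact distance function for each space is defined in the table that this diff elides.

```python
import math


def cosine_distance(a, b):
    # Distance for the cosine similarity space: 1 - cosineSimilarity,
    # so that a smaller value means a closer vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def to_score(distance):
    # Convert a distance (lower is better) into a score (higher is better).
    return 1.0 / (1.0 + distance)
```

A distance of 0 maps to a score of 1.0, a distance of 1 maps to 0.5, and larger distances approach 0, so ordering by descending score is the same as ordering by ascending distance.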
View File

@ -18,7 +18,7 @@ This plugin supports three different methods for obtaining the k-nearest neighbo
1. **Approximate k-NN**
The first method takes an approximate nearest neighbor approach---it uses one of several different algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints and more scalable search. To learn more about the algorithms, please refer to [*nmslib*](https://github.com/nmslib/nmslib/blob/master/manual/README.md)'s and [*faiss*](https://github.com/facebookresearch/faiss/wiki)'s documentation.
The first method takes an approximate nearest neighbor approach---it uses one of several algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints and more scalable search. To learn more about the algorithms, refer to [*nmslib*](https://github.com/nmslib/nmslib/blob/master/manual/README.md)'s and [*faiss*](https://github.com/facebookresearch/faiss/wiki)'s documentation.
Approximate k-NN is the best choice for searches over large indices (i.e. hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the script scoring method or painless extensions.

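The difference between filtering before and after a k-NN search can be shown with a small sketch. It uses an exact brute-force search as a stand-in for the approximate search, and the documents, the `price` field, and the function names are all hypothetical.

```python
import heapq


def l2(a, b):
    # Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn(docs, query, k):
    # Exact (brute-force) search: the k documents closest to the query.
    return heapq.nsmallest(k, docs, key=lambda d: l2(d["vector"], query))


def post_filter_search(docs, query, k, predicate):
    # Filter applied to the k-NN results: may return fewer than k documents.
    return [d for d in knn(docs, query, k) if predicate(d)]


def pre_filter_search(docs, query, k, predicate):
    # Filter applied first: k-NN runs only over the matching documents.
    return knn([d for d in docs if predicate(d)], query, k)


docs = [
    {"id": 1, "vector": [1.0, 1.0], "price": 5},
    {"id": 2, "vector": [2.0, 2.0], "price": 20},
    {"id": 3, "vector": [3.0, 3.0], "price": 7},
    {"id": 4, "vector": [9.0, 9.0], "price": 3},
]


def cheap(doc):
    return doc["price"] < 10


post_ids = [d["id"] for d in post_filter_search(docs, [1.2, 1.2], 2, cheap)]
pre_ids = [d["id"] for d in pre_filter_search(docs, [1.2, 1.2], 2, cheap)]
```

With `k` set to 2, filtering the results afterward leaves one document because the second-nearest neighbor fails the filter, while filtering first still yields two documents. Filtering first requires a search method that can run over an arbitrary subset of vectors, which is why the paragraph above points to the script scoring method or painless extensions for that case.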
View File

@ -8,10 +8,10 @@ has_children: false
# JNI libraries
To integrate [*nmslib*'s](https://github.com/nmslib/nmslib/) and [*faiss*'s](https://github.com/facebookresearch/faiss/) Approximate k-NN functionality (implemented in C++) into the k-NN plugin (implemented in Java), we created a Java Native Interface, which lets the k-NN plugin make calls to the native libraries. To implement this, we create 3 libraries: `libopensearchknn_nmslib`, the JNI library that interfaces with nmslib, `libopensearchknn_faiss`, the JNI library that interfaces with faiss, and `libopensearchknn_common`, a library containing common shared functionality between native libraries.
To integrate [*nmslib*'s](https://github.com/nmslib/nmslib/) and [*faiss*'s](https://github.com/facebookresearch/faiss/) Approximate k-NN functionality (implemented in C++) into the k-NN plugin (implemented in Java), we created a Java Native Interface, which lets the k-NN plugin make calls to the native libraries. We create 3 libraries: `libopensearchknn_nmslib`, the JNI library that interfaces with nmslib, `libopensearchknn_faiss`, the JNI library that interfaces with faiss, and `libopensearchknn_common`, a library containing common shared functionality between native libraries.
The libraries `libopensearchknn_faiss` and `libopensearchknn_nmslib` are lazily loaded when they are first called in the plugin. This means that if you are only planning on using one of the libraries, the other one will never be loaded.
The libraries `libopensearchknn_faiss` and `libopensearchknn_nmslib` are lazily loaded when they are first called in the plugin. This means that if you are only planning on using one of the libraries, the plugin never loads the other library.
For building the libraries from source, please refer to the [DEVELOPER_GUIDE](https://github.com/opensearch-project/k-NN/blob/main/DEVELOPER_GUIDE.md).
To build the libraries from source, refer to the [DEVELOPER_GUIDE](https://github.com/opensearch-project/k-NN/blob/main/DEVELOPER_GUIDE.md).
For more information about JNI, see [Java Native Interface](https://en.wikipedia.org/wiki/Java_Native_Interface) on Wikipedia.

View File

@ -11,33 +11,28 @@ has_children: false
## knn_vector data type
The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors
into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and
can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method
definition or specifying a model id.
into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or specifying a model ID.
Method definitions are used when the underlying Approximate k-NN algorithm does not
require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw*
should be used for Approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment
files.
Method definitions are used when the underlying Approximate k-NN algorithm does not require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw* should be used for Approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment files.
```json
"my_vector": {
"type": "knn_vector",
"dimension": 4,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
}
```
Model id's are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the
model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model). The
```
Model IDs are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the
model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model). The
model contains the information needed to initialize the native library segment files.
```json
@ -55,13 +50,10 @@ However, if you intend to just use painless scripting or a k-NN score script, yo
## Method Definitions
A method definition refers to the underlying configuration of the Approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model).
A method definition will always contain the name of the method, the space_type the method is built for, the engine
(the native library) to use, and a map of parameters.
Mapping Parameter | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
@ -107,45 +99,46 @@ Paramater Name | Required | Default | Updatable | Description
Parameter Name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`nlists` | false | 4 | false | Number of buckets to partition vectors into. Higher values may lead to more accurate searches, at the expense of memory and training latency. For more information about choosing the right value, refer to [*faiss*'s documentation](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index).
`nprobes` | false | 1 | false | Number of buckets to search over during query. Higher values lead to more accurate but slower searches.
`encoder` | false | flat | false | Encoder definition for encoding vectors. Encoders can reduce the memory footprint of your index, at the expense of search accuracy.
For more information about setting these parameters, please refer to [*faiss*'s documentation](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes).
#### IVF training requirements
The IVF algorithm requires a training step. To create an index that uses IVF, you need to train a model with the
[Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model), passing the IVF method definition. IVF requires at least `nlists` training
data points, but it is [recommended to use more](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset).
Training data can be either the same data that is going to be ingested or a separate set of data.
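The minimum training-set requirement can be checked before calling the Train API. The sketch below is only an illustration: the helper `check_ivf_training_set` is hypothetical, and the method definition uses the default `nlists` and `nprobes` values from the table above.

```python
def check_ivf_training_set(num_training_points, nlists=4):
    """Raise if there are fewer training points than IVF buckets.

    Hypothetical pre-flight check; the plugin performs its own validation.
    """
    if num_training_points < nlists:
        raise ValueError(
            f"IVF needs at least nlists={nlists} training points, "
            f"got {num_training_points}"
        )
    return num_training_points


# IVF method definition to pass during training (defaults from the table).
ivf_method = {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {"nlists": 4, "nprobes": 1},
}

check_ivf_training_set(10_000, ivf_method["parameters"]["nlists"])
```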
### Supported faiss encoders
You can use encoders to reduce the memory footprint of a k-NN index at the expense of search accuracy. *faiss* has
several encoder types, but currently, the plugin only supports *flat* and *pq* encoding.
An example method definition that specifies an encoder may look something like this:
```json
"method": {
  "name":"hnsw",
  "engine":"faiss",
  "parameters":{
    "encoder":{
      "name":"pq",
      "parameters":{
        "code_size": 8,
        "m": 8
      }
    }
  }
}
```
Encoder Name | Requires Training? | Description
:--- | :--- | :---
`flat` | false | Encode vectors as floating point arrays. This encoding does not reduce memory footprint.
`pq` | true | Short for product quantization, it is a lossy compression technique that encodes a vector into a fixed size of bytes using clustering, with the goal of minimizing the drop in k-NN search accuracy. From a high level, vectors are broken up into `m` subvectors, and then each subvector is represented by a `code_size` code obtained from a code book produced during training. For more details on product quantization, here is a [great blog post](https://medium.com/dotstar/understanding-faiss-part-2-79d90b1e5388)!
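
To make the memory trade-off concrete, here is a small worked calculation. It rests on two assumptions that this section does not state: `code_size` is measured in bits per subvector code, and a `flat` encoding stores each dimension as a 4-byte float. The function names are hypothetical, and index overhead (graph links, code books) is ignored.

```python
def flat_bytes_per_vector(dimension, bytes_per_float=4):
    """Size of one uncompressed vector (assumes 4-byte floats)."""
    return dimension * bytes_per_float


def pq_bytes_per_vector(dimension, m, code_size):
    """Size of one PQ-encoded vector.

    Assumes code_size is the number of bits per subvector code and that
    the dimension splits evenly into m subvectors.
    """
    if dimension % m != 0:
        raise ValueError("dimension must be divisible by m")
    total_bits = m * code_size
    return (total_bits + 7) // 8  # round up to whole bytes


flat = flat_bytes_per_vector(256)                  # 1024 bytes
pq = pq_bytes_per_vector(256, m=8, code_size=8)    # 8 bytes
print(flat, pq, flat // pq)                        # prints: 1024 8 128
```

Under these assumptions, the encoder settings from the example above shrink each 256-dimensional vector by a factor of 128, which is where the loss in search accuracy comes from.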
#### PQ Parameters
@ -156,22 +149,18 @@ Paramater Name | Required | Default | Updatable | Description
### Choosing the right method
There are many options to choose from when building your `knn_vector` field. To determine the right method and parameters, first understand the requirements of your workload and the trade-offs you are willing to make. Factors to consider are (1) query latency, (2) query quality, (3) memory limits, and (4) indexing latency.
If memory is not a concern, HNSW offers a very strong query latency/query quality tradeoff.
If you want to use less memory and index faster than HNSW, while maintaining similar query quality, you should evaluate IVF.
If memory is a concern, consider adding a PQ encoder to your HNSW or IVF index. Because PQ is a lossy encoding, query quality will drop.
### Memory Estimation
In a typical OpenSearch cluster, a certain portion of RAM is set aside for the JVM heap. The k-NN plugin allocates
native library indices to a portion of the remaining RAM. This portion's size is determined by
the `circuit_breaker_limit` cluster setting. By default, the limit is set at 50%.
Having a replica doubles the total number of vectors.
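
The budget described above can be sketched as simple arithmetic. The function names are hypothetical, and the model is deliberately simplified: it treats all RAM outside the JVM heap as the base for the `circuit_breaker_limit` percentage and ignores operating system and file cache needs.

```python
def native_memory_budget_gb(total_ram_gb, jvm_heap_gb, circuit_breaker_limit=0.5):
    """RAM available to native library indices on one node.

    Simplified model: (RAM left after the JVM heap) * circuit_breaker_limit.
    """
    return (total_ram_gb - jvm_heap_gb) * circuit_breaker_limit


def total_vectors(primary_vectors, replicas=0):
    """Vectors held across primaries and replicas."""
    return primary_vectors * (1 + replicas)


print(native_memory_budget_gb(64, 32))       # prints: 16.0
print(total_vectors(1_000_000, replicas=1))  # prints: 2000000
```

For a node with 64 GB of RAM and a 32 GB heap, the default 50% limit leaves about 16 GB for native indices, and one replica doubles the number of vectors that must fit into that budget across the cluster.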
@ -202,10 +191,7 @@ As an example, assume you have a million vectors with a dimension of 256 and nli
Additionally, the k-NN plugin introduces several index settings that can be used to configure the k-NN structure as well.
At the moment, several parameters defined in the settings are in the deprecation process. Those parameters should be set in the mapping instead of the index settings. Parameters set in the mapping will override the parameters set in the index settings. Setting the parameters in the mapping allows an index to have multiple `knn_vector` fields with different parameters.
Setting | Default | Updatable | Description
:--- | :--- | :--- | :---

View File

@ -45,7 +45,7 @@ source ./bin/activate
2. Install the CLI:
```
pip3 install opensearchsql
```
The SQL CLI only works with Python 3.
@ -63,7 +63,7 @@ When you first launch the SQL CLI, a configuration file is automatically created
You can configure the following connection properties:
- `endpoint`: You do not need to specify an option. Anything that follows the launch command `opensearchsql` is treated as the endpoint. If you do not provide an endpoint, the SQL CLI connects to http://localhost:9200 by default.
- `-u/-w`: Supports username and password for HTTP basic authentication, such as with the security plugin or fine-grained access control for Amazon OpenSearch Service.
- `--aws-auth`: Turns on AWS SigV4 authentication to connect to an Amazon OpenSearch endpoint. Use with the AWS CLI (`aws configure`) to retrieve the local AWS configuration to authenticate and connect.
@ -97,5 +97,6 @@ Run a single query with the following options:
## CLI options
- `-l`: Query language option. Available options are `sql` and `ppl`. Default is `sql`
- `-p`: Always use pager to display output
- `--clirc`: Provide path for the configuration file