Add k-NN Faiss filtering documentation (#4476)
* Add k-NN Faiss filtering documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Move the note
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Add faiss and a filter table
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Refactor boolean filtering section
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Clarified that Faiss works with hnsw only
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Add more Faiss filtering information
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _search-plugins/knn/filter-search-knn.md
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Implemented editorial comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Implemented one more editorial comment
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
parent 06665364fd
commit 6c83dfd87c
|
@ -242,30 +242,8 @@ POST _bulk
|
|||
After data is ingested, it can be searched just like any other `knn_vector` field!
|
||||
|
||||
### Using approximate k-NN with filters
|
||||
If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
|
||||
|
||||
```json
|
||||
GET my-knn-index-1/_search
|
||||
{
|
||||
"size": 2,
|
||||
"query": {
|
||||
"knn": {
|
||||
"my_vector2": {
|
||||
"vector": [2, 3, 5, 6],
|
||||
"k": 2
|
||||
}
|
||||
}
|
||||
},
|
||||
"post_filter": {
|
||||
"range": {
|
||||
"price": {
|
||||
"gte": 5,
|
||||
"lte": 10
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
To learn about using filters with k-NN search, see [k-NN search with filters]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/).
|
||||
|
||||
## Spaces
|
||||
|
||||
|
|
|
@ -11,12 +11,24 @@ has_math: true
|
|||
|
||||
To refine k-NN results, you can filter a k-NN search using one of the following methods:
|
||||
|
||||
- [Scoring script filter](#scoring-script-filter): This approach involves pre-filtering a document set and then running an exact k-NN search on the filtered subset. It does not scale for large filtered subsets.
|
||||
- [Efficient k-NN filtering](#efficient-k-nn-filtering): This approach applies filtering _during_ the k-NN search, as opposed to before or after the k-NN search, which ensures that `k` results are returned (if there are at least `k` results in total). This approach is supported by the following engines:
|
||||
- Lucene engine with a Hierarchical Navigable Small World (HNSW) algorithm (k-NN plugin versions 2.4 and later)
|
||||
- Faiss engine with an HNSW algorithm (k-NN plugin versions 2.9 and later)
|
||||
|
||||
- [Boolean filter](#boolean-filter-with-ann-search): This approach runs an [approximate nearest neighbor (ANN)]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn) search and then applies a filter to the results. Because of post-filtering, it may return significantly fewer than `k` results for a restrictive filter.
|
||||
- [Post-filtering](#post-filtering): Because it is performed after the k-NN search, this approach may return significantly fewer than `k` results for a restrictive filter. You can use the following two filtering strategies for this approach:
|
||||
- [Boolean post-filter](#boolean-filter-with-ann-search): This approach runs an [approximate nearest neighbor (ANN)]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/) search and then applies a filter to the results. The two query parts are executed independently, and then the results are combined based on the query operator (`should`, `must`, and so on) provided in the query.
|
||||
- [The `post_filter` parameter](#post-filter-parameter): This approach runs an [ANN]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/) search on the full dataset and then applies the filter to the k-NN results.
|
||||
|
||||
- [Lucene k-NN filter](#using-a-lucene-k-nn-filter): This approach applies filtering _during_ the k-NN search, as opposed to before or after the k-NN search, which ensures that `k` results are returned. You can only use this method with the Hierarchical Navigable Small World (HNSW) algorithm implemented by the Lucene search engine in k-NN plugin versions 2.4 and later.
|
||||
- [Scoring script filter](#scoring-script-filter): This approach involves pre-filtering a document set and then running an exact k-NN search on the filtered subset. It may have high latency and does not scale when filtered subsets are large.
|
||||
|
||||
The following table summarizes the preceding filtering use cases.
|
||||
|
||||
Filter | When the filter is applied | Type of search | Supported engines and methods | Where to place the `filter` clause
|
||||
:--- | :--- | :--- | :--- | :---
|
||||
Efficient k-NN filtering | During search (a hybrid of pre- and post-filtering) | Approximate | - `lucene` (`hnsw`) <br> - `faiss` (`hnsw`) | Inside the k-NN query clause.
|
||||
Boolean filter | After search (post-filtering) | Approximate | - `lucene`<br> - `nmslib`<br> - `faiss` | Outside the k-NN query clause. Must be a leaf clause.
|
||||
The `post_filter` parameter | After search (post-filtering) | Approximate | - `lucene`<br> - `nmslib`<br> - `faiss` | Outside the k-NN query clause.
|
||||
Scoring script filter | Before search (pre-filtering) | Exact | N/A | Inside the script score query clause.
|
||||
|
||||
## Filtered search optimization
|
||||
|
||||
|
@ -31,56 +43,71 @@ Once you've estimated the number of documents in your index, the restrictiveness
|
|||
| Number of documents in an index | Percentage of documents the filter returns | k | Filtering method to use for higher recall | Filtering method to use for lower latency |
|
||||
| :-- | :-- | :-- | :-- | :-- |
|
||||
| 10M | 2.5 | 100 | Scoring script | Scoring script |
|
||||
| 10M | 38 | 100 | Lucene filter | Boolean filter |
|
||||
| 10M | 80 | 100 | Scoring script | Lucene filter |
|
||||
| 1M | 2.5 | 100 | Lucene filter | Scoring script |
|
||||
| 1M | 38 | 100 | Lucene filter | Lucene filter/scoring script |
|
||||
| 1M | 80 | 100 | Boolean filter | Lucene filter |
|
||||
| 10M | 38 | 100 | Efficient k-NN filtering | Boolean filter |
|
||||
| 10M | 80 | 100 | Scoring script | Efficient k-NN filtering |
|
||||
| 1M | 2.5 | 100 | Efficient k-NN filtering | Scoring script |
|
||||
| 1M | 38 | 100 | Efficient k-NN filtering | Efficient k-NN filtering/scoring script |
|
||||
| 1M | 80 | 100 | Efficient k-NN filtering | Boolean filter |
|
||||
|
||||
## Scoring script filter
|
||||
## Efficient k-NN filtering
|
||||
|
||||
A scoring script filter first filters the documents and then uses a brute-force exact k-NN search on the results. For example, the following query searches for hotels with a rating between 8 and 10, inclusive, that provide parking and then performs a k-NN search to return the 3 hotels that are closest to the specified `location`:
|
||||
You can perform efficient k-NN filtering with the `lucene` or `faiss` engines.
|
||||
|
||||
### Lucene k-NN filter implementation
|
||||
|
||||
k-NN plugin version 2.2 introduced support for running k-NN searches with the Lucene engine using HNSW graphs. Starting with version 2.4, which is based on Lucene version 9.4, you can use Lucene filters for k-NN searches.
|
||||
|
||||
When you specify a Lucene filter for a k-NN search, the Lucene algorithm decides whether to perform an exact k-NN search with pre-filtering or an approximate search with modified post-filtering. The algorithm uses the following variables:
|
||||
|
||||
- N: The number of documents in the index.
|
||||
- P: The number of documents in the document subset after the filter is applied (P <= N).
|
||||
- k: The maximum number of vectors to return in the response.
|
||||
|
||||
The following flow chart outlines the Lucene algorithm.
|
||||
|
||||
![Lucene algorithm for filtering]({{site.url}}{{site.baseurl}}/images/lucene-algorithm.png)
|
||||
|
||||
For more information about the Lucene filtering implementation and the underlying `KnnVectorQuery`, see the [Apache Lucene documentation](https://issues.apache.org/jira/browse/LUCENE-10382).
|
||||
|
||||
### Using a Lucene k-NN filter
|
||||
|
||||
Consider a dataset that includes 12 documents containing hotel information. The following image shows all hotels on an xy coordinate plane by location. Additionally, the points for hotels that have a rating between 8 and 10, inclusive, are depicted with orange dots, and hotels that provide parking are depicted with green circles. The search point is colored in red:
|
||||
|
||||
![Graph of documents with filter criteria]({{site.url}}{{site.baseurl}}/images/knn-doc-set-for-filtering.png)
|
||||
|
||||
In this example, you will create an index and search for the three hotels with high ratings and parking that are the closest to the search location.
|
||||
|
||||
**Step 1: Create a new index**
|
||||
|
||||
Before you can run a k-NN search with a filter, you need to create an index with a `knn_vector` field. For this field, you need to specify `lucene` as the engine and `hnsw` as the `method` in the mapping.
|
||||
|
||||
The following request creates a new index called `hotels-index` with a `knn_vector` field called `location`:
|
||||
|
||||
```json
|
||||
POST /hotels-index/_search
|
||||
PUT /hotels-index
|
||||
{
|
||||
"size": 3,
|
||||
"query": {
|
||||
"script_score": {
|
||||
"query": {
|
||||
"bool": {
|
||||
"filter": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{
|
||||
"range": {
|
||||
"rating": {
|
||||
"gte": 8,
|
||||
"lte": 10
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"term": {
|
||||
"parking": "true"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
"settings": {
|
||||
"index": {
|
||||
"knn": true,
|
||||
"knn.algo_param.ef_search": 100,
|
||||
"number_of_shards": 1,
|
||||
"number_of_replicas": 0
|
||||
}
|
||||
},
|
||||
"mappings": {
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "knn_vector",
|
||||
"dimension": 2,
|
||||
"method": {
|
||||
"name": "hnsw",
|
||||
"space_type": "l2",
|
||||
"engine": "lucene",
|
||||
"parameters": {
|
||||
"ef_construction": 100,
|
||||
"m": 16
|
||||
}
|
||||
}
|
||||
},
|
||||
"script": {
|
||||
"source": "knn_score",
|
||||
"lang": "knn",
|
||||
"params": {
|
||||
"field": "location",
|
||||
"query_value": [
|
||||
5.0,
|
||||
4.0
|
||||
],
|
||||
"space_type": "l2"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -88,7 +115,405 @@ POST /hotels-index/_search
|
|||
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
## Boolean filter with ANN search
|
||||
**Step 2: Add data to your index**
|
||||
|
||||
Next, add data to your index.
|
||||
|
||||
The following request adds 12 documents that contain hotel location, rating, and parking information:
|
||||
|
||||
```json
|
||||
POST /_bulk
|
||||
{ "index": { "_index": "hotels-index", "_id": "1" } }
|
||||
{ "location": [5.2, 4.4], "parking" : "true", "rating" : 5 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "2" } }
|
||||
{ "location": [5.2, 3.9], "parking" : "false", "rating" : 4 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "3" } }
|
||||
{ "location": [4.9, 3.4], "parking" : "true", "rating" : 9 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "4" } }
|
||||
{ "location": [4.2, 4.6], "parking" : "false", "rating" : 6}
|
||||
{ "index": { "_index": "hotels-index", "_id": "5" } }
|
||||
{ "location": [3.3, 4.5], "parking" : "true", "rating" : 8 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "6" } }
|
||||
{ "location": [6.4, 3.4], "parking" : "true", "rating" : 9 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "7" } }
|
||||
{ "location": [4.2, 6.2], "parking" : "true", "rating" : 5 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "8" } }
|
||||
{ "location": [2.4, 4.0], "parking" : "true", "rating" : 8 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "9" } }
|
||||
{ "location": [1.4, 3.2], "parking" : "false", "rating" : 5 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "10" } }
|
||||
{ "location": [7.0, 9.9], "parking" : "true", "rating" : 9 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "11" } }
|
||||
{ "location": [3.0, 2.3], "parking" : "false", "rating" : 6 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "12" } }
|
||||
{ "location": [5.0, 1.0], "parking" : "true", "rating" : 3 }
|
||||
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
**Step 3: Search your data with a filter**
|
||||
|
||||
Now you can create a k-NN search with filters. In the k-NN query clause, include the point of interest that is used to search for nearest neighbors, the number of nearest neighbors to return (`k`), and a filter with the restriction criteria. Depending on how restrictive you want your filter to be, you can add multiple query clauses to a single request.
|
||||
|
||||
The following request creates a k-NN query that searches for the top three hotels near the location with the coordinates `[5, 4]` that are rated between 8 and 10, inclusive, and provide parking:
|
||||
|
||||
```json
|
||||
POST /hotels-index/_search
|
||||
{
|
||||
"size": 3,
|
||||
"query": {
|
||||
"knn": {
|
||||
"location": {
|
||||
"vector": [
|
||||
5,
|
||||
4
|
||||
],
|
||||
"k": 3,
|
||||
"filter": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{
|
||||
"range": {
|
||||
"rating": {
|
||||
"gte": 8,
|
||||
"lte": 10
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"term": {
|
||||
"parking": "true"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
The response returns the three hotels that are nearest to the search point and have met the filter criteria:
|
||||
|
||||
```json
|
||||
{
|
||||
"took" : 47,
|
||||
"timed_out" : false,
|
||||
"_shards" : {
|
||||
"total" : 1,
|
||||
"successful" : 1,
|
||||
"skipped" : 0,
|
||||
"failed" : 0
|
||||
},
|
||||
"hits" : {
|
||||
"total" : {
|
||||
"value" : 3,
|
||||
"relation" : "eq"
|
||||
},
|
||||
"max_score" : 0.72992706,
|
||||
"hits" : [
|
||||
{
|
||||
"_index" : "hotels-index",
|
||||
"_id" : "3",
|
||||
"_score" : 0.72992706,
|
||||
"_source" : {
|
||||
"location" : [
|
||||
4.9,
|
||||
3.4
|
||||
],
|
||||
"parking" : "true",
|
||||
"rating" : 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"_index" : "hotels-index",
|
||||
"_id" : "6",
|
||||
"_score" : 0.3012048,
|
||||
"_source" : {
|
||||
"location" : [
|
||||
6.4,
|
||||
3.4
|
||||
],
|
||||
"parking" : "true",
|
||||
"rating" : 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"_index" : "hotels-index",
|
||||
"_id" : "5",
|
||||
"_score" : 0.24154587,
|
||||
"_source" : {
|
||||
"location" : [
|
||||
3.3,
|
||||
4.5
|
||||
],
|
||||
"parking" : "true",
|
||||
"rating" : 8
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
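
The `_score` values in the preceding response can be cross-checked by hand. The following Python sketch is not part of the k-NN plugin; it assumes that, for the `l2` space type, the score is `1 / (1 + d²)`, where `d²` is the squared Euclidean distance between the query vector and the document vector. The `l2_score` helper and the variable names are hypothetical:

```python
# Hypothetical cross-check, not plugin code: recompute the `_score` values
# from the hotel locations indexed in Step 2.

QUERY = [5.0, 4.0]

# Locations of the three returned hotels, copied from the bulk request.
RETURNED = {"3": [4.9, 3.4], "6": [6.4, 3.4], "5": [3.3, 4.5]}


def l2_score(query, vector):
    """Assumed l2 scoring: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


scores = {doc_id: l2_score(QUERY, vec) for doc_id, vec in RETURNED.items()}

for doc_id, score in scores.items():
    print(doc_id, round(score, 6))
```

Under that assumption, the recomputed values agree with the response to within single-precision rounding.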
|
||||
|
||||
For more ways to construct a filter, see [Constructing a filter](#constructing-a-filter).
|
||||
|
||||
### Faiss k-NN filter implementation
|
||||
|
||||
Starting with k-NN plugin version 2.9, you can use `faiss` filters for k-NN searches.
|
||||
|
||||
When you specify a Faiss filter for a k-NN search, the Faiss algorithm decides whether to perform an exact k-NN search with pre-filtering or an approximate search with modified post-filtering. The algorithm uses the following variables:
|
||||
|
||||
- N: The number of documents in the index.
|
||||
- P: The number of documents in the document subset after the filter is applied (P <= N).
|
||||
- k: The maximum number of vectors to return in the response.
|
||||
|
||||
The following flow chart outlines the Faiss algorithm.
|
||||
|
||||
![Faiss algorithm for filtering]({{site.url}}{{site.baseurl}}/images/faiss-algorithm.jpg)
|
||||
|
||||
### Using a Faiss efficient filter
|
||||
|
||||
Consider an index that contains information about different shirts for an e-commerce application. You want to find the top-rated shirts that are similar to the one you already have but would like to restrict the results by shirt size.
|
||||
|
||||
In this example, you will create an index and search for shirts that are similar to the shirt you provide.
|
||||
|
||||
**Step 1: Create a new index**
|
||||
|
||||
Before you can run a k-NN search with a filter, you need to create an index with a `knn_vector` field. For this field, you need to specify `faiss` as the engine and `hnsw` as the `method` in the mapping.
|
||||
|
||||
The following request creates an index that contains vector representations of shirts:
|
||||
|
||||
```json
PUT /products-shirts
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "item_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss"
        }
      }
    }
  }
}
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
**Step 2: Add data to your index**
|
||||
|
||||
Next, add data to your index.
|
||||
|
||||
The following request adds 12 documents that contain information about shirts, including their vector representation, size, and rating:
|
||||
|
||||
```json
|
||||
POST /_bulk?refresh
|
||||
{ "index": { "_index": "products-shirts", "_id": "1" } }
|
||||
{ "item_vector": [5.2, 4.4, 8.4], "size" : "large", "rating" : 5 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "2" } }
|
||||
{ "item_vector": [5.2, 3.9, 2.9], "size" : "small", "rating" : 4 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "3" } }
|
||||
{ "item_vector": [4.9, 3.4, 2.2], "size" : "xlarge", "rating" : 9 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "4" } }
|
||||
{ "item_vector": [4.2, 4.6, 5.5], "size" : "large", "rating" : 6}
|
||||
{ "index": { "_index": "products-shirts", "_id": "5" } }
|
||||
{ "item_vector": [3.3, 4.5, 8.8], "size" : "medium", "rating" : 8 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "6" } }
|
||||
{ "item_vector": [6.4, 3.4, 6.6], "size" : "small", "rating" : 9 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "7" } }
|
||||
{ "item_vector": [4.2, 6.2, 4.6], "size" : "small", "rating" : 5 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "8" } }
|
||||
{ "item_vector": [2.4, 4.0, 3.0], "size" : "small", "rating" : 8 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "9" } }
|
||||
{ "item_vector": [1.4, 3.2, 9.0], "size" : "small", "rating" : 5 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "10" } }
|
||||
{ "item_vector": [7.0, 9.9, 9.0], "size" : "xlarge", "rating" : 9 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "11" } }
|
||||
{ "item_vector": [3.0, 2.3, 2.0], "size" : "large", "rating" : 6 }
|
||||
{ "index": { "_index": "products-shirts", "_id": "12" } }
|
||||
{ "item_vector": [5.0, 1.0, 4.0], "size" : "large", "rating" : 3 }
|
||||
|
||||
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
**Step 3: Search your data with a filter**
|
||||
|
||||
Now you can create a k-NN search with filters. In the k-NN query clause, include the vector representation of the shirt that is used to search for similar ones, the number of nearest neighbors to return (`k`), and a filter by size and rating.
|
||||
|
||||
The following request searches for size small shirts rated between 7 and 10, inclusive:
|
||||
|
||||
```json
POST /products-shirts/_search
{
  "size": 2,
  "query": {
    "knn": {
      "item_vector": {
        "vector": [2, 4, 3],
        "k": 10,
        "filter": {
          "bool": {
            "must": [
              {
                "range": {
                  "rating": {
                    "gte": 7,
                    "lte": 10
                  }
                }
              },
              {
                "term": {
                  "size": "small"
                }
              }
            ]
          }
        }
      }
    }
  }
}
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
The response returns the two matching documents:
|
||||
|
||||
```json
|
||||
{
|
||||
"took": 2,
|
||||
"timed_out": false,
|
||||
"_shards": {
|
||||
"total": 1,
|
||||
"successful": 1,
|
||||
"skipped": 0,
|
||||
"failed": 0
|
||||
},
|
||||
"hits": {
|
||||
"total": {
|
||||
"value": 2,
|
||||
"relation": "eq"
|
||||
},
|
||||
"max_score": 0.8620689,
|
||||
"hits": [
|
||||
{
|
||||
"_index": "products-shirts",
|
||||
"_id": "8",
|
||||
"_score": 0.8620689,
|
||||
"_source": {
|
||||
"item_vector": [
|
||||
2.4,
|
||||
4,
|
||||
3
|
||||
],
|
||||
"size": "small",
|
||||
"rating": 8
|
||||
}
|
||||
},
|
||||
{
|
||||
"_index": "products-shirts",
|
||||
"_id": "6",
|
||||
"_score": 0.029691212,
|
||||
"_source": {
|
||||
"item_vector": [
|
||||
6.4,
|
||||
3.4,
|
||||
6.6
|
||||
],
|
||||
"size": "small",
|
||||
"rating": 9
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
For more ways to construct a filter, see [Constructing a filter](#constructing-a-filter).
|
||||
|
||||
### Constructing a filter
|
||||
|
||||
There are multiple ways to construct a filter for the same condition. For example, you can use the following constructs to create a filter that returns hotels that provide parking:
|
||||
|
||||
- A `term` query clause in the `should` clause
|
||||
- A `wildcard` query clause in the `should` clause
|
||||
- A `regexp` query clause in the `should` clause
|
||||
- A `must_not` clause to eliminate hotels with `parking` set to `false`
|
||||
|
||||
The following request illustrates these four different ways of searching for hotels with parking:
|
||||
|
||||
```json
|
||||
POST /hotels-index/_search
|
||||
{
|
||||
"size": 3,
|
||||
"query": {
|
||||
"knn": {
|
||||
"location": {
|
||||
"vector": [ 5.0, 4.0 ],
|
||||
"k": 3,
|
||||
"filter": {
|
||||
"bool": {
|
||||
"must": {
|
||||
"range": {
|
||||
"rating": {
|
||||
"gte": 1,
|
||||
"lte": 6
|
||||
}
|
||||
}
|
||||
},
|
||||
"should": [
|
||||
{
|
||||
"term": {
|
||||
"parking": "true"
|
||||
}
|
||||
},
|
||||
{
|
||||
"wildcard": {
|
||||
"parking": {
|
||||
"value": "t*e"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"regexp": {
|
||||
"parking": "[a-zA-Z]rue"
|
||||
}
|
||||
}
|
||||
],
|
||||
"must_not": [
|
||||
{
|
||||
"term": {
|
||||
"parking": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"minimum_should_match": 1
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
{% include copy-curl.html %}
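
To see why these four constructs are interchangeable for this dataset, the following Python sketch applies an approximation of each one to the two `parking` values that occur in `hotels-index`. It is an illustration only: it assumes each clause is compared against the whole field value, uses `fnmatch` as a stand-in for the `wildcard` pattern and `re.fullmatch` as a stand-in for the anchored `regexp` pattern, and the `checks` dictionary is hypothetical:

```python
import re
from fnmatch import fnmatchcase

# The only two `parking` values present in the sample data.
PARKING_VALUES = ["true", "false"]

# Rough local equivalents of the four constructs (assumptions, not the
# query engine's implementation).
checks = {
    "term": lambda value: value == "true",
    "wildcard": lambda value: fnmatchcase(value, "t*e"),
    "regexp": lambda value: re.fullmatch(r"[a-zA-Z]rue", value) is not None,
    "must_not": lambda value: value != "false",
}

# For each construct, list the values it lets through.
results = {
    name: [value for value in PARKING_VALUES if check(value)]
    for name, check in checks.items()
}

for name, selected in results.items():
    print(name, selected)
```

Each construct keeps only `"true"`, so on this data any one of them on its own selects the same hotels.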
|
||||
|
||||
## Post-filtering
|
||||
|
||||
You can achieve post-filtering with a Boolean filter or by providing the `post_filter` parameter.
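
The effect of post-filtering on the result count can be illustrated with the 12-hotel dataset used on this page. The following Python sketch is a simplification: a brute-force search stands in for the ANN search, and ranking only the matching documents stands in for filtering applied during the search. The helper names (`squared_l2`, `matches`, `nearest`) are hypothetical:

```python
# Illustrative sketch only: brute-force search stands in for ANN.
# Each entry is id -> (location, parking, rating), copied from the bulk request.
HOTELS = {
    "1": ([5.2, 4.4], "true", 5), "2": ([5.2, 3.9], "false", 4),
    "3": ([4.9, 3.4], "true", 9), "4": ([4.2, 4.6], "false", 6),
    "5": ([3.3, 4.5], "true", 8), "6": ([6.4, 3.4], "true", 9),
    "7": ([4.2, 6.2], "true", 5), "8": ([2.4, 4.0], "true", 8),
    "9": ([1.4, 3.2], "false", 5), "10": ([7.0, 9.9], "true", 9),
    "11": ([3.0, 2.3], "false", 6), "12": ([5.0, 1.0], "true", 3),
}


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matches(parking, rating):
    """Filter: provides parking and is rated between 8 and 10, inclusive."""
    return parking == "true" and 8 <= rating <= 10


def nearest(ids, query, k):
    """Return the k ids closest to the query point."""
    return sorted(ids, key=lambda i: squared_l2(HOTELS[i][0], query))[:k]


query, k = [5.0, 4.0], 3

# Post-filtering: take the k nearest hotels first, then drop non-matching ones.
post_filtered = [i for i in nearest(HOTELS, query, k) if matches(*HOTELS[i][1:])]

# Filter applied to the candidates before ranking: k results survive.
filtered_first = nearest([i for i in HOTELS if matches(*HOTELS[i][1:])], query, k)

print(post_filtered)
print(filtered_first)
```

In this sketch, two of the three nearest hotels fail the filter, so post-filtering leaves a single hit even though five hotels in the index satisfy the filter.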
|
||||
|
||||
### Boolean filter with ANN search
|
||||
|
||||
A Boolean filter consists of a Boolean query that contains a k-NN query and a filter. For example, the following query searches for hotels that are closest to the specified `location` and then filters the results to return hotels with a rating between 8 and 10, inclusive, that provide parking:
|
||||
|
||||
|
@ -198,272 +623,80 @@ The response includes documents containing the matching hotels:
|
|||
}
|
||||
```
|
||||
|
||||
The location of the `filter` clause matters when it's used with a k-NN query clause. If the `filter` clause is outside the k-NN query clause, it must be a leaf clause. In this case, the filter is applied after the k-NN search and works exactly like the `post_filter` keyword. If the `filter` clause is within the k-NN query clause, it works as a hybrid of pre- and post-filtering (this option is only supported for the Lucene search engine).
|
||||
### post-filter parameter
|
||||
|
||||
## Lucene k-NN filter implementation
|
||||
|
||||
k-NN plugin version 2.2 introduced support for running k-NN searches with the Lucene engine using HNSW graphs. Starting with version 2.4, which is based on Lucene version 9.4, you can use Lucene filters for k-NN searches.
|
||||
|
||||
When you specify a Lucene filter for a k-NN search, the Lucene algorithm decides whether to perform an exact k-NN search with pre-filtering or an approximate search with modified post-filtering. The algorithm uses the following variables:
|
||||
|
||||
- N: The number of documents in the index.
|
||||
- P: The number of documents in the document subset after the filter is applied (P <= N).
|
||||
- k: The maximum number of vectors to return in the response.
|
||||
|
||||
The following flow chart outlines the Lucene algorithm.
|
||||
|
||||
![Lucene algorithm for filtering]({{site.url}}{{site.baseurl}}/images/lucene-algorithm.png)
|
||||
|
||||
For more information about the Lucene filtering implementation and the underlying `KnnVectorQuery`, see the [Apache Lucene documentation](https://issues.apache.org/jira/browse/LUCENE-10382).
|
||||
|
||||
## Using a Lucene k-NN filter
|
||||
|
||||
Consider a dataset that includes 12 documents containing hotel information. The following image shows all hotels on an xy coordinate plane by location. Additionally, the points for hotels that have a rating between 8 and 10, inclusive, are depicted with orange dots, and hotels that provide parking are depicted with green circles. The search point is colored in red:
|
||||
|
||||
![Graph of documents with filter criteria]({{site.url}}{{site.baseurl}}/images/knn-doc-set-for-filtering.png)
|
||||
|
||||
In this example, you will create an index and search for the three hotels with high ratings and parking that are the closest to the search location.
|
||||
|
||||
### Step 1: Create a new index
|
||||
|
||||
Before you can run a k-NN search with a filter, you need to create an index with a `knn_vector` field. For this field, you need to specify `lucene` as the engine and `hnsw` as the `method` in the mapping.
|
||||
|
||||
The following request creates a new index called `hotels-index` with a `knn_vector` field called `location`:
|
||||
If you use the `knn` query alongside filters or other clauses (for example, `bool`, `must`, `match`), you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
|
||||
|
||||
```json
|
||||
PUT /hotels-index
|
||||
GET my-knn-index-1/_search
|
||||
{
|
||||
"settings": {
|
||||
"index": {
|
||||
"knn": true,
|
||||
"knn.algo_param.ef_search": 100,
|
||||
"number_of_shards": 1,
|
||||
"number_of_replicas": 0
|
||||
"size": 2,
|
||||
"query": {
|
||||
"knn": {
|
||||
"my_vector2": {
|
||||
"vector": [2, 3, 5, 6],
|
||||
"k": 2
|
||||
}
|
||||
}
|
||||
},
|
||||
"mappings": {
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "knn_vector",
|
||||
"dimension": 2,
|
||||
"method": {
|
||||
"name": "hnsw",
|
||||
"space_type": "l2",
|
||||
"engine": "lucene",
|
||||
"parameters": {
|
||||
"ef_construction": 100,
|
||||
"m": 16
|
||||
}
|
||||
}
|
||||
"post_filter": {
|
||||
"range": {
|
||||
"price": {
|
||||
"gte": 5,
|
||||
"lte": 10
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
### Step 2: Add data to your index
|
||||
## Scoring script filter
|
||||
|
||||
Next, add data to your index.
|
||||
|
||||
The following request adds 12 documents that contain hotel location, rating, and parking information:
|
||||
|
||||
```json
|
||||
POST /_bulk
|
||||
{ "index": { "_index": "hotels-index", "_id": "1" } }
|
||||
{ "location": [5.2, 4.4], "parking" : "true", "rating" : 5 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "2" } }
|
||||
{ "location": [5.2, 3.9], "parking" : "false", "rating" : 4 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "3" } }
|
||||
{ "location": [4.9, 3.4], "parking" : "true", "rating" : 9 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "4" } }
|
||||
{ "location": [4.2, 4.6], "parking" : "false", "rating" : 6}
|
||||
{ "index": { "_index": "hotels-index", "_id": "5" } }
|
||||
{ "location": [3.3, 4.5], "parking" : "true", "rating" : 8 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "6" } }
|
||||
{ "location": [6.4, 3.4], "parking" : "true", "rating" : 9 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "7" } }
|
||||
{ "location": [4.2, 6.2], "parking" : "true", "rating" : 5 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "8" } }
|
||||
{ "location": [2.4, 4.0], "parking" : "true", "rating" : 8 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "9" } }
|
||||
{ "location": [1.4, 3.2], "parking" : "false", "rating" : 5 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "10" } }
|
||||
{ "location": [7.0, 9.9], "parking" : "true", "rating" : 9 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "11" } }
|
||||
{ "location": [3.0, 2.3], "parking" : "false", "rating" : 6 }
|
||||
{ "index": { "_index": "hotels-index", "_id": "12" } }
|
||||
{ "location": [5.0, 1.0], "parking" : "true", "rating" : 3 }
|
||||
```
|
||||
{% include copy-curl.html %}
|
||||
|
||||
### Step 3: Search your data with a filter
|
||||
|
||||
Now you can create a k-NN search with filters. In the k-NN query clause, include the point of interest that is used to search for nearest neighbors, the number of nearest neighbors to return (`k`), and a filter with the restriction criteria. Depending on how restrictive you want your filter to be, you can add multiple query clauses to a single request.
|
||||
|
||||
The following request creates a k-NN query that searches for the top three hotels near the location with the coordinates `[5, 4]` that are rated between 8 and 10, inclusive, and provide parking:
|
||||
A scoring script filter first filters the documents and then uses a brute-force exact k-NN search on the results. For example, the following query searches for hotels with a rating between 8 and 10, inclusive, that provide parking and then performs a k-NN search to return the 3 hotels that are closest to the specified `location`:
|
||||
|
||||
```json
|
||||
POST /hotels-index/_search
|
||||
{
|
||||
"size": 3,
|
||||
"query": {
|
||||
"knn": {
|
||||
"location": {
|
||||
"vector": [
|
||||
5,
|
||||
4
|
||||
],
|
||||
"k": 3,
|
||||
"filter": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{
|
||||
"range": {
|
||||
"rating": {
|
||||
"gte": 8,
|
||||
"lte": 10
|
||||
"script_score": {
|
||||
"query": {
|
||||
"bool": {
|
||||
"filter": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{
|
||||
"range": {
|
||||
"rating": {
|
||||
"gte": 8,
|
||||
"lte": 10
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"term": {
|
||||
"parking": "true"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"term": {
|
||||
"parking": "true"
|
||||
}
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"script": {
|
||||
"source": "knn_score",
|
||||
"lang": "knn",
|
||||
"params": {
|
||||
"field": "location",
|
||||
"query_value": [
|
||||
5.0,
|
||||
4.0
|
||||
],
|
||||
"space_type": "l2"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
{% include copy-curl.html %}
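
The two phases of the scoring script approach can be sketched in plain Python. This is an illustration under stated assumptions, not the plugin's implementation: the `exact_knn_on_filtered` function is hypothetical, the data is the 12-hotel sample used on this page, and the score is assumed to be `1 / (1 + d²)` for the `l2` space type. The sketch also counts distance computations, because the exact search touches every document that passes the filter:

```python
# Hypothetical sketch of pre-filtering followed by an exact k-NN search.
HOTELS = [
    {"id": "1", "location": [5.2, 4.4], "parking": "true", "rating": 5},
    {"id": "2", "location": [5.2, 3.9], "parking": "false", "rating": 4},
    {"id": "3", "location": [4.9, 3.4], "parking": "true", "rating": 9},
    {"id": "4", "location": [4.2, 4.6], "parking": "false", "rating": 6},
    {"id": "5", "location": [3.3, 4.5], "parking": "true", "rating": 8},
    {"id": "6", "location": [6.4, 3.4], "parking": "true", "rating": 9},
    {"id": "7", "location": [4.2, 6.2], "parking": "true", "rating": 5},
    {"id": "8", "location": [2.4, 4.0], "parking": "true", "rating": 8},
    {"id": "9", "location": [1.4, 3.2], "parking": "false", "rating": 5},
    {"id": "10", "location": [7.0, 9.9], "parking": "true", "rating": 9},
    {"id": "11", "location": [3.0, 2.3], "parking": "false", "rating": 6},
    {"id": "12", "location": [5.0, 1.0], "parking": "true", "rating": 3},
]


def exact_knn_on_filtered(hotels, query, k):
    # Phase 1 (pre-filter): keep hotels with parking and a rating of 8 to 10.
    candidates = [
        h for h in hotels
        if h["parking"] == "true" and 8 <= h["rating"] <= 10
    ]
    # Phase 2 (exact search): score every remaining candidate.
    scored = [
        (h["id"],
         1.0 / (1.0 + sum((q - x) ** 2 for q, x in zip(query, h["location"]))))
        for h in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k], len(candidates)


hits, distance_computations = exact_knn_on_filtered(HOTELS, [5.0, 4.0], 3)
print([doc_id for doc_id, _ in hits], distance_computations)
```

The number of distance computations equals the size of the filtered subset, which is why this approach becomes expensive when the filter matches many documents.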
|
||||
|
||||
The response returns the three hotels that are nearest to the search point and have met the filter criteria:
|
||||
|
||||
```json
|
||||
{
|
||||
"took" : 47,
|
||||
"timed_out" : false,
|
||||
"_shards" : {
|
||||
"total" : 1,
|
||||
"successful" : 1,
|
||||
"skipped" : 0,
|
||||
"failed" : 0
|
||||
},
|
||||
"hits" : {
|
||||
"total" : {
|
||||
"value" : 3,
|
||||
"relation" : "eq"
|
||||
},
|
||||
"max_score" : 0.72992706,
|
||||
"hits" : [
|
||||
{
|
||||
"_index" : "hotels-index",
|
||||
"_id" : "3",
|
||||
"_score" : 0.72992706,
|
||||
"_source" : {
|
||||
"location" : [
|
||||
4.9,
|
||||
3.4
|
||||
],
|
||||
"parking" : "true",
|
||||
"rating" : 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"_index" : "hotels-index",
|
||||
"_id" : "6",
|
||||
"_score" : 0.3012048,
|
||||
"_source" : {
|
||||
"location" : [
|
||||
6.4,
|
||||
3.4
|
||||
],
|
||||
"parking" : "true",
|
||||
"rating" : 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"_index" : "hotels-index",
|
||||
"_id" : "5",
|
||||
"_score" : 0.24154587,
|
||||
"_source" : {
|
||||
"location" : [
|
||||
3.3,
|
||||
4.5
|
||||
],
|
||||
"parking" : "true",
|
||||
"rating" : 8
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Note that there are multiple ways to construct a filter that returns hotels that provide parking, for example:
|
||||
|
||||
- A `term` query clause in the `should` clause
|
||||
- A `wildcard` query clause in the `should` clause
|
||||
- A `regexp` query clause in the `should` clause
|
||||
- A `must_not` clause to eliminate hotels with `parking` set to `false`.
|
||||
|
||||
The following request illustrates these four different ways of searching for hotels with parking:
|
||||
|
||||
```json
|
||||
POST /hotels-index/_search
|
||||
{
|
||||
"size": 3,
|
||||
"query": {
|
||||
"knn": {
|
||||
"location": {
|
||||
"vector": [ 5.0, 4.0 ],
|
||||
"k": 3,
|
||||
"filter": {
|
||||
"bool": {
|
||||
"must": {
|
||||
"range": {
|
||||
"rating": {
|
||||
"gte": 1,
|
||||
"lte": 6
|
||||
}
|
||||
}
|
||||
},
|
||||
"should": [
|
||||
{
|
||||
"term": {
|
||||
"parking": "true"
|
||||
}
|
||||
},
|
||||
{
|
||||
"wildcard": {
|
||||
"parking": {
|
||||
"value": "t*e"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"regexp": {
|
||||
"parking": "[a-zA-Z]rue"
|
||||
}
|
||||
}
|
||||
],
|
||||
"must_not": [
|
||||
{
|
||||
"term": {
|
||||
"parking": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"minimum_should_match": 1
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
{% include copy-curl.html %}
|
Binary file not shown.
After Width: | Height: | Size: 26 KiB |