---
layout: default
title: Text search
nav_order: 10
has_children: false
parent: Neural search
---

# Text search

Use text search to search text data. In neural search, text search is facilitated by text embedding models. A text embedding model converts text into a dense vector (a list of floats), and the vector is ingested into a k-NN index.

**PREREQUISITE**<br>
Before using text search, you must set up a text embedding model. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/).
{: .note}

## Using text search

To use text search, follow these steps:

1. [Create an ingest pipeline](#step-1-create-an-ingest-pipeline).
1. [Create an index for ingestion](#step-2-create-an-index-for-ingestion).
1. [Ingest documents into the index](#step-3-ingest-documents-into-the-index).
1. [Search the index using neural search](#step-4-search-the-index-using-neural-search).

## Step 1: Create an ingest pipeline

To generate vector embeddings, you need to create an [ingest pipeline]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/index/) that contains a [`text_embedding` processor]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/processors/text-embedding/), which will convert the text in a document field to vector embeddings. The processor's `field_map` determines the input fields from which to generate vector embeddings and the output fields in which to store the embeddings.

The following example request creates an ingest pipeline where the text from `passage_text` will be converted into text embeddings and the embeddings will be stored in `passage_embedding`:

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "A text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}

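Before indexing any documents, you can optionally check that the pipeline produces embeddings by using the Simulate Pipeline API. The following is a minimal sketch that assumes the `nlp-ingest-pipeline` from the preceding request and a deployed model; the `_index` and `_id` values and the sample text are placeholders chosen for illustration:

```json
POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_index": "my-nlp-index",
      "_id": "1",
      "_source": {
        "passage_text": "Hello world"
      }
    }
  ]
}
```
{% include copy-curl.html %}

If the model is available, the simulated document in the response is expected to contain a `passage_embedding` field in addition to `passage_text`.
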
## Step 2: Create an index for ingestion

To use the text embedding processor defined in your pipeline, create a k-NN index and specify the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as the correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.

The following example request creates a k-NN index that is set up with a default ingest pipeline:

```json
PUT /my-nlp-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
```
{% include copy-curl.html %}

For more information about creating a k-NN index and its supported methods, see [k-NN index]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/).

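To choose the `dimension` value, you can retrieve the model's details from ML Commons. The following sketch assumes the model ID used in the ingest pipeline example. For locally registered text embedding models, the response typically reports the vector size in `model_config.embedding_dimension`; whether this field is present depends on how the model was registered:

```json
GET /_plugins/_ml/models/bQ1J8ooBpBj3wT4HVUsb
```
{% include copy-curl.html %}

If the reported value differs from 768, use the reported value as the `dimension` of the `passage_embedding` field.
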
## Step 3: Ingest documents into the index

To ingest documents into the index created in the previous step, send the following requests:

```json
PUT /my-nlp-index/_doc/1
{
  "passage_text": "Hello world",
  "id": "s1"
}
```
{% include copy-curl.html %}

```json
PUT /my-nlp-index/_doc/2
{
  "passage_text": "Hi planet",
  "id": "s2"
}
```
{% include copy-curl.html %}

Before the document is ingested into the index, the ingest pipeline runs the `text_embedding` processor on the document, generating text embeddings for the `passage_text` field. The indexed document includes the `passage_text` field, which contains the original text, and the `passage_embedding` field, which contains the vector embeddings.

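To confirm that the embeddings were generated, you can retrieve one of the ingested documents. This sketch assumes the index and document ID from the preceding requests:

```json
GET /my-nlp-index/_doc/1
```
{% include copy-curl.html %}

The `_source` of the returned document is expected to contain a `passage_embedding` array whose length equals the `dimension` configured in the index mapping.
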
## Step 4: Search the index using neural search

To perform vector search on your index, use the `neural` query clause in either the [k-NN plugin API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api/#search-model) or [Query DSL]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/index/) queries. You can refine the results by using a [k-NN search filter]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/).

The following example request uses a Boolean query to combine a filter clause and two query clauses: a neural query and a `match` query. The `script_score` query assigns custom weights to the query clauses:

```json
GET /my-nlp-index/_search
{
  "_source": {
    "excludes": [
      "passage_embedding"
    ]
  },
  "query": {
    "bool": {
      "filter": {
        "wildcard": { "id": "*1" }
      },
      "should": [
        {
          "script_score": {
            "query": {
              "neural": {
                "passage_embedding": {
                  "query_text": "Hi world",
                  "model_id": "bQ1J8ooBpBj3wT4HVUsb",
                  "k": 100
                }
              }
            },
            "script": {
              "source": "_score * 1.5"
            }
          }
        },
        {
          "script_score": {
            "query": {
              "match": {
                "passage_text": "Hi world"
              }
            },
            "script": {
              "source": "_score * 1.7"
            }
          }
        }
      ]
    }
  }
}
```
{% include copy-curl.html %}

The response contains the matching document:

```json
{
  "took" : 36,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2251667,
    "hits" : [
      {
        "_index" : "my-nlp-index",
        "_id" : "1",
        "_score" : 1.2251667,
        "_source" : {
          "passage_text" : "Hello world",
          "id" : "s1"
        }
      }
    ]
  }
}
```

## Setting a default model on an index or field

A [`neural`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural/) query requires a model ID for generating vector embeddings. To avoid passing the model ID with each neural query request, you can set a default model on a k-NN index or a field.

First, create a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) with a [`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) request processor. To set a default model for an index, provide the model ID in the `default_model_id` parameter. To set a default model for a specific field, provide the field name and the corresponding model ID in the `neural_field_default_id` map. If you provide both `default_model_id` and `neural_field_default_id`, `neural_field_default_id` takes precedence:

```json
PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher" : {
        "default_model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "neural_field_default_id": {
          "my_field_1": "uZj0qYoBMtvQlfhaYeud",
          "my_field_2": "upj0qYoBMtvQlfhaZOuM"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}

Then set the default model for your index:

```json
PUT /my-nlp-index/_settings
{
  "index.search.default_pipeline" : "default_model_pipeline"
}
```
{% include copy-curl.html %}

You can now omit the model ID when searching:

```json
GET /my-nlp-index/_search
{
  "_source": {
    "excludes": [
      "passage_embedding"
    ]
  },
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "Hi world",
        "k": 100
      }
    }
  }
}
```
{% include copy-curl.html %}

The response contains both documents:

```json
{
  "took" : 41,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.22762,
    "hits" : [
      {
        "_index" : "my-nlp-index",
        "_id" : "2",
        "_score" : 1.22762,
        "_source" : {
          "passage_text" : "Hi planet",
          "id" : "s2"
        }
      },
      {
        "_index" : "my-nlp-index",
        "_id" : "1",
        "_score" : 1.2251667,
        "_source" : {
          "passage_text" : "Hello world",
          "id" : "s1"
        }
      }
    ]
  }
}
```
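
As an alternative to setting a default search pipeline on the index, you can apply the pipeline to a single request by using the `search_pipeline` query parameter. The following sketch assumes the `default_model_pipeline` created earlier in this section and reuses the example index and query text:

```json
GET /my-nlp-index/_search?search_pipeline=default_model_pipeline
{
  "_source": {
    "excludes": [
      "passage_embedding"
    ]
  },
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "Hi world",
        "k": 100
      }
    }
  }
}
```
{% include copy-curl.html %}

This approach can be useful for trying out a pipeline before making it the index default.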