Sparse encoding

The sparse_encoding processor generates a sparse vector, represented as a set of tokens and their weights, from text fields. The sparse vector is used for neural search with sparse retrieval.

PREREQUISITE
Before using the sparse_encoding processor, you must set up a machine learning (ML) model. For more information, see Using custom models within OpenSearch and Semantic search. {: .note}

The following is the syntax for the sparse_encoding processor:

{
  "sparse_encoding": {
    "model_id": "<model_id>",
    "field_map": {
      "<input_field>": "<vector_field>"
    }
  }
}

{% include copy-curl.html %}

Configuration parameters

The following table lists the required and optional parameters for the sparse_encoding processor.

Name	Data type	Required	Description
`model_id`	String	Required	The ID of the model that will be used to generate the embeddings. The model must be deployed in OpenSearch before it can be used in neural search. For more information, see Using custom models within OpenSearch and Semantic search.
`field_map`	Object	Required	Contains key-value pairs that specify the mapping of a text field to a `rank_features` field.
`field_map.<input_field>`	String	Required	The name of the field from which to obtain text for generating vector embeddings.
`field_map.<vector_field>`	String	Required	The name of the `rank_features` field in which to store the generated sparse vector (tokens and weights).
`description`	String	Optional	A brief description of the processor.
`tag`	String	Optional	An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type.
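As an illustration of how the required and optional parameters fit together, the following Python sketch assembles a processor definition as a plain dictionary. The helper name `build_sparse_encoding_processor` and the tag value are hypothetical and not part of any OpenSearch client; the model ID is the example ID used elsewhere on this page.

```python
import json


def build_sparse_encoding_processor(model_id, field_map, description=None, tag=None):
    """Return a dict describing one sparse_encoding processor (hypothetical helper)."""
    # model_id and field_map are the two required parameters from the table above.
    if not model_id:
        raise ValueError("model_id is required")
    if not field_map:
        raise ValueError("field_map must contain at least one input/output field pair")

    body = {"model_id": model_id, "field_map": dict(field_map)}
    # description and tag are optional, so include them only when provided.
    if description is not None:
        body["description"] = description
    if tag is not None:
        body["tag"] = tag
    return {"sparse_encoding": body}


processor = build_sparse_encoding_processor(
    model_id="aP2Q8ooBpBj3wT4HVS8a",
    field_map={"passage_text": "passage_embedding"},
    tag="sparse-passage",  # illustrative tag value
)
print(json.dumps(processor, indent=2))
```

The resulting dictionary can be placed in the `processors` array of an ingest pipeline definition.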

Using the processor

Follow these steps to use the processor in a pipeline. You must provide a model ID when creating the processor. For more information, see Using custom models within OpenSearch.

Step 1: Create a pipeline.

The following example request creates an ingest pipeline in which the text from passage_text is converted into a sparse vector of tokens and weights, which is then stored in passage_embedding:

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "aP2Q8ooBpBj3wT4HVS8a",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}

{% include copy-curl.html %}
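If you prefer to send this request from a script, the following Python sketch builds the same PUT request using only the standard library. It assumes an unsecured cluster reachable at `http://localhost:9200`; the URL and the function name `make_pipeline_request` are assumptions for illustration, and a secured cluster would also need authentication and TLS settings.

```python
import json
import urllib.request

# Assumption: a local cluster without security enabled.
OPENSEARCH_URL = "http://localhost:9200"


def make_pipeline_request(pipeline_name, model_id, field_map, base_url=OPENSEARCH_URL):
    """Build (but do not send) the PUT request that creates the ingest pipeline."""
    body = {
        "description": "A sparse encoding ingest pipeline",
        "processors": [
            {"sparse_encoding": {"model_id": model_id, "field_map": field_map}}
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{pipeline_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = make_pipeline_request(
    "nlp-ingest-pipeline",
    "aP2Q8ooBpBj3wT4HVS8a",
    {"passage_text": "passage_embedding"},
)
print(request.get_method(), request.full_url)

# To send the request against a running cluster, uncomment the following lines:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the URL and body before anything reaches the cluster.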

Step 2 (Optional): Test the pipeline.

It is recommended that you test your pipeline before you ingest documents. {: .tip}

To test the pipeline, send the following request:

POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex1",
      "_id": "1",
      "_source":{
         "passage_text": "hello world"
      }
    }
  ]
}

{% include copy-curl.html %}

Response

The response confirms that, in addition to the passage_text field, the processor has generated a sparse vector of tokens and weights in the passage_embedding field:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "testindex1",
        "_id" : "1",
        "_source" : {
          "passage_embedding" : {
            "!" : 0.8708904,
            "door" : 0.8587369,
            "hi" : 2.3929274,
            "worlds" : 2.7839446,
            "yes" : 0.75845814,
            "##world" : 2.5432441,
            "born" : 0.2682308,
            "nothing" : 0.8625516,
            "goodbye" : 0.17146169,
            "greeting" : 0.96817183,
            "birth" : 1.2788506,
            "come" : 0.1623208,
            "global" : 0.4371151,
            "it" : 0.42951578,
            "life" : 1.5750692,
            "thanks" : 0.26481047,
            "world" : 4.7300377,
            "tiny" : 0.5462298,
            "earth" : 2.6555297,
            "universe" : 2.0308156,
            "worldwide" : 1.3903781,
            "hello" : 6.696973,
            "so" : 0.20279501,
            "?" : 0.67785245
          },
          "passage_text" : "hello world"
        },
        "_ingest" : {
          "timestamp" : "2023-10-11T22:35:53.654650086Z"
        }
      }
    }
  ]
}
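To get a feel for which tokens dominate the generated sparse vector, you can sort the token weights in the simulate response. The following Python sketch does this for an abridged copy of the response above; the function name `top_tokens` is hypothetical, and the sketch assumes the response has already been parsed into a dictionary.

```python
def top_tokens(simulate_response, vector_field, k=5):
    """Return the k highest-weighted (token, weight) pairs from the first simulated doc."""
    source = simulate_response["docs"][0]["doc"]["_source"]
    weights = source[vector_field]
    # Sort by weight, highest first, and keep the first k entries.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


# Abridged copy of the simulate response shown above (only some tokens retained).
response = {
    "docs": [
        {
            "doc": {
                "_index": "testindex1",
                "_id": "1",
                "_source": {
                    "passage_embedding": {
                        "hi": 2.3929274,
                        "worlds": 2.7839446,
                        "##world": 2.5432441,
                        "world": 4.7300377,
                        "earth": 2.6555297,
                        "hello": 6.696973,
                        "so": 0.20279501,
                    },
                    "passage_text": "hello world",
                },
            }
        }
    ]
}

for token, weight in top_tokens(response, "passage_embedding", k=3):
    print(f"{token}: {weight}")
# Prints:
# hello: 6.696973
# world: 4.7300377
# worlds: 2.7839446
```

The highest weights belong to the tokens that appear in the input text, while related tokens such as worlds and earth receive smaller weights.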

Next steps

To learn how to use the neural_sparse query for a sparse search, see Neural sparse query.
To learn more about sparse neural search, see Sparse search.
To learn more about using models in OpenSearch, see Using custom models within OpenSearch.
For a semantic search tutorial, see Semantic search.
