Function score queries

Use a function_score query if you need to alter the relevance scores of documents returned in the results. A function_score query defines a query and one or more functions that can be applied to all results or subsets of the results to recalculate their relevance scores.

Using one scoring function

The most basic example of a function_score query uses one function to recalculate the score. The following query uses a weight function to double all relevance scores. This function applies to all documents in the results because there is no query parameter specified within function_score:

GET shakespeare/_search
{
  "query": {
    "function_score": {
      "weight": "2"
    }
  }
}

{% include copy-curl.html %}

Applying the scoring function to a subset of documents

To apply the scoring function to a subset of documents, specify a query parameter within function_score. The function is then applied only to the documents that match this query:

GET shakespeare/_search
{
  "query": {
    "function_score": {
      "query": { 
        "match": {
          "play_name": "Hamlet"
        } 
      },
      "weight": "2"
    }
  }
}

{% include copy-curl.html %}

Supported functions

The function_score query type supports the following functions:

  • Built-in:
    • weight: Multiplies a document score by a predefined boost factor.
    • random_score: Provides a random score that is consistent for a single user but different between users.
    • field_value_factor: Uses the value of the specified document field to recalculate the score.
    • Decay functions (gauss, exp, and linear): Recalculates the score using a specified decay function.
  • Custom:
    • script_score: Uses a script to score documents.

The weight function

When you use the weight function, the original relevance score is multiplied by the floating-point value of weight:

GET shakespeare/_search
{
  "query": {
    "function_score": {
      "weight": "2"
    }
  }
}

{% include copy-curl.html %}

Unlike the boost value, the weight function is not normalized.

The random score function

The random_score function provides a random score that is consistent for a single user but different between users. The score is a floating-point number in the [0, 1) range. By default, the random_score function uses internal Lucene document IDs as seed values, making random values irreproducible because documents can be renumbered after merges. To achieve consistency in generating random values, you can provide seed and field parameters. The field must be a field for which fielddata is enabled (commonly, a numeric field). The score is calculated using the seed, the fielddata values for the field, and a salt calculated using the index name and shard ID. Because the index name and shard ID are the same for documents that reside in the same shard, documents with the same field values will be assigned the same score. To ensure different scores for all documents in the same shard, use a field that has unique values for all documents. One option is to use the _seq_no field. However, if you choose this field, the scores can change if the document is updated because of the corresponding _seq_no update.

The following query uses the random_score function with a seed and field:

GET blogs/_search
{
  "query": {
    "function_score": {
      "random_score": {
        "seed": 20,
        "field": "_seq_no"
      }
    }
  }
}

{% include copy-curl.html %}
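The reproducibility argument above can be illustrated with a small Python sketch. It is not the hash that OpenSearch or Lucene uses: the function name `sketch_random_score`, the SHA-256 hash, and the way the salt is built are all assumptions made for illustration. It only shows why a fixed seed, a fixed index/shard salt, and a fixed field value always yield the same score in the [0, 1) range, while a different field value yields a different score:

```python
import hashlib


def sketch_random_score(seed: int, index: str, shard_id: int, field_value: int) -> float:
    """Illustrative stand-in for random_score; NOT the hash OpenSearch uses."""
    salt = f"{index}:{shard_id}"  # plays the role of the index-name/shard-ID salt
    digest = hashlib.sha256(f"{seed}|{salt}|{field_value}".encode()).digest()
    # Keep the top 53 bits so the quotient is exact and strictly below 1.
    top53 = int.from_bytes(digest[:8], "big") >> 11
    return top53 / 2**53


same_a = sketch_random_score(20, "blogs", 0, field_value=7)
same_b = sketch_random_score(20, "blogs", 0, field_value=7)
other = sketch_random_score(20, "blogs", 0, field_value=8)
print(same_a == same_b)  # True: same seed, shard, and field value
```

In this sketch, two documents in the same shard that share a field value collide on the same score, which is why the text recommends a field with unique values.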

The field value factor function

The field_value_factor function recalculates the score using the value of the specified document field. If the field is a multi-valued field, only its first value is used for calculations, and the others are not considered.

The field_value_factor function supports the following options:

  • field: The field to use in score calculations.

  • factor: An optional factor by which the field value is multiplied. Default is 1.

  • modifier: One of the modifiers to apply to the field value $v$. The following table lists all supported modifiers.

    | Modifier | Formula | Description |
    | :--- | :--- | :--- |
    | log | $\log v$ | Take the base-10 logarithm of the value. Taking a logarithm of a non-positive number is an illegal operation and will result in an error. For values between 0 and 1 (both exclusive), this function returns negative values, which will result in an error. We recommend using log1p or log2p instead of log. |
    | log1p | $\log (1 + v)$ | Take the base-10 logarithm of the sum of 1 and the value. |
    | log2p | $\log (2 + v)$ | Take the base-10 logarithm of the sum of 2 and the value. |
    | ln | $\ln v$ | Take the natural logarithm of the value. Taking a logarithm of a non-positive number is an illegal operation and will result in an error. For values between 0 and 1 (both exclusive), this function returns negative values, which will result in an error. We recommend using ln1p or ln2p instead of ln. |
    | ln1p | $\ln (1 + v)$ | Take the natural logarithm of the sum of 1 and the value. |
    | ln2p | $\ln (2 + v)$ | Take the natural logarithm of the sum of 2 and the value. |
    | reciprocal | $\frac {1}{v}$ | Take the reciprocal of the value. |
    | square | $v^2$ | Square the value. |
    | sqrt | $\sqrt v$ | Take the square root of the value. Taking a square root of a negative number is an illegal operation and will result in an error. Ensure that $v$ is non-negative. |
    | none | N/A | Do not apply any modifier. |

  • missing: The value to use if the field is missing from the document. The factor and modifier are applied to this value instead of the missing field value.

For example, the following query uses the field_value_factor function to give more weight to the views field:

GET blogs/_search
{
  "query": {
    "function_score": {
      "field_value_factor": {
        "field": "views",
        "factor": 1.5,
        "modifier": "log1p",
        "missing": 1
      }
    }
  }
}

{% include copy-curl.html %}

The preceding query calculates the relevance score using the following formula:

$$ \text{score} = \text{original score} \cdot \log(1 + 1.5 \cdot \text{views}) $$
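To make the arithmetic concrete, here is a minimal Python sketch of the calculation. The `field_value_factor` helper and the `MODIFIERS` map are hypothetical names, not part of any OpenSearch client. The sketch assumes that the modifier is applied to the product of factor and the field value (as in the formula above) and that the result is multiplied by the original score:

```python
import math

# Modifiers from the table above; each receives v = factor * field value.
MODIFIERS = {
    "none": lambda v: v,
    "log": math.log10,
    "log1p": lambda v: math.log10(1 + v),
    "log2p": lambda v: math.log10(2 + v),
    "ln": math.log,
    "ln1p": lambda v: math.log(1 + v),
    "ln2p": lambda v: math.log(2 + v),
    "reciprocal": lambda v: 1 / v,
    "square": lambda v: v * v,
    "sqrt": math.sqrt,
}


def field_value_factor(original_score, field_value, factor=1.0,
                       modifier="none", missing=None):
    # Fall back to `missing` when the document has no value for the field.
    if field_value is None:
        field_value = missing
    return original_score * MODIFIERS[modifier](factor * field_value)


# A post with 1,200 views and an original score of 1.0: log10(1 + 1.5 * 1200)
with_views = field_value_factor(1.0, 1200, factor=1.5, modifier="log1p", missing=1)
# A post without a views field uses missing = 1: log10(1 + 1.5 * 1)
without_views = field_value_factor(1.0, None, factor=1.5, modifier="log1p", missing=1)
print(with_views, without_views)
```

Because log1p never sees a value below 1 for non-negative inputs, it cannot produce the negative scores that log and ln can.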

The script score function

Using the script_score function, you can write a custom script for scoring documents, optionally incorporating values of fields in the document. The original relevance score is accessible in the _score variable.

The calculated score cannot be negative. A negative score will result in an error. Document scores have positive 32-bit floating-point values. A score with greater precision is converted to the nearest 32-bit floating-point number. {: .important}

For example, the following query uses the script_score function to calculate the score based on the original score and the number of views and likes for the blog post. To give the number of views and likes a lesser weight, this formula takes the logarithm of the sum of views and likes. To make the logarithm valid even if the number of views and likes is 0, 1 is added to their sum:

GET blogs/_search
{
  "query": {
    "function_score": {
      "query": {"match": {"name": "opensearch"}},
      "script_score": {
        "script": "_score * Math.log(1 + doc['likes'].value + doc['views'].value)"
      }
    }
  }
}

{% include copy-curl.html %}

Scripts are compiled and cached for faster performance. Thus, it's preferable to reuse the same script and pass any parameters that the script needs:

GET blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "name": "opensearch" }
      },
      "script_score": {
        "script": {
          "params": {
            "add": 1
          },
          "source": "_score * Math.log(params.add + doc['likes'].value + doc['views'].value)"
        }
      }
    }
  }
}

{% include copy-curl.html %}
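As a rough check of what the script computes, the following Python sketch mirrors the Painless expression outside of OpenSearch. The function name and the sample inputs are illustrative; the only fact relied on is that `Math.log` is the natural logarithm:

```python
import math


def script_score(_score: float, likes: int, views: int, add: int = 1) -> float:
    # Mirrors: _score * Math.log(params.add + doc['likes'].value + doc['views'].value)
    return _score * math.log(add + likes + views)


no_engagement = script_score(1.0, likes=0, views=0)   # ln(1) = 0, so the score is 0.0
popular = script_score(2.0, likes=150, views=1200)    # 2 * ln(1351)
print(no_engagement, popular)
```

With add set to 1, a post with no likes and no views gets a score of 0 rather than an error, which is the reason for adding 1 inside the logarithm.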

Decay functions

For many applications, you need to sort the results based on proximity or recency. You can do this with decay functions. Decay functions calculate a document score using one of three decay curves: Gaussian, exponential, or linear.

Decay functions operate only on numeric, date, and geopoint fields. {: .important}

Decay functions calculate scores based on the origin, scale, offset, and decay, as shown in the following figure.

Decay function curves

Example: Geopoint fields

Suppose you're looking for a hotel near your office. You create a hotels index that maps the location field as a geopoint:

PUT hotels
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}

{% include copy-curl.html %}

You index two documents that correspond to nearby hotels:

PUT hotels/_doc/1
{
  "name": "Hotel Within 200",
  "location": { 
    "lat": 40.7105,
    "lon": 74.00
  }
}

{% include copy-curl.html %}

PUT hotels/_doc/2
{
  "name": "Hotel Outside 500",
  "location": { 
    "lat": 40.7115,
    "lon": 74.00
  }
}

{% include copy-curl.html %}

The origin defines the point from which the distance is calculated (the office location). The offset specifies the distance from the origin within which documents are given a full score of 1. You can give hotels within 200 ft of the office the same highest score. The scale defines the decay rate of the graph, and the decay defines the score to assign to a document at the scale + offset distance from the origin. Once you are outside the 200 ft radius, you may decide that if you have to walk another 300 ft to get to a hotel (scale = 300 ft), you'll assign it one quarter of the original score (decay = 0.25).

You create the following query with the origin at latitude 40.71 and longitude 74.00 (the origin string is specified in lat,lon order):

GET hotels/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "exp": {
            "location": { 
              "origin": "40.71,74.00",
              "offset": "200ft",
              "scale":  "300ft",
              "decay": 0.25
            }
          }
        }
      ]
    }
  }
}

{% include copy-curl.html %}

The response contains both hotels. The hotel within 200 ft of the office has a score of 1. The hotel outside of the 500 ft radius (offset + scale) has a score of 0.20, which is less than the decay parameter (0.25):

Response {: .text-delta}
{
  "took": 854,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "hotels",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "Hotel Within 200",
          "location": {
            "lat": 40.7105,
            "lon": 74
          }
        }
      },
      {
        "_index": "hotels",
        "_id": "2",
        "_score": 0.20099315,
        "_source": {
          "name": "Hotel Outside 500",
          "location": {
            "lat": 40.7115,
            "lon": 74
          }
        }
      }
    ]
  }
}
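The two scores in this response can be approximated by hand. The following Python sketch is an illustration, not the OpenSearch implementation: it assumes a haversine distance on a sphere with the mean Earth radius, so the last digits may differ from what the engine reports. The helper names are hypothetical:

```python
import math

EARTH_RADIUS_M = 6_371_008.8   # mean Earth radius; an assumption of this sketch
FEET_PER_METER = 1 / 0.3048


def haversine_ft(lat1, lon1, lat2, lon2):
    """Great-circle distance in feet between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a)) * FEET_PER_METER


def exp_decay(distance, offset, scale, decay):
    """Exponential decay: 1 inside the offset, `decay` at offset + scale."""
    lam = math.log(decay) / scale
    return math.exp(lam * max(0.0, distance - offset))


office = (40.71, 74.00)
hotels = {
    "Hotel Within 200": (40.7105, 74.00),
    "Hotel Outside 500": (40.7115, 74.00),
}
scores = {
    name: exp_decay(haversine_ft(*office, *loc), offset=200, scale=300, decay=0.25)
    for name, loc in hotels.items()
}
print(scores)
```

The first hotel is roughly 182 ft from the office, inside the offset, so its score is 1. The second is roughly 547 ft away, about 347 ft past the offset, which gives a score of about 0.20.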

Parameters

The following table lists all parameters supported by the gauss, exp, and linear functions.

| Parameter | Description |
| :--- | :--- |
| origin | The point from which to calculate the distance. Must be provided as a number for numeric fields, a date for date fields, or a geopoint for geopoint fields. Required for geopoint and numeric fields. Optional for date fields (defaults to now). For date fields, date math is supported (for example, now-2d). |
| offset | Defines the distance from the origin within which documents are given a score of 1. Optional. Default is 0. |
| scale | Documents at the distance of scale + offset from the origin are assigned a score of decay. Required. For numeric fields, scale can be any number. For date fields, scale can be defined as a number with units (5h, 1d). If units are not provided, scale defaults to milliseconds. For geopoint fields, scale can be defined as a number with units (1mi, 5km). If units are not provided, scale defaults to meters. |
| decay | Defines the score of a document at the distance of scale + offset from the origin. Optional. Default is 0.5. |

For fields that are missing from the document, decay functions return a score of 1. {: .note}

Example: Numeric fields

The following query uses the exponential decay function to prioritize blog posts by the number of comments:

GET blogs/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "exp": {
            "comments": { 
              "origin":  "20",
              "offset": "5",
              "scale":  "10"
            }
          }
        }
      ]
    }
  }
}

{% include copy-curl.html %}

The first two blog posts in the results have a score of 1 because one is at the origin (20 comments) and the other has 16 comments, which is a distance of 4 from the origin and therefore within the offset (documents receive a full score in the range $20 \pm 5$, that is, [15, 25]). The third blog post has 5 comments, which places it at a distance of offset + scale from the origin ($20 - (5 + 10) = 5$), so it's given the default decay score (0.5):

Response {: .text-delta}
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "blogs",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "Semantic search in OpenSearch",
          "views": 1200,
          "likes": 150,
          "comments": 16,
          "date_posted": "2022-04-17"
        }
      },
      {
        "_index": "blogs",
        "_id": "2",
        "_score": 1,
        "_source": {
          "name": "Get started with OpenSearch 2.7",
          "views": 1400,
          "likes": 100,
          "comments": 20,
          "date_posted": "2022-05-02"
        }
      },
      {
        "_index": "blogs",
        "_id": "3",
        "_score": 0.5,
        "_source": {
          "name": "Distributed tracing with Data Prepper",
          "views": 800,
          "likes": 50,
          "comments": 5,
          "date_posted": "2022-04-25"
        }
      },
      {
        "_index": "blogs",
        "_id": "4",
        "_score": 0.4352753,
        "_source": {
          "name": "A very old blog",
          "views": 100,
          "likes": 20,
          "comments": 3,
          "date_posted": "2000-04-25"
        }
      }
    ]
  }
}
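The scores in this response follow directly from the exponential decay formula. The following Python sketch (the `exp_decay` helper is a hypothetical name) recomputes them from the comments values, using the default decay of 0.5:

```python
import math


def exp_decay(value, origin, offset, scale, decay=0.5):
    """Exponential decay score for a numeric field value."""
    lam = math.log(decay) / scale
    return math.exp(lam * max(0.0, abs(value - origin) - offset))


# comments values of the four blog posts, in response order
comment_scores = {
    comments: exp_decay(comments, origin=20, offset=5, scale=10)
    for comments in (16, 20, 5, 3)
}
for comments, score in comment_scores.items():
    print(comments, round(score, 7))
```

The post with 3 comments is 17 away from the origin, 12 past the offset, so its score is 0.5 raised to the power 1.2, or about 0.435.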

Example: Date fields

The following query uses the Gaussian decay function to prioritize blog posts published around 04/24/2022:

GET blogs/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "date_posted": { 
              "origin":  "2022-04-24",
              "offset": "1d",
              "scale":  "6d",
              "decay": 0.25
            }
          }
        }
      ]
    }
  }
}

{% include copy-curl.html %}

In the results, the first blog post was published within one day of 04/24/2022, so it has the highest score of 1. The second blog post was published on 04/17/2022, which is exactly offset + scale (1d + 6d) from the origin, so its score equals decay (0.25). The third blog post was published more than 7 days after 04/24/2022, so it has a lower score. The last blog post has a score of 0 because it was published more than 20 years before the origin:

Response {: .text-delta}
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "blogs",
        "_id": "3",
        "_score": 1,
        "_source": {
          "name": "Distributed tracing with Data Prepper",
          "views": 800,
          "likes": 50,
          "comments": 5,
          "date_posted": "2022-04-25"
        }
      },
      {
        "_index": "blogs",
        "_id": "1",
        "_score": 0.25,
        "_source": {
          "name": "Semantic search in OpenSearch",
          "views": 1200,
          "likes": 150,
          "comments": 16,
          "date_posted": "2022-04-17"
        }
      },
      {
        "_index": "blogs",
        "_id": "2",
        "_score": 0.15154076,
        "_source": {
          "name": "Get started with OpenSearch 2.7",
          "views": 1400,
          "likes": 100,
          "comments": 20,
          "date_posted": "2022-05-02"
        }
      },
      {
        "_index": "blogs",
        "_id": "4",
        "_score": 0,
        "_source": {
          "name": "A very old blog",
          "views": 100,
          "likes": 20,
          "comments": 3,
          "date_posted": "2000-04-25"
        }
      }
    ]
  }
}
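The same scores can be reproduced with a short Python sketch of the Gaussian curve. It measures distances in whole days, which is a simplification made for this example (the engine works with finer-grained timestamps), and `gauss_decay` is a hypothetical helper name:

```python
import math
from datetime import date


def gauss_decay(distance, offset, scale, decay):
    """Gaussian decay: 1 inside the offset, `decay` at offset + scale."""
    sigma_sq = -(scale**2) / (2 * math.log(decay))
    beyond = max(0.0, distance - offset)
    return math.exp(-(beyond**2) / (2 * sigma_sq))


origin = date(2022, 4, 24)
posted = {
    "3": date(2022, 4, 25),
    "1": date(2022, 4, 17),
    "2": date(2022, 5, 2),
    "4": date(2000, 4, 25),
}
# Whole-day distances are an assumption of this sketch.
date_scores = {
    doc_id: gauss_decay(abs((d - origin).days), offset=1, scale=6, decay=0.25)
    for doc_id, d in posted.items()
}
print(date_scores)
```

The Gaussian curve falls off so quickly that the post from the year 2000 underflows to exactly 0 in floating-point arithmetic.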

Multi-valued fields

If the field that you specify for decay calculation contains multiple values, you can use the multi_value_mode parameter. This parameter specifies one of the following functions to determine the field value that is used for calculations:

  • min: (Default) The minimum distance from the origin.
  • max: The maximum distance from the origin.
  • avg: The average distance from the origin.
  • sum: The sum of all distances from the origin.

For example, you index a document with an array of distances:

PUT testindex/_doc/1
{
  "distances": [1, 2, 3, 4, 5]
}

The following query uses the max distance of a multi-valued field distances to calculate decay:

GET testindex/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "exp": {
            "distances": { 
              "origin":  "6",
              "offset": "5",
              "scale":  "1"
            },
            "multi_value_mode": "max"
          }
        }
      ]
    }
  }
}

{% include copy-curl.html %}

The document is given a score of 1 because the maximum distance from the origin (5, for the value 1) does not exceed the offset (5):

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "testindex",
        "_id": "1",
        "_score": 1,
        "_source": {
          "distances": [
            1,
            2,
            3,
            4,
            5
          ]
        }
      }
    ]
  }
}
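The effect of multi_value_mode is easier to see when the offset is smaller than some of the distances. The following Python sketch is an illustration under a stated assumption: the offset is subtracted from each value's distance before the mode selects a single distance. The helper name `exp_decay_multi` is hypothetical:

```python
import math

MODES = {
    "min": min,
    "max": max,
    "sum": sum,
    "avg": lambda distances: sum(distances) / len(distances),
}


def exp_decay_multi(values, origin, offset, scale, decay=0.5, mode="min"):
    # Distance of each value from the origin, after discounting the offset
    # (assumption: the offset is applied per value, before the mode).
    distances = [max(0.0, abs(v - origin) - offset) for v in values]
    lam = math.log(decay) / scale
    return math.exp(lam * MODES[mode](distances))


values = [1, 2, 3, 4, 5]
as_in_query = exp_decay_multi(values, origin=6, offset=5, scale=1, mode="max")
smaller_offset_max = exp_decay_multi(values, origin=6, offset=2, scale=1, mode="max")
smaller_offset_min = exp_decay_multi(values, origin=6, offset=2, scale=1, mode="min")
print(as_in_query, smaller_offset_max, smaller_offset_min)
```

With the query's offset of 5, every value lies within the offset, so the score is 1 regardless of the mode. With an offset of 2, max selects the value 1 (3 past the offset, for a score of about 0.125), while min selects a value inside the offset and keeps the score at 1.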

Decay curve calculation

The following formulas define score computation for various decay functions ($v$ denotes the document field value).

Gaussian

$$ \text{score} = \exp \left(-\frac {(\max(0, \lvert v - \text{origin} \rvert - \text{offset}))^2} {2\sigma^2} \right), $$

where $\sigma$ is calculated to ensure that the score is equal to decay at the distance offset + scale from the origin:

$$ \sigma^2 = - \frac {\text{scale}^2} {2 \ln(\text{decay})} $$

Exponential

$$ \text{score} = \exp (\lambda \cdot \max(0, \lvert v - \text{origin} \rvert - \text{offset})), $$

where $\lambda$ is calculated to ensure that the score is equal to decay at the distance offset + scale from the origin:

$$ \lambda = \frac {\ln(\text{decay})} {\text{scale}} $$

Linear

$$ \text{score} = \max \left(\frac {s - \max(0, \lvert v - \text{origin} \rvert - \text{offset})} {s}, 0 \right), $$

where $s$ is calculated to ensure that the score is equal to decay at the distance offset + scale from the origin:

$$ s = \frac {\text{scale}} {1 - \text{decay}} $$
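The three curves can be compared side by side with a minimal Python sketch. The function names are illustrative, and the linear curve is clamped at 0 so that the score never becomes negative. Each function returns 1 inside the offset and exactly decay at the distance offset + scale:

```python
import math


def beyond_offset(v, origin, offset):
    return max(0.0, abs(v - origin) - offset)


def gauss(v, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -(scale**2) / (2 * math.log(decay))
    d = beyond_offset(v, origin, offset)
    return math.exp(-(d * d) / (2 * sigma_sq))


def exp(v, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * beyond_offset(v, origin, offset))


def linear(v, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1 - decay)
    # Clamp at 0: past the distance offset + s the score stays at 0.
    return max((s - beyond_offset(v, origin, offset)) / s, 0.0)


params = dict(origin=0, scale=10, offset=5, decay=0.5)
at_scale = {f.__name__: f(15, **params) for f in (gauss, exp, linear)}   # offset + scale
far_away = {f.__name__: f(200, **params) for f in (gauss, exp, linear)}
print(at_scale)
print(far_away)
```

Far from the origin the curves differ: the linear curve reaches exactly 0 at a finite distance, while the Gaussian and exponential curves only approach 0.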

Using multiple scoring functions

You can specify multiple scoring functions in a function score query by listing them in the functions array.

Combining scores from multiple functions

Different functions can use different scales for scoring. For example, the random_score function provides a score between 0 and 1, but the field_value_factor function does not have a specific scale for the score. Additionally, you may want to weight scores given by different functions differently. To adjust scores for different functions, you can specify the weight parameter for each function. The score given by each function is then multiplied by the weight to produce the final score for that function. The weight parameter must be provided in the functions array in order to differentiate it from the weight function.

The scores given by each function are combined using the score_mode parameter, which takes one of the following values:

  • multiply: (Default) Scores are multiplied.
  • sum: Scores are added.
  • avg: Scores are averaged. If weight is specified, this is a weighted average. For example, if the first function with the weight 1 returns the score 10, and the second function with the weight 4 returns the score 20, the average is calculated as $\frac {10 \cdot 1 + 20 \cdot 4}{1 + 4} = 18$.
  • first: The score from the first function that has a matching filter is taken.
  • max: The maximum score is taken.
  • min: The minimum score is taken.
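The score_mode options can be summarized in a short Python sketch. The `combine` helper is a hypothetical name, and the sketch assumes that `scores` already contains only the functions whose filters matched the document, in the order they appear in the functions array:

```python
import math


def combine(scores, weights=None, score_mode="multiply"):
    """Combine per-function scores the way score_mode describes."""
    weights = weights or [1.0] * len(scores)
    weighted = [s * w for s, w in zip(scores, weights)]
    if score_mode == "multiply":
        return math.prod(weighted)
    if score_mode == "sum":
        return sum(weighted)
    if score_mode == "avg":
        return sum(weighted) / sum(weights)  # weighted average
    if score_mode == "first":
        return weighted[0]
    if score_mode == "max":
        return max(weighted)
    if score_mode == "min":
        return min(weighted)
    raise ValueError(f"unknown score_mode: {score_mode}")


# The example from the avg bullet: scores 10 and 20 with weights 1 and 4
weighted_avg = combine([10, 20], weights=[1, 4], score_mode="avg")
print(weighted_avg)  # 18.0
```

Note that avg divides by the sum of the weights, not by the number of functions, which is why the result is 18 rather than 45.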

Specifying an upper limit for a score

You can specify an upper limit for a function score in the max_boost parameter. The default upper limit is the maximum magnitude for a float value: $(2 - 2^{-23}) \cdot 2^{127}$.
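As a quick check, the default limit is the largest finite 32-bit float. The following Python snippet evaluates the expression $(2 - 2^{-23}) \cdot 2^{127}$ and compares it with the bit pattern of the largest finite single-precision value:

```python
import struct

default_max_boost = (2 - 2**-23) * 2.0**127
# 0x7f7fffff is the bit pattern of the largest finite IEEE 754 single-precision float.
float32_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

print(default_max_boost)                  # 3.4028234663852886e+38
print(default_max_boost == float32_max)   # True
```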

Combining the score for all functions with the query score

You can specify how the score computed using all functions is combined with the query score in the boost_mode parameter, which takes one of the following values:

  • multiply: (Default) Multiply the query score by the function score.
  • replace: Ignore the query score and use the function score.
  • sum: Add the query score and the function score.
  • avg: Average the query score and the function score.
  • max: Take the greater of the query score and the function score.
  • min: Take the lesser of the query score and the function score.
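The interaction between max_boost and boost_mode can be sketched in a few lines of Python. The `final_score` helper is a hypothetical name, and the sketch assumes that the function score is capped at max_boost before it is combined with the query score:

```python
FLOAT32_MAX = (2 - 2**-23) * 2.0**127  # default max_boost

BOOST_MODES = {
    "multiply": lambda q, f: q * f,
    "replace": lambda q, f: f,
    "sum": lambda q, f: q + f,
    "avg": lambda q, f: (q + f) / 2,
    "max": max,
    "min": min,
}


def final_score(query_score, function_score, boost_mode="multiply",
                max_boost=FLOAT32_MAX):
    capped = min(function_score, max_boost)  # assumption: cap applied before combining
    return BOOST_MODES[boost_mode](query_score, capped)


# Query score 2.0, combined function score 18.0, max_boost 10
results = {mode: final_score(2.0, 18.0, mode, max_boost=10) for mode in BOOST_MODES}
print(results)
```

In this example, the function score of 18 is first limited to 10, so multiply yields 20 and replace yields 10.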

Filtering documents that don't meet a threshold

Changing the relevance score does not change the list of matching documents. To exclude some documents that don't meet a threshold, specify the threshold value in the min_score parameter. All documents returned by the query are then scored and filtered using the threshold value.

Example

The following request searches for blog posts that include the words "OpenSearch Data Prepper", preferring the posts published around 04/24/2022. Additionally, the numbers of views and likes are taken into consideration. Finally, the cutoff threshold is set at a score of 10:

GET blogs/_search
{
  "query": {
    "function_score": {
      "boost": "5", 
      "functions": [
        {
          "gauss": {
            "date_posted": {
              "origin": "2022-04-24",
              "offset": "1d",
              "scale": "6d"
            }
          }, 
          "weight": 1
        },
        {
          "gauss": {
            "likes": {
              "origin": 200,
              "scale": 200
            }
          }, 
          "weight": 4
        },
        {
          "gauss": {
            "views": {
              "origin": 1000,
              "scale": 800
            }
          }, 
          "weight": 2
        }
      ],
      "query": {
        "match": {
          "name": "opensearch data prepper"
        }
      },
      "max_boost": 10,
      "score_mode": "max",
      "boost_mode": "multiply",
      "min_score": 10
    }
  }
}

{% include copy-curl.html %}

The results contain the three matching blog posts:

Response {: .text-delta}
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 31.191923,
    "hits": [
      {
        "_index": "blogs",
        "_id": "3",
        "_score": 31.191923,
        "_source": {
          "name": "Distributed tracing with Data Prepper",
          "views": 800,
          "likes": 50,
          "comments": 5,
          "date_posted": "2022-04-25"
        }
      },
      {
        "_index": "blogs",
        "_id": "1",
        "_score": 13.907352,
        "_source": {
          "name": "Semantic search in OpenSearch",
          "views": 1200,
          "likes": 150,
          "comments": 16,
          "date_posted": "2022-04-17"
        }
      },
      {
        "_index": "blogs",
        "_id": "2",
        "_score": 11.150461,
        "_source": {
          "name": "Get started with OpenSearch 2.7",
          "views": 1400,
          "likes": 100,
          "comments": 20,
          "date_posted": "2022-05-02"
        }
      }
    ]
  }
}