mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-02-06 04:58:50 +00:00
787acb14b9
This commit changes the default for the `track_total_hits` option of the search request to `10,000`. This means that by default search requests will accurately track the total hit count up to `10,000` documents, requests that match more than this value will set the `"total.relation"` to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response. Scroll queries are not impacted, they will continue to count the total hits accurately. The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request. I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate. Closes #33028
234 lines
5.8 KiB
Plaintext
234 lines
5.8 KiB
Plaintext
[[query-dsl-rank-feature-query]]
|
|
=== Rank Feature Query
|
|
|
|
The `rank_feature` query is a specialized query that only works on
|
|
<<rank-feature,`rank_feature`>> fields and <<rank-features,`rank_features`>> fields.
|
|
Its goal is to boost the score of documents based on the values of numeric
|
|
features. It is typically put in a `should` clause of a
|
|
<<query-dsl-bool-query,`bool`>> query so that its score is added to the score
|
|
of the query.
|
|
|
|
Compared to using <<query-dsl-function-score-query,`function_score`>> or other
|
|
ways to modify the score, this query has the benefit of being able to
|
|
efficiently skip non-competitive hits when
|
|
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be
|
|
spectacular.
|
|
|
|
Here is an example that indexes various features:
|
|
- https://en.wikipedia.org/wiki/PageRank[`pagerank`], a measure of the
|
|
importance of a website,
|
|
- `url_length`, the length of the url, which typically correlates negatively
|
|
with relevance,
|
|
- `topics`, which associates a list of topics with every document alongside a
|
|
measure of how well the document is connected to this topic.
|
|
|
|
Then the example includes an example query that searches for `"2016"` and boosts
|
|
based or `pagerank`, `url_length` and the `sports` topic.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT test
|
|
{
|
|
"mappings": {
|
|
"properties": {
|
|
"pagerank": {
|
|
"type": "rank_feature"
|
|
},
|
|
"url_length": {
|
|
"type": "rank_feature",
|
|
"positive_score_impact": false
|
|
},
|
|
"topics": {
|
|
"type": "rank_features"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
PUT test/_doc/1
|
|
{
|
|
"url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
|
|
"content": "Rio 2016",
|
|
"pagerank": 50.3,
|
|
"url_length": 42,
|
|
"topics": {
|
|
"sports": 50,
|
|
"brazil": 30
|
|
}
|
|
}
|
|
|
|
PUT test/_doc/2
|
|
{
|
|
"url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
|
|
"content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
|
|
"pagerank": 50.3,
|
|
"url_length": 47,
|
|
"topics": {
|
|
"sports": 35,
|
|
"formula one": 65,
|
|
"brazil": 20
|
|
}
|
|
}
|
|
|
|
PUT test/_doc/3
|
|
{
|
|
"url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
|
|
"content": "Deadpool is a 2016 American superhero film",
|
|
"pagerank": 50.3,
|
|
"url_length": 37,
|
|
"topics": {
|
|
"movies": 60,
|
|
"super hero": 65
|
|
}
|
|
}
|
|
|
|
POST test/_refresh
|
|
|
|
GET test/_search
|
|
{
|
|
"query": {
|
|
"bool": {
|
|
"must": [
|
|
{
|
|
"match": {
|
|
"content": "2016"
|
|
}
|
|
}
|
|
],
|
|
"should": [
|
|
{
|
|
"rank_feature": {
|
|
"field": "pagerank"
|
|
}
|
|
},
|
|
{
|
|
"rank_feature": {
|
|
"field": "url_length",
|
|
"boost": 0.1
|
|
}
|
|
},
|
|
{
|
|
"rank_feature": {
|
|
"field": "topics.sports",
|
|
"boost": 0.4
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
[float]
|
|
=== Supported functions
|
|
|
|
The `rank_feature` query supports 3 functions in order to boost scores using the
|
|
values of rank features. If you do not know where to start, we recommend that you
|
|
start with the `saturation` function, which is the default when no function is
|
|
provided.
|
|
|
|
[float]
|
|
==== Saturation
|
|
|
|
This function gives a score that is equal to `S / (S + pivot)` where `S` is the
|
|
value of the rank feature and `pivot` is a configurable pivot value so that the
|
|
result will be less than +0.5+ if `S` is less than pivot and greater than +0.5+
|
|
otherwise. Scores are always is +(0, 1)+.
|
|
|
|
If the rank feature has a negative score impact then the function will be computed as
|
|
`pivot / (S + pivot)`, which decreases when `S` increases.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET test/_search
|
|
{
|
|
"query": {
|
|
"rank_feature": {
|
|
"field": "pagerank",
|
|
"saturation": {
|
|
"pivot": 8
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
If +pivot+ is not supplied then Elasticsearch will compute a default value that
|
|
will be approximately equal to the geometric mean of all feature values that
|
|
exist in the index. We recommend this if you haven't had the opportunity to
|
|
train a good pivot value.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET test/_search
|
|
{
|
|
"query": {
|
|
"rank_feature": {
|
|
"field": "pagerank",
|
|
"saturation": {}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
[float]
|
|
==== Logarithm
|
|
|
|
This function gives a score that is equal to `log(scaling_factor + S)` where
|
|
`S` is the value of the rank feature and `scaling_factor` is a configurable scaling
|
|
factor. Scores are unbounded.
|
|
|
|
This function only supports rank features that have a positive score impact.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET test/_search
|
|
{
|
|
"query": {
|
|
"rank_feature": {
|
|
"field": "pagerank",
|
|
"log": {
|
|
"scaling_factor": 4
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
[float]
|
|
==== Sigmoid
|
|
|
|
This function is an extension of `saturation` which adds a configurable
|
|
exponent. Scores are computed as `S^exp^ / (S^exp^ + pivot^exp^)`. Like for the
|
|
`saturation` function, `pivot` is the value of `S` that gives a score of +0.5+
|
|
and scores are in +(0, 1)+.
|
|
|
|
`exponent` must be positive, but is typically in +[0.5, 1]+. A good value should
|
|
be computed via training. If you don't have the opportunity to do so, we recommend
|
|
that you stick to the `saturation` function instead.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET test/_search
|
|
{
|
|
"query": {
|
|
"rank_feature": {
|
|
"field": "pagerank",
|
|
"sigmoid": {
|
|
"pivot": 7,
|
|
"exponent": 0.6
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|