[[query-dsl-rank-feature-query]]
=== Rank Feature Query

The `rank_feature` query is a specialized query that only works on
<<rank-feature,`rank_feature`>> fields and <<rank-features,`rank_features`>> fields.
Its goal is to boost the score of documents based on the values of numeric
features. It is typically put in a `should` clause of a
<<query-dsl-bool-query,`bool`>> query so that its score is added to the score
of the query.

Compared to using <<query-dsl-function-score-query,`function_score`>> or other
ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is not set to `true`. The resulting
speedups can be substantial.

Here is an example that indexes various features:

 - https://en.wikipedia.org/wiki/PageRank[`pagerank`], a measure of the
   importance of a website,
 - `url_length`, the length of the URL, which typically correlates negatively
   with relevance,
 - `topics`, which associates a list of topics with every document alongside a
   measure of how well the document is connected to this topic.

The example then runs a query that searches for `2016` and boosts results
based on `pagerank`, `url_length` and the `sports` topic.

[source,js]
--------------------------------------------------
PUT test
{
  "mappings": {
    "properties": {
      "pagerank": {
        "type": "rank_feature"
      },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      },
      "topics": {
        "type": "rank_features"
      }
    }
  }
}

PUT test/_doc/1
{
  "url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
  "content": "Rio 2016",
  "pagerank": 50.3,
  "url_length": 42,
  "topics": {
    "sports": 50,
    "brazil": 30
  }
}

PUT test/_doc/2
{
  "url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
  "content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
  "pagerank": 50.3,
  "url_length": 47,
  "topics": {
    "sports": 35,
    "formula one": 65,
    "brazil": 20
  }
}

PUT test/_doc/3
{
  "url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
  "content": "Deadpool is a 2016 American superhero film",
  "pagerank": 50.3,
  "url_length": 37,
  "topics": {
    "movies": 60,
    "super hero": 65
  }
}

POST test/_refresh

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank"
          }
        },
        {
          "rank_feature": {
            "field": "url_length",
            "boost": 0.1
          }
        },
        {
          "rank_feature": {
            "field": "topics.sports",
            "boost": 0.4
          }
        }
      ]
    }
  }
}
--------------------------------------------------
// CONSOLE

[float]
=== Supported functions

The `rank_feature` query supports three functions for boosting scores using the
values of rank features. If you do not know where to start, we recommend the
`saturation` function, which is the default when no function is provided.

[float]
==== Saturation

This function gives a score equal to `S / (S + pivot)`, where `S` is the
value of the rank feature and `pivot` is a configurable pivot value. The
result is less than +0.5+ if `S` is less than `pivot`, exactly +0.5+ if `S`
equals `pivot`, and greater than +0.5+ otherwise. Scores are always in +(0, 1)+.

If the rank feature has a negative score impact then the function is computed as
`pivot / (S + pivot)`, which decreases when `S` increases.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": {
        "pivot": 8
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

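To get a feel for how the saturation formula behaves, here is a minimal Python sketch of the arithmetic only. It is not how Elasticsearch implements the function: the helper name `saturation` and the pivot of `40` used for `url_length` are assumptions made for illustration, and the sketch ignores boosts and any precision loss in how feature values are stored, so scores returned by a real cluster may differ slightly.

```python
def saturation(value: float, pivot: float, positive_score_impact: bool = True) -> float:
    """Saturation score following the formulas in the text (hypothetical helper)."""
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    if positive_score_impact:
        # Grows towards 1 as the feature value grows.
        return value / (value + pivot)
    # Negative score impact: shrinks towards 0 as the feature value grows.
    return pivot / (value + pivot)


# `pagerank` of 50.3 with the pivot of 8 used in the request above.
print(saturation(50.3, 8))
# A value equal to the pivot scores exactly 0.5.
print(saturation(8, 8))
# `url_length` of 42 with an illustrative (made-up) pivot of 40.
print(saturation(42, 40, positive_score_impact=False))
```
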
If `pivot` is not supplied then Elasticsearch will compute a default value that
will be approximately equal to the geometric mean of all feature values that
exist in the index. We recommend this if you haven't had the opportunity to
train a good pivot value.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": {}
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

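As a rough illustration of what a geometric-mean pivot looks like, the following Python sketch computes the geometric mean of the `url_length` values from the three documents indexed earlier. This is only an approximation of the idea: the value Elasticsearch actually computes is described above as approximately the geometric mean, so it will generally not match this number exactly. The helper name `geometric_mean` is an assumption for this sketch.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space (hypothetical helper)."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty collection of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# `url_length` values of the three example documents.
url_lengths = [42, 47, 37]
approx_pivot = geometric_mean(url_lengths)
print(approx_pivot)
```
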
[float]
==== Logarithm

This function gives a score that is equal to `log(scaling_factor + S)` where
`S` is the value of the rank feature and `scaling_factor` is a configurable scaling
factor. Scores are unbounded.

This function only supports rank features that have a positive score impact.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "log": {
        "scaling_factor": 4
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

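The logarithm formula can be sketched in a few lines of Python. The sketch assumes that `log` denotes the natural logarithm, which the text above does not state explicitly, and the helper name `log_score` is hypothetical. Boosts and stored-value precision are ignored.

```python
import math


def log_score(value: float, scaling_factor: float) -> float:
    """log(scaling_factor + S), assuming a natural logarithm (hypothetical helper)."""
    if value <= 0:
        raise ValueError("value must be positive")
    return math.log(scaling_factor + value)


# `pagerank` of 50.3 with the scaling factor of 4 used in the request above.
print(log_score(50.3, 4))
# The score keeps growing with the feature value, so it is unbounded.
print(log_score(5000.0, 4))
```
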
[float]
==== Sigmoid

This function is an extension of `saturation` which adds a configurable
exponent. Scores are computed as `S^exp^ / (S^exp^ + pivot^exp^)`. As with the
`saturation` function, `pivot` is the value of `S` that gives a score of +0.5+
and scores are in +(0, 1)+.

`exponent` must be positive, and is typically in +[0.5, 1]+. A good value should
be computed via training. If you don't have the opportunity to do so, we recommend
that you stick to the `saturation` function instead.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "sigmoid": {
        "pivot": 7,
        "exponent": 0.6
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
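The sigmoid formula can be checked with a small Python sketch of the arithmetic. The helper name `sigmoid` is an assumption for illustration, and the sketch ignores boosts and stored-value precision. It also shows that an exponent of `1` reduces the formula to `S / (S + pivot)`, the saturation formula.

```python
def sigmoid(value: float, pivot: float, exponent: float) -> float:
    """S^exp / (S^exp + pivot^exp), following the formula in the text (hypothetical helper)."""
    if value <= 0 or pivot <= 0 or exponent <= 0:
        raise ValueError("value, pivot and exponent must be positive")
    powered = value ** exponent
    return powered / (powered + pivot ** exponent)


# `pagerank` of 50.3 with the pivot and exponent used in the request above.
print(sigmoid(50.3, 7, 0.6))
# A value equal to the pivot scores 0.5 regardless of the exponent.
print(sigmoid(7, 7, 0.6))
# With an exponent of 1 the result matches S / (S + pivot).
print(sigmoid(50.3, 8, 1))
```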