2016-11-17 04:27:57 -05:00

[[rank-eval]]
= Ranking Evaluation

[partintro]
--
Imagine you have built and deployed a search application: users are happily
entering queries into your search frontend. Your application takes each of
these queries, creates a dedicated Elasticsearch query from it, and returns the
results to the user. Imagine further that you are tasked with tweaking the
generated Elasticsearch query so that it returns specific results for a
certain set of queries without breaking others. How should that be done?

One possible solution is to gather a sample of user queries that is
representative of how the search application is used, and to retrieve the
search results that are being returned. As a next step, these search results
are manually annotated for their relevance to the original user query. Based on
this set of rated requests, we can compute a couple of metrics that tell us
more about how many relevant search results are being returned.

This is a good approximation of how well our translation from user query to
Elasticsearch query works for providing the user with relevant search results.
Elasticsearch provides a ranking evaluation API that lets you compute scores for
your current ranking function based on annotated search results.
--

== Plain ranking evaluation

In its simplest form, a set of ratings can be supplied for each query:

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query", <1>
        "request": {
            "query": {
                "match": {
                    "title": {
                        "query": "JFK"}}}}, <2>
        "ratings": [ <3>
        {
            "rating": 1.5, <4>
            "_type": "tweet", <5>
            "_id": "13736278", <6>
            "_index": "twitter" <7>
        },
        {
            "rating": 1,
            "_type": "tweet",
            "_id": "30900421",
            "_index": "twitter"
        }],
        "summary_fields": ["title"] <8>
    }],
    "metric": { <9>
        "reciprocal_rank": {}
    },
    "max_concurrent_searches": 10 <10>
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> A human-readable id for the rated query, which is reused in the response to provide further details.
<2> The actual Elasticsearch query to execute.
<3> A set of ratings for how well a certain document fits as a response to the query.
<4> A rating expressing how well the document fits the query. Higher is better; ratings are treated as integer values.
<5> The type where the rated document lives.
<6> The id of the rated document.
<7> The index where the rated document lives.
<8> For a verbose response, specify which properties of a search hit should be returned in addition to index/type/id.
<9> A metric to use for evaluation. See below for a list.
<10> Maximum number of search requests to execute in parallel. Set to 10 by default.

== Template based ranking evaluation

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "templates": [{
        "id": "match_query",
        "template": {
            "inline": {
                "query": {
                    "match": {
                        "{{field}}": {
                            "query": "{{query_string}}"}}}}}}], <1>
    "requests": [
    {
        "id": "JFK query",
        "ratings": [
        {
            "rating": 1.5,
            "_type": "tweet",
            "_id": "13736278",
            "_index": "twitter"
        },
        {
            "rating": 1,
            "_type": "tweet",
            "_id": "30900421",
            "_index": "twitter"
        }],
        "params": {
            "query_string": "JFK", <2>
            "field": "opening_text" <2>
        },
        "template_id": "match_query"
    }],
    "metric": {
        "precision": {
            "relevant_rating_threshold": 2
        }
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> The template to use for every rated search request.
<2> The parameters to use to fill the template above.
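
As a rough illustration of what filling a template with `params` means, the following Python sketch substitutes `{{name}}` placeholders in a template body. It is a simplified stand-in for the template rendering that Elasticsearch performs, not its actual implementation; the helper name `render_template` is hypothetical.

```python
import json
import re


def render_template(template, params):
    """Replace {{name}} placeholders in a JSON-like template with params values."""
    text = json.dumps(template)

    def substitute(match):
        # json.dumps(...)[1:-1] escapes the value for use inside a JSON string.
        return json.dumps(str(params[match.group(1)]))[1:-1]

    rendered = re.sub(r"\{\{(\w+)\}\}", substitute, text)
    return json.loads(rendered)


# Template and params taken from the request above.
template = {"query": {"match": {"{{field}}": {"query": "{{query_string}}"}}}}
params = {"query_string": "JFK", "field": "opening_text"}

query = render_template(template, params)
print(json.dumps(query))
```

With the parameters shown, the rendered query is a `match` query for `JFK` on the `opening_text` field, so several rated requests can share one template and differ only in their `params`.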

== Valid evaluation metrics

=== Precision

Quoting the
https://en.wikipedia.org/wiki/Information_retrieval#Precision[Precision page at Wikipedia]:
"Precision is the fraction of the documents retrieved that are relevant to the
user's information need."

Precision works well as an easy-to-explain evaluation metric. Caveat: all result
positions are treated equally, so a ranking of ten results that contains one
relevant result in position 10 scores as well as a ranking of ten results that
contains one relevant result in position 1.

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",
        "request": { "query": { "match_all": {}}},
        "ratings": []
    }],
    "metric": {
        "precision": {
            "relevant_rating_threshold": 1, <1>
            "ignore_unlabeled": false <2>
        }
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> For graded relevance ratings, only ratings above this threshold are
considered relevant results for the given query. By default this is set to 1.
<2> All documents retrieved by the rated request that have no ratings
assigned are treated as irrelevant by default. Set to true to drop them
from the precision computation entirely.
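
To make the effect of `relevant_rating_threshold` and `ignore_unlabeled` concrete, here is a minimal Python sketch of the computation described above. It is not the Elasticsearch implementation: the function name, the document ids `99999999` and `55555555`, and the choice to count a rating equal to the threshold as relevant are assumptions made for illustration.

```python
def precision(hit_ids, ratings, relevant_rating_threshold=1, ignore_unlabeled=False):
    """Return the fraction of retrieved documents that count as relevant.

    hit_ids: document ids in the order the search returned them.
    ratings: dict mapping document id -> integer rating.
    """
    relevant = 0
    considered = 0
    for doc_id in hit_ids:
        if doc_id not in ratings:
            if ignore_unlabeled:
                continue  # drop unrated hits from the computation entirely
            considered += 1  # unrated hits count as irrelevant
            continue
        considered += 1
        # Assumption: a rating equal to the threshold counts as relevant.
        if ratings[doc_id] >= relevant_rating_threshold:
            relevant += 1
    return relevant / considered if considered else 0.0


# Two rated documents and two hypothetical unrated ones.
hits = ["13736278", "99999999", "30900421", "55555555"]
ratings = {"13736278": 2, "30900421": 0}

print(precision(hits, ratings))                         # 0.25
print(precision(hits, ratings, ignore_unlabeled=True))  # 0.5
```

Because only the number of relevant hits matters, reordering `hits` leaves both values unchanged, which is the position-blindness caveat mentioned above.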

=== Reciprocal rank

For any given query, this is the reciprocal of the rank of the first relevant
document retrieved. For example, finding the first relevant result in position 3
means the reciprocal rank is 1/3.

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",
        "request": { "query": { "match_all": {}}},
        "ratings": []
    }],
    "metric": {
        "reciprocal_rank": {}
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]
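
The definition of reciprocal rank can be sketched in a few lines of Python. This is an illustration only, not the Elasticsearch implementation: the function name, the document ids, the threshold parameter (borrowed from the precision metric), and the treatment of unrated hits as irrelevant are all assumptions.

```python
def reciprocal_rank(hit_ids, ratings, relevant_rating_threshold=1):
    """Return 1 / rank of the first relevant hit, or 0.0 if none is relevant.

    hit_ids: document ids in the order the search returned them.
    ratings: dict mapping document id -> integer rating.
    """
    for rank, doc_id in enumerate(hit_ids, start=1):
        # Assumption: unrated hits are treated as irrelevant (rating 0).
        if ratings.get(doc_id, 0) >= relevant_rating_threshold:
            return 1.0 / rank
    return 0.0


# Hypothetical result list: the first relevant document is in position 3.
hits = ["doc_a", "doc_b", "doc_c", "doc_d"]
ratings = {"doc_a": 0, "doc_c": 1, "doc_d": 1}

print(reciprocal_rank(hits, ratings))
```

In this example the first relevant hit sits in position 3, so the function returns 1/3, matching the example in the text. Hits after the first relevant one have no influence on the result.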

=== Normalized discounted cumulative gain

In contrast to the two metrics above, this takes both the grade of the result
found and the position of the document returned into account.

For more details, see the explanation on
https://en.wikipedia.org/wiki/Discounted_cumulative_gain[Wikipedia].

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",
        "request": { "query": { "match_all": {}}},
        "ratings": []
    }],
    "metric": {
        "dcg": {
            "normalize": false <1>
        }
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> Set to true to compute nDCG instead of DCG. The default is false.

Setting `normalize` to true makes DCG values more comparable across different
result set sizes. See also
https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG[Wikipedia nDCG]
for more details.
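
The following Python sketch shows how DCG and its normalized variant relate. It uses the exponential gain form `(2^rating - 1) / log2(position + 1)`, which is one of the two formulations given on the Wikipedia page; whether the API uses exactly this form is an assumption here, as are the function names. For simplicity, the ideal ordering is built only from the ratings of the documents that were returned.

```python
import math


def dcg(ratings_in_rank_order):
    """Discounted cumulative gain for ratings listed in result order."""
    return sum(
        (2 ** rating - 1) / math.log2(position + 1)
        for position, rating in enumerate(ratings_in_rank_order, start=1)
    )


def ndcg(ratings_in_rank_order):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings_in_rank_order, reverse=True))
    return dcg(ratings_in_rank_order) / ideal if ideal > 0 else 0.0


# Hypothetical graded ratings of four hits, in the order they were returned.
returned = [3, 2, 0, 1]

print(dcg(returned))
print(ndcg(returned))
```

A ranking that already lists documents from highest to lowest rating gets an nDCG of 1.0, and any other order of the same ratings scores lower. Plain DCG, by contrast, grows with the number and grade of rated hits, which is why normalized values are easier to compare across result sets of different sizes.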