[[rank-eval]]
= Ranking Evaluation

[partintro]
--
Imagine having built and deployed a search application: users are happily
entering queries into your search frontend. Your application takes these
queries, creates a dedicated Elasticsearch query from them, and returns the
results back to the user. Imagine further that you are tasked with tweaking
the Elasticsearch query that is being created so that it returns specific
results for a certain set of queries without breaking others. How should that
be done?

One possible solution is to gather a sample of user queries representative of
how the search application is used and to retrieve the search results that are
returned for them. As a next step, these search results are manually annotated
for their relevancy to the original user query. Based on this set of rated
requests we can compute a couple of metrics that tell us how many relevant
search results are being returned. This is a good approximation of how well
our translation from user query to Elasticsearch query provides the user with
relevant search results.

Elasticsearch provides a ranking evaluation API that lets you compute scores
for your current ranking function based on annotated search results.
--

== Plain ranking evaluation

In its simplest form, a set of ratings can be supplied for each query:

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",                                                    <1>
        "request": { "query": { "match": { "title": { "query": "JFK" }}}},    <2>
        "ratings": [                                                          <3>
        {
            "rating": 1.5,                                                    <4>
            "_type": "tweet",                                                 <5>
            "_id": "13736278",                                                <6>
            "_index": "twitter"                                               <7>
        },
        {
            "rating": 1,
            "_type": "tweet",
            "_id": "30900421",
            "_index": "twitter"
        }],
        "summary_fields": ["title"]                                           <8>
    }],
    "metric": {                                                               <9>
        "reciprocal_rank": {}
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> A human readable id for the rated query; it is re-used in the response to provide further details.
<2> The actual Elasticsearch query to execute.
<3> A set of ratings describing how well certain documents fit as responses to the query.
<4> A rating expressing how well the document fits the query. Higher is better; ratings are treated as integer values.
<5> The type where the rated document lives.
<6> The id of the rated document.
<7> The index where the rated document lives.
<8> For a verbose response, specify which properties of a search hit should be returned in addition to index/type/id.
<9> A metric to use for evaluation. See below for a list.

== Template based ranking evaluation

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval/template
{
    "template": { "inline": { "query": { "match": { "{{field}}": { "query": "{{query_string}}" }}}}}, <1>
    "requests": [
    {
        "id": "JFK query",
        "ratings": [
        {
            "rating": 1.5,
            "_type": "tweet",
            "_id": "13736278",
            "_index": "twitter"
        },
        {
            "rating": 1,
            "_type": "tweet",
            "_id": "30900421",
            "_index": "twitter"
        }],
        "params": {
            "query_string": "JFK",       <2>
            "field": "opening_text"      <2>
        }
    }],
    "metric": {
        "precision": {
            "relevant_rating_threshold": 2
        }
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> The template to use for every rated search request.
<2> The parameters used to fill the template above.

== Valid evaluation metrics

=== Precision

Quoting the
https://en.wikipedia.org/wiki/Information_retrieval#Precision[Precision page at Wikipedia]:
"Precision is the fraction of the documents retrieved that are relevant to the
user's information need." It works well as an easy to explain evaluation
metric.
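To make the computation concrete, here is a minimal sketch of how precision
over a single rated result list could be calculated. The `precision` function
and its arguments are illustrative only; they mirror the
`relevant_rating_threshold` and `ignore_unlabeled` options shown in the
request below, assuming that ratings at or above the threshold count as
relevant. This is an explanation of the metric, not the Elasticsearch
implementation.

[source,python]
--------------------------------
def precision(ratings_in_result_order, relevant_rating_threshold=1):
    """Fraction of retrieved documents whose rating reaches the threshold.

    Unrated documents (None) count against precision here, which corresponds
    to ignore_unlabeled being left at its default of false.
    """
    retrieved = len(ratings_in_result_order)
    if retrieved == 0:
        return 0.0
    relevant = sum(1 for rating in ratings_in_result_order
                   if rating is not None
                   and rating >= relevant_rating_threshold)
    return relevant / retrieved

# Three relevant documents among five retrieved: precision is 3/5 = 0.6.
print(precision([3, 0, 2, 1, None]))
--------------------------------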
Caveat: All result positions are treated equally. So a ranking of ten results
that contains one relevant result in position 10 is considered equally good as
a ranking of ten results that contains one relevant result in position 1.

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",
        "request": { "query": { "match_all": {}}},
        "ratings": []
    }],
    "metric": {
        "precision": {
            "relevant_rating_threshold": 1,     <1>
            "ignore_unlabeled": false           <2>
        }
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> For graded relevance ratings, only ratings above this threshold are considered relevant results for the given query. Defaults to 1.
<2> By default, all documents retrieved by the rated request that have no rating assigned are treated as irrelevant. Set this to true to drop them from the precision computation entirely.

=== Reciprocal rank

For any given query, this is the reciprocal of the rank of the first relevant
document retrieved. For example, if the first relevant result is found at
position 3, the reciprocal rank is 1/3.

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",
        "request": { "query": { "match_all": {}}},
        "ratings": []
    }],
    "metric": {
        "reciprocal_rank": {}
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

=== Normalized discounted cumulative gain

In contrast to the two metrics above, this metric takes both the rating of the
returned documents and their position into account. For more details, see the
explanation on
https://en.wikipedia.org/wiki/Discounted_cumulative_gain[Wikipedia].

[source,js]
--------------------------------
GET /twitter/tweet/_rank_eval
{
    "requests": [
    {
        "id": "JFK query",
        "request": { "query": { "match_all": {}}},
        "ratings": []
    }],
    "metric": {
        "dcg": {
            "normalize": false      <1>
        }
    }
}
--------------------------------
// CONSOLE
// TEST[setup:twitter]

<1> Set to true to compute nDCG instead of DCG; defaults to false. Normalizing
makes DCG values comparable across result sets of different sizes. See also
https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG[Wikipedia nDCG]
for more details.
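To illustrate the difference that the `normalize` flag controls, the sketch
below computes DCG and nDCG for a single list of ratings in result order. It
assumes the widely used `(2^rating - 1) / log2(rank + 1)` gain formulation and
is meant to explain the metric, not to reproduce the exact Elasticsearch
implementation.

[source,python]
--------------------------------
import math

def dcg(ratings_in_result_order):
    # Higher ratings contribute more gain, and gains at lower positions are
    # discounted logarithmically.
    return sum((2 ** rating - 1) / math.log2(rank + 1)
               for rank, rating in enumerate(ratings_in_result_order, start=1))

def ndcg(ratings_in_result_order):
    # Normalize by the DCG of the ideal ordering (best ratings first),
    # which yields a value between 0 and 1.
    ideal = dcg(sorted(ratings_in_result_order, reverse=True))
    return dcg(ratings_in_result_order) / ideal if ideal > 0 else 0.0

ratings = [3, 2, 3, 0, 1, 2]
print(dcg(ratings))   # raw DCG, grows with the size of the result set
print(ndcg(ratings))  # normalized value, comparable across result set sizes
--------------------------------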