[[query-dsl-feature-query]]
=== Feature Query

The `feature` query is a specialized query that only works on
<<feature,`feature`>> fields and <<feature-vector,`feature_vector`>> fields.
Its goal is to boost the score of documents based on the values of numeric
features. It is typically put in a `should` clause of a
<<query-dsl-bool-query,`bool`>> query so that its score is added to the score
of the query.

Compared to using <<query-dsl-function-score-query,`function_score`>> or other
ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is set to `false`. Speedups may be
spectacular.

Here is an example that indexes various features:
 - https://en.wikipedia.org/wiki/PageRank[`pagerank`], a measure of the
   importance of a website,
 - `url_length`, the length of the URL, which typically correlates negatively
   with relevance,
 - `topics`, which associates a list of topics with every document alongside a
   measure of how well the document is connected to this topic.

The example then runs a query that searches for `"2016"` and boosts scores
based on `pagerank`, `url_length` and the `sports` topic.

[source,js]
--------------------------------------------------
PUT test?include_type_name=true
{
  "mappings": {
    "_doc": {
      "properties": {
        "pagerank": {
          "type": "feature"
        },
        "url_length": {
          "type": "feature",
          "positive_score_impact": false
        },
        "topics": {
          "type": "feature_vector"
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
  "content": "Rio 2016",
  "pagerank": 50.3,
  "url_length": 42,
  "topics": {
    "sports": 50,
    "brazil": 30
  }
}

PUT test/_doc/2
{
  "url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
  "content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
  "pagerank": 50.3,
  "url_length": 47,
  "topics": {
    "sports": 35,
    "formula one": 65,
    "brazil": 20
  }
}

PUT test/_doc/3
{
  "url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
  "content": "Deadpool is a 2016 American superhero film",
  "pagerank": 50.3,
  "url_length": 37,
  "topics": {
    "movies": 60,
    "super hero": 65
  }
}

POST test/_refresh

GET test/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "feature": {
            "field": "pagerank"
          }
        },
        {
          "feature": {
            "field": "url_length",
            "boost": 0.1
          }
        },
        {
          "feature": {
            "field": "topics.sports",
            "boost": 0.4
          }
        }
      ]
    }
  }
}
--------------------------------------------------
// CONSOLE
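
As a rough illustration of how the `should` clauses above contribute to the final score, the following Python sketch adds each boosted feature score to the score of the `must` clause. It is a sketch under stated assumptions, not the engine's implementation: the helper names, the `match_score` of `1.0` and the pivot values are all hypothetical, and every clause is assumed to use the default `saturation` function. A document that lacks a feature simply does not match that clause and gets no contribution from it.

```python
def saturation(value, pivot, positive_score_impact=True):
    """S / (S + pivot), or pivot / (S + pivot) for negative score impact."""
    numerator = value if positive_score_impact else pivot
    return numerator / (value + pivot)


def boosted_score(match_score, doc, clauses):
    """Add the boosted score of every matching feature clause to match_score."""
    total = match_score
    for clause in clauses:
        value = doc.get(clause["field"])
        if value is None:
            continue  # the document has no such feature: the clause does not match
        total += clause.get("boost", 1.0) * saturation(
            value, clause["pivot"], clause.get("positive_score_impact", True)
        )
    return total


# Pivot values are hypothetical; the engine computes its own defaults.
clauses = [
    {"field": "pagerank", "pivot": 50.3},
    {"field": "url_length", "pivot": 42, "boost": 0.1, "positive_score_impact": False},
    {"field": "topics.sports", "pivot": 50, "boost": 0.4},
]

doc1 = {"pagerank": 50.3, "url_length": 42, "topics.sports": 50}
doc3 = {"pagerank": 50.3, "url_length": 37}  # no "sports" topic

print(boosted_score(1.0, doc1, clauses))  # roughly 1.0 + 0.5 + 0.1 * 0.5 + 0.4 * 0.5
print(boosted_score(1.0, doc3, clauses))  # no contribution from topics.sports
```

With these made-up pivots, document 1 sits exactly at each pivot, so every feature clause contributes half of its boost.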

[float]
=== Supported functions

The `feature` query supports 3 functions in order to boost scores using the
values of features. If you do not know where to start, we recommend that you
start with the `saturation` function, which is the default when no function is
provided.

[float]
==== Saturation

This function gives a score that is equal to `S / (S + pivot)` where `S` is the
value of the feature and `pivot` is a configurable pivot value so that the
result will be less than +0.5+ if `S` is less than `pivot` and greater than
+0.5+ otherwise. Scores are always in +(0, 1)+.

If the feature has a negative score impact then the function will be computed as
`pivot / (S + pivot)`, which decreases when `S` increases.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "saturation": {
        "pivot": 8
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
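
To make the formula concrete, here is a minimal Python sketch of the arithmetic described above. It only illustrates the math and is not the engine's implementation: the function name `saturation` and its `positive_score_impact` parameter are made up for this example, and the pivot of `42` for `url_length` is a hypothetical value. Scores returned by a real search may differ slightly.

```python
def saturation(value: float, pivot: float, positive_score_impact: bool = True) -> float:
    """Score in (0, 1) that equals 0.5 when value == pivot."""
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    if positive_score_impact:
        return value / (value + pivot)
    # Negative score impact: the score shrinks as the value grows.
    return pivot / (value + pivot)


# pagerank of the example documents, with the pivot used in the query above
print(round(saturation(50.3, 8), 3))  # 0.863

# url_length is mapped with positive_score_impact=false; pivot 42 is hypothetical
print(saturation(42, 42, positive_score_impact=False))  # 0.5
```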

If +pivot+ is not supplied then Elasticsearch will compute a default value that
will be approximately equal to the geometric mean of all feature values that
exist in the index. We recommend this if you haven't had the opportunity to
train a good pivot value.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "saturation": {}
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
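
The effect of such a default can be sketched in Python. This is only an approximation of the idea, assuming the pivot is exactly the geometric mean of the indexed values; the text above only promises an approximate value, and the helper name `geometric_mean` is made up for this example.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


pageranks = [50.3, 50.3, 50.3]  # pagerank of the three example documents
pivot = geometric_mean(pageranks)

# Every example document has the same pagerank, so each one sits at the pivot
# and receives a saturation score of about 0.5.
score = 50.3 / (50.3 + pivot)
print(pivot, score)
```

A pivot near the geometric mean places typical feature values around the middle of the `saturation` curve, where the function is most sensitive to differences.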

[float]
==== Logarithm

This function gives a score that is equal to `log(scaling_factor + S)` where
`S` is the value of the feature and `scaling_factor` is a configurable scaling
factor. Scores are unbounded.

This function only supports features that have a positive score impact.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "log": {
        "scaling_factor": 4
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
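
The following Python sketch shows the arithmetic behind this function. It is an illustration under assumptions rather than the engine's code: the name `log_score` is made up, and `log` is assumed to be the natural logarithm.

```python
import math


def log_score(value: float, scaling_factor: float) -> float:
    """log(scaling_factor + S), assuming the natural logarithm."""
    if value <= 0:
        raise ValueError("feature values must be positive")
    return math.log(scaling_factor + value)


# pagerank of the example documents with the scaling factor from the query above
print(log_score(50.3, 4))

# Unlike saturation, the score keeps growing with the feature value.
print(log_score(1_000_000, 4))
```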

[float]
==== Sigmoid

This function is an extension of `saturation` which adds a configurable
exponent. Scores are computed as `S^exp^ / (S^exp^ + pivot^exp^)`. As with the
`saturation` function, `pivot` is the value of `S` that gives a score of +0.5+
and scores are in +(0, 1)+.

`exponent` must be positive, but is typically in +[0.5, 1]+. A good value should
be computed via training. If you don't have the opportunity to do so, we recommend
that you stick to the `saturation` function instead.

[source,js]
--------------------------------------------------
GET test/_search
{
  "query": {
    "feature": {
      "field": "pagerank",
      "sigmoid": {
        "pivot": 7,
        "exponent": 0.6
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
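
A minimal Python sketch of the formula, for illustration only: the function name `sigmoid` and its argument names are chosen for this example and do not reflect the engine's implementation.

```python
def sigmoid(value: float, pivot: float, exponent: float) -> float:
    """S^exp / (S^exp + pivot^exp); equals 0.5 when value == pivot."""
    if value <= 0 or pivot <= 0 or exponent <= 0:
        raise ValueError("value, pivot and exponent must be positive")
    powered = value ** exponent
    return powered / (powered + pivot ** exponent)


# A feature value equal to the pivot always scores 0.5
print(sigmoid(7, 7, 0.6))  # 0.5

# pagerank of the example documents with the parameters from the query above
print(sigmoid(50.3, 7, 0.6))
```

With an exponent of `1` the formula reduces to `S / (S + pivot)`, which is the `saturation` function. An exponent below `1` flattens the curve around the pivot.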