mirror of https://github.com/honeymoose/OpenSearch.git
synced 2025-02-17 10:25:15 +00:00
Documentation fix for significant_terms heading levels
parent 933852768d
commit 5f1d9af9fe
@@ -202,8 +202,7 @@ Rare vs common is essentially a precision vs recall balance and so the absolute
 **********************************
 
 
-
-=== Use on free-text fields
+==== Use on free-text fields
 
 The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest:
 
@@ -234,28 +233,27 @@ free-text field and use them in a `terms` query on the same field with a `highli
 are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent.
 ============
 
-
-=== Limitations
+==== Limitations
 
-==== Single _background_ comparison base
+===== Single _background_ comparison base
 The above examples show how to select the _foreground_ set for analysis using a query or parent aggregation to filter but currently there is no means of specifying
 a _background_ set other than the index from which all results are ultimately drawn. Sometimes it may prove useful to use a different
 background set as the basis for comparisons e.g. to first select the tweets for the TV show "XFactor" and then look
 for significant terms in a subset of that content which is from this week.
 
-==== Significant terms must be indexed values
+===== Significant terms must be indexed values
 Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes.
 Because of the way the significant_terms aggregation must consider both _foreground_ and _background_ frequencies
 it would be prohibitively expensive to use a script on the entire index to obtain background frequencies for comparisons.
 Also DocValues are not supported as sources of term data for similar reasons.
 
-==== No analysis of floating point fields
+===== No analysis of floating point fields
 Floating point fields are currently not supported as the subject of significant_terms analysis.
 While integer or long fields can be used to represent concepts like bank account numbers or category numbers which
 can be interesting to track, floating point fields are usually used to represent quantities of something.
 As such, individual floating point terms are not useful for this form of frequency analysis.
 
-==== Use as a parent aggregation
+===== Use as a parent aggregation
 If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the
 top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and
 so there is no difference in document frequencies to observe and from which to make sensible suggestions.
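A minimal sketch of the foreground/background point made in the hunk above, assuming the usual search-body layout of a `query` plus an `aggs` entry of type `significant_terms`. The helper name `significant_terms_request`, the aggregation name `significant_values`, and the field names `show` and `text` are hypothetical and chosen only for illustration. The helper rejects a missing or `match_all` query because, as the documentation text says, the foreground set would then equal the background set.

```python
import json


def significant_terms_request(query, field, agg_name="significant_values"):
    """Build a search body whose query selects the foreground set."""
    # With no query, or a match_all query, foreground == background,
    # so there is no frequency difference to analyse; reject it early.
    if not query or "match_all" in query:
        raise ValueError(
            "significant_terms needs a query that selects a subset of the index"
        )
    return {
        "query": query,
        "aggs": {agg_name: {"significant_terms": {"field": field}}},
    }


# Hypothetical example: documents about one TV show form the foreground set.
body = significant_terms_request({"term": {"show": "xfactor"}}, "text")
print(json.dumps(body, indent=2))
```

The check is done client-side purely for illustration; nothing in the text above says the server enforces it.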
@@ -266,7 +264,7 @@ it can be inefficient and costly in terms of RAM to embed large child aggregatio
 aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of
 significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
 
-==== Approximate counts
+===== Approximate counts
 The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and
 as such may be:
 
@@ -276,11 +274,10 @@ as such may be:
 Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies.
 However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels.
 
-
-=== Parameters
+==== Parameters
 
 
-==== Size & Shard Size
+===== Size & Shard Size
 
 The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
 default, the node coordinating the search process will request each shard to provide its own top term buckets
@@ -302,7 +299,7 @@ will cause extra network traffic and RAM usage so this is quality/cost trade of
 NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will
 override it and reset it to be equal to `size`.
 
-==== Minimum document count
+===== Minimum document count
 
 It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option:
 
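A rough sketch of how the `size`, `shard_size` and `min_doc_count` options named in the hunk above fit into one aggregation body. The helper `significant_terms_agg` and the field name `tags` are hypothetical. The `max(...)` call only mirrors, on the client side, the server behaviour described in the NOTE (a `shard_size` smaller than `size` is reset to `size`); it is not how the server implements it. The default of 3 for `min_doc_count` is the default this documentation states.

```python
def significant_terms_agg(field, size, shard_size=None, min_doc_count=3):
    """Return a significant_terms aggregation body with tuning options."""
    params = {"field": field, "size": size, "min_doc_count": min_doc_count}
    if shard_size is not None:
        # Mirror the documented rule: shard_size is never smaller than size.
        params["shard_size"] = max(shard_size, size)
    return {"significant_terms": params}


# Hypothetical usage: ask for 5 buckets but pass a too-small shard_size.
agg = significant_terms_agg("tags", size=5, shard_size=2, min_doc_count=10)
print(agg)
```

Raising `shard_size` above `size` trades extra network traffic and RAM for more accurate counts, which is the trade-off the surrounding text describes.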
@@ -328,7 +325,7 @@ WARNING: Setting `min_doc_count` to `1` is generally not advised as it tends to
 default value of 3 is used to provide a minimum weight-of-evidence.
 
 
-==== Filtering Values
+===== Filtering Values
 
 It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the `include` and
 `exclude` parameters which are based on regular expressions. This functionality mirrors the features
@@ -392,7 +389,7 @@ http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CA
 http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS[`UNICODE_CHARACTER_CLASS`] and
 http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNIX_LINES[`UNIX_LINES`]
 
-==== Execution hint
+===== Execution hint
 
 There are two mechanisms by which terms aggregations can be executed: either by using field values directly in order to aggregate
 data per-bucket (`map`), or by using ordinals of the field values instead of the values themselves (`ordinals`). Although the