= The Stats Component
:page-shortname: the-stats-component
:page-permalink: the-stats-component.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

The Stats component returns simple statistics for numeric, string, and date fields within the document set.

The sample queries in this section assume you are running the `techproducts` example included with Solr:

[source,bash]
----
bin/solr -e techproducts
----

== Stats Component Parameters

The Stats Component accepts the following parameters:

`stats`::
If `true`, invokes the Stats component.

`stats.field`::
Specifies a field for which statistics should be generated. This parameter may be specified multiple times in a query to request statistics on multiple fields.
+
<<local-parameters-in-queries.adoc#local-parameters-in-queries,Local Parameters>> may be used to indicate which subset of the supported statistics should be computed, and/or that statistics should be computed over the results of an arbitrary numeric function (or query) instead of a simple field name. See the examples below.

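Because `stats.field` can be repeated, a client has to send it as a multi-valued parameter rather than as a single dictionary key. The following is a minimal Python sketch of that idea, assuming the `techproducts` example is running at the default local address; the helper name `build_stats_url` is hypothetical and not part of Solr.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: the techproducts example on a default local Solr install.
BASE_URL = "http://localhost:8983/solr/techproducts/select"


def build_stats_url(stats_fields, query="*:*", base_url=BASE_URL):
    """Build a select URL that enables the Stats component for each given field.

    Hypothetical helper for illustration only.
    """
    params = [("q", query), ("stats", "true")]
    # stats.field may appear several times, so use a list of pairs, not a dict.
    params += [("stats.field", field) for field in stats_fields]
    params.append(("rows", "0"))
    # urlencode percent-encodes characters such as '{', '!' and quotes.
    return base_url + "?" + urlencode(params)


url = build_stats_url(["price", "popularity", "{!func}termfreq('text','memory')"])
print(url)

# Round trip: decode the query string and list the stats.field values again.
decoded = [v for k, v in parse_qsl(urlsplit(url).query) if k == "stats.field"]
print(decoded)
```

The list-of-pairs form keeps the parameters in the order given, which makes the request easy to compare with hand-written URLs.
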
=== Stats Component Example

The query below demonstrates computing stats against two different numeric fields, as well as stats over the results of a `termfreq()` function call using the `text` field:

`\http://localhost:8983/solr/techproducts/select?q=*:*&stats=true&stats.field={!func}termfreq('text','memory')&stats.field=price&stats.field=popularity&rows=0&indent=true`

[source,xml]
----
<lst name="stats">
  <lst name="stats_fields">
    <lst name="termfreq(text,memory)">
      <double name="min">0.0</double>
      <double name="max">3.0</double>
      <long name="count">32</long>
      <long name="missing">0</long>
      <double name="sum">10.0</double>
      <double name="sumOfSquares">22.0</double>
      <double name="mean">0.3125</double>
      <double name="stddev">0.7803018439949604</double>
      <lst name="facets"/>
    </lst>
    <lst name="price">
      <double name="min">0.0</double>
      <double name="max">2199.0</double>
      <long name="count">16</long>
      <long name="missing">16</long>
      <double name="sum">5251.270030975342</double>
      <double name="sumOfSquares">6038619.175900028</double>
      <double name="mean">328.20437693595886</double>
      <double name="stddev">536.3536996709846</double>
      <lst name="facets"/>
    </lst>
    <lst name="popularity">
      <double name="min">0.0</double>
      <double name="max">10.0</double>
      <long name="count">15</long>
      <long name="missing">17</long>
      <double name="sum">85.0</double>
      <double name="sumOfSquares">603.0</double>
      <double name="mean">5.666666666666667</double>
      <double name="stddev">2.943920288775949</double>
      <lst name="facets"/>
    </lst>
  </lst>
</lst>
----

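A response in this shape can be turned into a plain dictionary with a few lines of client code. The sketch below uses Python's standard `xml.etree.ElementTree` on an excerpt of the response above (only part of the `price` entry, for brevity). The function name `parse_stats` is hypothetical, and the sketch assumes only the `<long>` and `<double>` children are of interest, so nested `<lst>` elements such as `facets` are skipped.

```python
import xml.etree.ElementTree as ET

# Excerpt of the example response: a subset of the `price` statistics.
RESPONSE = """
<lst name="stats">
  <lst name="stats_fields">
    <lst name="price">
      <double name="min">0.0</double>
      <double name="max">2199.0</double>
      <long name="count">16</long>
      <long name="missing">16</long>
      <lst name="facets"/>
    </lst>
  </lst>
</lst>
"""


def parse_stats(xml_text):
    """Return {output_key: {stat_name: number}} for a <lst name="stats"> element.

    Hypothetical helper for illustration; nested <lst> children are ignored.
    """
    root = ET.fromstring(xml_text.strip())
    fields = root.find("./lst[@name='stats_fields']")
    result = {}
    for field in fields.findall("lst"):
        stats = {}
        for child in field:
            if child.tag == "long":
                stats[child.get("name")] = int(child.text)
            elif child.tag == "double":
                stats[child.get("name")] = float(child.text)
        result[field.get("name")] = stats
    return result


stats = parse_stats(RESPONSE)
print(stats["price"]["count"], stats["price"]["max"])
```
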
== Statistics Supported

The list below describes the statistics supported by the Stats component. Not all statistics are supported for all field types, and not all statistics are computed by default (see <<Local Parameters with the Stats Component>> below for details).

`min`::
The minimum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default.

`max`::
The maximum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default.

`sum`::
The sum of all values of the field/function in all documents in the set. This statistic is computed for numeric and date field types and is computed by default.

`count`::
The number of values found in all documents in the set for this field/function. This statistic is computed for all field types and is computed by default.

`missing`::
The number of documents in the set which do not have a value for this field/function. This statistic is computed for all field types and is computed by default.

`sumOfSquares`::
The sum of the squares of all values (a by-product of computing `stddev`). This statistic is computed for numeric and date field types and is computed by default.

`mean`::
The average `(v1 + v2 + ... + vN)/N`. This statistic is computed for numeric and date field types and is computed by default.

`stddev`::
Standard deviation, measuring how widely spread the values in the data set are. This statistic is computed for numeric and date field types and is computed by default.

`percentiles`::
A list of percentile values based on cut-off points specified by the parameter value, such as `1,99,99.9`. These values are an approximation, using the https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[t-digest algorithm]. This statistic is computed for numeric field types and is not computed by default.

`distinctValues`::
The set of all distinct values for the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a very low cardinality. This statistic is computed for all field types but is not computed by default.

`countDistinct`::
The exact number of distinct values in the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a very low cardinality. This statistic is computed for all field types but is not computed by default.

`cardinality`::
A statistical approximation (currently using the https://en.wikipedia.org/wiki/HyperLogLog[HyperLogLog] algorithm) of the number of distinct values in the field/function in all of the documents in the set. This calculation is much more efficient than using the `countDistinct` option, but may not be 100% accurate.
+
Input for this option can be a floating point number between `0.0` and `1.0` indicating how aggressively the algorithm should try to be accurate: `0.0` means use as little memory as possible; `1.0` means use as much memory as needed to be as accurate as possible. `true` is supported as an alias for `0.3`.
+
This statistic is computed for all field types but is not computed by default.

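The remark that `sumOfSquares` is a by-product of `stddev` can be made concrete: given only `count`, `sum`, and `sumOfSquares`, both `mean` and a standard deviation can be derived without revisiting the values. The sketch below uses the sample (n - 1) form of the variance. That choice is an assumption inferred from the numbers in the example response earlier on this page, which it reproduces; it is not a statement about how Solr implements the calculation. The function name is hypothetical.

```python
import math


def mean_and_stddev(count, total, sum_of_squares):
    """Derive mean and sample standard deviation from running sums.

    Assumption: sample (n - 1) variance, chosen because it matches the
    example response; hypothetical helper for illustration only.
    """
    if count < 2:
        raise ValueError("need at least two values for a sample deviation")
    mean = total / count
    # Algebraically equal to sum((v - mean)^2) / (count - 1).
    variance = (count * sum_of_squares - total * total) / (count * (count - 1))
    # Guard against tiny negative values caused by floating point rounding.
    return mean, math.sqrt(max(variance, 0.0))


# count, sum, and sumOfSquares for termfreq(text,memory) in the example response.
mean, stddev = mean_and_stddev(32, 10.0, 22.0)
print(mean, stddev)
```

Keeping only three running totals per field is what makes these statistics cheap to merge across partial results, although the subtraction in the variance can lose precision when values are large and close together.
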
== Local Parameters with the Stats Component

Similar to the <<faceting.adoc#faceting,Facet Component>>, the `stats.field` parameter supports local parameters for:

* Tagging & Excluding Filters: `stats.field={!ex=filterA}price`
* Changing the Output Key: `stats.field={!key=my_price_stats}price`
* Tagging stats for <<The Stats Component and Faceting,use with `facet.pivot`>>: `stats.field={!tag=my_pivot_stats}price`

Local parameters can also be used to specify individual statistics by name, overriding the set of statistics computed by default, e.g.: `stats.field={!min=true max=true percentiles='99,99.9,99.99'}price`

[IMPORTANT]
====
If any supported statistics are specified via local parameters, then the entire set of default statistics is overridden and only the requested statistics are computed.
====

Additional "Expert" local params are supported in some cases for affecting the behavior of some statistics:

* `percentiles`
** `tdigestCompression` - a positive numeric value defaulting to `100.0` controlling the compression factor of the T-Digest. Larger values mean more accuracy, but also use more memory.
* `cardinality`
** `hllPreHashed` - a boolean option indicating that the statistics are being computed over a "long" field that has already been hashed at index time, allowing the HLL computation to skip this step.
** `hllLog2m` - an integer value specifying an explicit "log2m" value to use, overriding the heuristic value determined by the cardinality local param and the field type. See the https://github.com/aggregateknowledge/java-hll/[java-hll] documentation for more details.
** `hllRegwidth` - an integer value specifying an explicit "regwidth" value to use, overriding the heuristic value determined by the cardinality local param and the field type. See the https://github.com/aggregateknowledge/java-hll/[java-hll] documentation for more details.

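Assembling the `{!...}` prefix by hand is error-prone once several local parameters are combined. The following Python sketch formats a `stats.field` value from keyword arguments. The helper name `stats_field` is hypothetical, and the quoting rule (single quotes around values containing a space or comma) is an assumption modeled on the `percentiles='99,99.9,99.99'` example above; values that themselves contain single quotes are not handled.

```python
def stats_field(field, **local_params):
    """Format a stats.field value with optional local parameters.

    Hypothetical helper for illustration only.
    """
    if not local_params:
        return field
    parts = []
    for name, value in local_params.items():
        if isinstance(value, bool):
            # Solr expects lowercase true/false, not Python's True/False.
            text = "true" if value else "false"
        else:
            text = str(value)
            # Assumption: quote values that contain a space or a comma.
            if " " in text or "," in text:
                text = "'" + text + "'"
        parts.append(name + "=" + text)
    return "{!" + " ".join(parts) + "}" + field


print(stats_field("price", key="my_price_stats"))
print(stats_field("price", min=True, max=True, percentiles="99,99.9,99.99"))
print(stats_field("price", percentiles="90,99", tdigestCompression=200))
```

The resulting string still needs URL encoding before it is placed in a request.
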
=== Examples with Local Parameters

Here we compute some statistics for the price field. All of the default statistics are computed against the products that are in stock (`q=*:*` and `fq=inStock:true`), and independently the min, max, mean, 90th, and 99th percentile price values are computed against all products regardless of whether they are in stock or not (by excluding that filter).

`\http://localhost:8983/solr/techproducts/select?q=*:*&fq={!tag=stock_check}inStock:true&stats=true&stats.field={!ex=stock_check+key=all_prices+min=true+max=true+mean=true+percentiles='90,99'}price&stats.field={!key=instock_prices}price&rows=0&indent=true`

[source,xml]
----
<lst name="stats">
  <lst name="stats_fields">
    <lst name="all_prices">
      <double name="min">0.0</double>
      <double name="max">2199.0</double>
      <double name="mean">328.20437693595886</double>
      <lst name="percentiles">
        <double name="90.0">564.9700012207031</double>
        <double name="99.0">1966.6484985351556</double>
      </lst>
    </lst>
    <lst name="instock_prices">
      <double name="min">0.0</double>
      <double name="max">2199.0</double>
      <long name="count">12</long>
      <long name="missing">5</long>
      <double name="sum">4089.880027770996</double>
      <double name="sumOfSquares">5385249.921747174</double>
      <double name="mean">340.823335647583</double>
      <double name="stddev">602.3683083752779</double>
    </lst>
  </lst>
</lst>
----

== The Stats Component and Faceting

Sets of `stats.field` parameters can be referenced by `tag` when using Pivot Faceting to compute multiple statistics at every level (i.e., field) in the tree of pivot constraints.

For more information and a detailed example, please see <<faceting.adoc#combining-stats-component-with-pivots,Combining Stats Component With Pivots>>.