Commit Graph

114 Commits

Author SHA1 Message Date
Tal Levy 5e90ff32f7
Add Normalize Pipeline Aggregation (#56399) (#56792)
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.

The aggregations supports the following normalizations

- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization

To specify which normalization is to be used, it can be specified
in the normalize agg's `normalizer` field.

For example:

```
{
  "normalize": {
    "buckets_path": <>,
    "normalizer": "percent"
  }
}
```
2020-05-14 17:40:15 -07:00
Ignacio Vera 222ee721ec
Add moving percentiles pipeline aggregation (#55441) (#56575)
Similar to what the moving function aggregation does, except merging windows of percentiles
sketches together instead of cumulatively merging final metrics
2020-05-12 11:35:23 +02:00
Nik Everett 2f38aeb5e2
Save memory when numeric terms agg is not top (#55873) (#56454)
Right now all implementations of the `terms` agg allocate a new
`Aggregator` per bucket. This uses a bunch of memory. Exactly how much
isn't clear but each `Aggregator` ends up making its own objects to read
doc values which have non-trivial buffers. And it forces all of it
sub-aggregations to do the same. We allocate a new `Aggregator` per
bucket for two reasons:

1. We didn't have an appropriate data structure to track the
   sub-ordinals of each parent bucket.
2. You can only make a single call to `runDeferredCollections(long...)`
   per `Aggregator` which was the only way to delay collection of
   sub-aggregations.

This change switches the method that builds aggregation results from
building them one at a time to building all of the results for the
entire aggregator at the same time.

It also adds a fairly simplistic data structure to track the sub-ordinals
for `long`-keyed buckets.

It uses both of those to power numeric `terms` aggregations and removes
the per-bucket allocation of their `Aggregator`. This fairly
substantially reduces memory consumption of numeric `terms` aggregations
that are not the "top level", especially when those aggregations contain
many sub-aggregations. It also is a pretty big speed up, especially when
the aggregation is under a non-selective aggregation like
the `date_histogram`.

I picked numeric `terms` aggregations because those have the simplest
implementation. At least, I could kind of fit it in my head. And I
haven't fully understood the "bytes"-based terms aggregations, but I
imagine I'll be able to make similar optimizations to them in follow up
changes.
2020-05-08 20:38:53 -04:00
Nik Everett e35919d3b8
Optimize date_histograms across daylight savings time (backport of #55559) (#56334)
Rounding dates on a shard that contains a daylight savings time transition
is currently something like 1400% slower than when a shard contains dates
only on one side of the DST transition. And it makes a ton of short lived
garbage. This replaces that implementation with one that benchmarks to
having around 30% overhead instead of the 1400%. And it doesn't generate
any garbage per search hit.

Some background:
There are two ways to round in ES:
* Round to the nearest time unit (Day/Hour/Week/Month/etc)
* Round to the nearest time *interval* (3 days/2 weeks/etc)

I'm only optimizing the first one in this change and plan to do the second
in a follow up. It turns out that rounding to the nearest unit really *is*
two problems: when the unit rounds to midnight (day/week/month/year) and
when it doesn't (hour/minute/second). Rounding to midnight is consistently
about 25% faster and rounding to individual hour or minutes.

This optimization relies on being able to *usually* figure out what the
minimum and maximum dates are on the shard. This is similar to an existing
optimization where we rewrite time zones that aren't fixed
(think America/New_York and its daylight savings time transitions) into
fixed time zones so long as there isn't a daylight savings time transition
on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement
time interval rounding the time zone rewriting optimization *should* no
longer be needed.

This optimization doesn't come into play for `composite` or
`auto_date_histogram` aggs because neither have been migrated to the new
`DATE` `ValuesSourceType` which is where that range lookup happens. When
they are they will be able to pick up the optimization without much work.
I expect this to be substantial for `auto_date_histogram` but less so for
`composite` because it deals with fewer values.

Note: My 30% overhead figure comes from small numbers of daylight savings
time transitions. That overhead gets higher when there are more
transitions in logarithmic fashion. When there are two thousand years
worth of transitions my algorithm ends up being 250% slower than rounding
without a time zone, but java time is 47000% slower at that point,
allocating memory as fast as it possibly can.
2020-05-07 09:10:51 -04:00
Julie Tibshirani e852bb29b7
Simplify signature of FieldMapper#parseCreateField. (#56144)
`FieldMapper#parseCreateField` accepts the parse context, plus a list of fields
as an output parameter. These fields are immediately added to the document
through `ParseContext#doc()`.

This commit simplifies the signature by removing the list of fields, and having
the mappers add the fields directly to `ParseContext#doc()`. I think this is
nicer for implementors, because previously fields could be added either through
the list, or the context (through `add`, `addWithKey`, etc.)
2020-05-06 11:12:09 -07:00
Christos Soulios c65f828cb7
[7.x] Histogram field type support for ValueCount and Avg aggregations (#56099)
Backports #55933 to 7.x

Implements value_count and avg aggregations over Histogram fields as discussed in #53285

- value_count returns the sum of all counts array of the histograms
- avg computes a weighted average of the values array of the histogram by multiplying each value with its associated element in the counts array
2020-05-04 13:23:02 +03:00
Ryan Ernst 52b9d8d15e
Convert remaining license methods to isAllowed (#55908) (#55991)
This commit converts the remaining isXXXAllowed methods to instead of
use isAllowed with a Feature value. There are a couple other methods
that are static, as well as some licensed features that check the
license directly, but those will be dealt with in other followups.
2020-04-30 15:52:22 -07:00
Igor Motov d8f9df771d
Expose agg usage in Feature Usage API (#55732) (#56048)
Counts usage of the aggs and exposes them on the _nodes/usage/.

Closes #53746
2020-04-30 12:53:36 -04:00
Christos Soulios 43dab77186
[7.x] Modified searchAndReduce() to return empty agg when no docs exist (#55967)
Backports #55826 to 7.x

    Modified AggregatorTestCase.searchAndReduce() method so that it returns an empty aggregation result when no documents have been inserted.

    Also refactored several aggregation tests so they do not re-implement method AggregatorTestCase.testCase()

    Fixes #55824
2020-04-30 00:28:32 +03:00
Christos Soulios 02bf0c586a
[7.x] Histogram field type support for Sum aggregation (#55916)
Implements Sum aggregation over Histogram fields by summing the value of each bucket multiplied by their count as requested in #53285

Backports #55681 to 7.x
2020-04-29 15:06:12 +03:00
Mark Tozzi 22a98ec279
Aggregation support for Value Scripts that change types (#54830) (#55752) 2020-04-27 09:57:05 -04:00
Mark Tozzi 87b4979c24
[7.x] Make ValuesSourceRegistry immutable after initilization #55493 (#55697) 2020-04-24 13:33:38 -04:00
Zachary Tong 715c90bf7d Aggs must specify a `field` or `script` (or both) (#52226)
This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user.  This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)
2020-04-23 19:23:41 -04:00
Zachary Tong 4f483ac370 Fix half-float range in SupportedTypeTests (#55409)
Also adds a comment to the half-float number field type tests indicating
why 70000 is used instead of 65504
2020-04-23 11:36:37 -04:00
Zachary Tong f46b567563 Convert InternalAggTestCase to AbstractNamedWriteableTestCase (#55250)
Some aggregations, such as the Terms* family, will use an alternate
class to represent unmapped shard results (while the rest of the aggs
use the same object but with some form of "empty" or "nullish" values
to represent unmapped).

This was problematic with AbstractWireSerializingTestCase because it
expects the instanceReader to always match the original class.  Instead,
we need to use the NamedWriteable version so that the registry
can be consulted for the proper deserialization reader.
2020-04-17 16:39:38 -04:00
Tanguy Leroux 71855fbfe0 Mute testSupportedFieldTypes in HDRPreAggregatedPercentile tests (#55369)
Relates #55360
2020-04-17 10:49:43 +02:00
Mark Tozzi 22c55180c1
[7.x] Backport ValuesSourceRegistry and related work (#54922)
* Add ValuesSource Registry and associated logic (#54281)

* Remove ValuesSourceType argument to ValuesSourceAggregationBuilder (#48638)

* ValuesSourceRegistry Prototype (#48758)

* Remove generics from ValuesSource related classes (#49606)

* fix percentile aggregation tests (#50712)

* Basic thread safety for ValuesSourceRegistry (#50340)

* Remove target value type from ValuesSourceAggregationBuilder (#49943)

* Cleanup default values source type (#50992)

* CoreValuesSourceType no longer implements Writable (#51276)

* Remove genereics & hard coded ValuesSource references from Matrix Stats (#51131)

* Put values source types on fields (#51503)

* Remove VST Any (#51539)

* Rewire terms agg to use new VS registry (#51182)

Also adds some basic AggTestCases for untested code
paths (and boilerplate for future tests once the IT are
converted over)

* Wire Cardinality aggregation to work with the ValuesSourceRegistry (#51337)

* Wire Percentiles aggregator into new VS framework (#51639)

This required a bit of a refactor to percentiles itself.  Before,
the Builder would switch on the chosen algo to generate an
algo-specific factory.  This doesn't work (or at least, would be
difficult) in the new VS framework.

This refactor consolidates both factories together and introduces
a PercentilesConfig object to act as a standardized way to pass
algo-specific parameters through the factory.  This object
is then used when deciding which kind of aggregator to create

Note: CoreValuesSourceType.HISTOGRAM still lives in core, and will
be moved in a subsequent PR.

* Remove generics and target value type from MultiVSAB (#51647)

* fix checkstyle after merge (#52008)

* Plumb ValuesSourceRegistry through to QuerySearchContext (#51710)

* Convert RareTerms to new VS registry (#52166)

* Wire up Value Count (#52225)

* Wire up Max & Min aggregations (#52219)

* ValuesSource refactoring: Wire up Sum aggregation (#52571)

* ValuesSource refactoring: Wire up SigTerms aggregation (#52590)

* Soft immutability for VSConfig (#52729)

* Unmute testSupportedFieldTypes, fix Percentiles/Ranks/Terms tests (#52734)

Also fixes Percentiles which was incorrectly specified to only accept
numeric, but in fact also accepts Boolean and Date (because those are
numeric on master - thanks `testSupportedFieldTypes` for catching it!)

* VS refactoring: Wire up stats aggregation (#52891)

* ValuesSource refactoring: Wire up string_stats aggregation (#52875)

* VS refactoring: Wire up median (MAD) aggregation (#52945)

* fix valuesourcetype issue with constant_keyword field (#53041)x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/job/RollupIndexer.java

this commit implements `getValuesSourceType` for
the ConstantKeyword field type.

master was merged into feature/extensible-values-source
introducing a new field type that was not implementing
`getValuesSourceType`.

* ValuesSource refactoring: Wire up Avg aggregation (#52752)

* Wire PercentileRanks aggregator into new VS framework  (#51693)

* Add a VSConfig resolver for aggregations not using the registry (#53038)

* Vs refactor wire up ranges and date ranges (#52918)

* Wire up geo_bounds aggregation to ValuesSourceRegistry (#53034)

This commit updates the geo_bounds aggregation to depend
on registering itself in the ValuesSourceRegistry

relates #42949.

* VS refactoring: convert Boxplot to new registry (#53132)

* Wire-up geotile_grid and geohash_grid to ValuesSourceRegistry (#53037)

This commit updates the geo*_grid aggregations to depend
on registering itself in the ValuesSourceRegistry

relates to the values-source refactoring meta issue #42949.

* Wire-up geo_centroid agg to ValuesSourceRegistry (#53040)

This commit updates the geo_centroid aggregation to depend
on registering itself in the ValuesSourceRegistry.

relates to the values-source refactoring meta issue #42949.

* Fix type tests for Missing aggregation (#53501)

* ValuesSource Refactor: move histo VSType into XPack module (#53298)

- Introduces a new API (`getBareAggregatorRegistrar()`) which allows plugins to register aggregations against existing agg definitions defined in Core.
- This moves the histogram VSType over to XPack where it belongs. `getHistogramValues()` still remains as a Core concept
- Moves the histo-specific bits over to xpack (e.g. the actual aggregator logic). This requires extra boilerplate since we need to create a new "Analytics" Percentile/Rank aggregators to deal with the histo field. Doubly-so since percentiles/ranks are extra boiler-plate'y... should be much lighter for other aggs

* Wire up DateHistogram to the ValuesSourceRegistry (#53484)

* Vs refactor parser cleanup (#53198)

Co-authored-by: Zachary Tong <polyfractal@elastic.co>
Co-authored-by: Zachary Tong <zach@elastic.co>
Co-authored-by: Christos Soulios <1561376+csoulios@users.noreply.github.com>
Co-authored-by: Tal Levy <JubBoy333@gmail.com>

* First batch of easy fixes

* Remove List.of from ValuesSourceRegistry

Note that we intend to have a follow up PR dealing with the mutability
of the registry, so I didn't even try to address that here.

* More compiler fixes

* More compiler fixes

* More compiler fixes

* Precommit is happy and so am I

* Add new Core VSTs to tests

* Disabled supported type test on SigTerms until we can backport it's fix

* fix checkstyle

* Fix test failure from semantic merge issue

* Fix some metaData->metadata replacements that got lost

* Fix list of supported types for MinAggregator

* Fix list of supported types for Avg

* remove unused import

Co-authored-by: Zachary Tong <polyfractal@elastic.co>
Co-authored-by: Zachary Tong <zach@elastic.co>
Co-authored-by: Christos Soulios <1561376+csoulios@users.noreply.github.com>
Co-authored-by: Tal Levy <JubBoy333@gmail.com>
2020-04-16 16:54:46 -04:00
David Turner 7941f4a47e Add RepositoriesService to createComponents() args (#54814)
Today we pass the `RepositoriesService` to the searchable snapshots plugin
during the initialization of the `RepositoryModule`, forcing the plugin to be a
`RepositoryPlugin` even though it does not implement any repositories.

After discussion we decided it best for now to pass this in via
`Plugin#createComponents` instead, pending some future work in which plugins
can depend on services more dynamically.
2020-04-16 16:27:36 +01:00
Igor Motov 1754e50cbd
[7.x] Add analytics plugin usage stats to _xpack/usage (#54911) (#55162)
Adds analytics plugin usage stats to _xpack/usage.

Closes #54847
2020-04-14 17:03:14 -04:00
Igor Motov 51c6f69e02
[7.x] Add support for filters to T-Test aggregation (#54980) (#55066)
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes #53692
2020-04-13 12:28:58 -04:00
Igor Motov 6861295706
Further improve InternalTTestTests (#55081)
A small follow-up to #54910. Now that we can generated consistent set of
internal aggs to reduce, we no longer need to keep agg parameters as class
variables.

Related to #54910
2020-04-13 10:26:23 -04:00
Nik Everett c00811f3a3
Make some agg tests easier to read (#54954) (#55079)
We added a fancy method to provide random realistic test data to the
reduction tests in #54910. This uses that to remove some of the more
esoteric machinations in the agg tests. This will marginally increase
the coverage of the serialiation tests and, more importantly, remove
some mysterious value generation code that only really made sense for
random reduction tests but was used all over the place. It doesn't, on
the other hand, make the tests shorter. Just *hopefully* more clear.

I only cleaned up a few tests this way. If we like this it'd probably be
worth grabbing others.
2020-04-10 14:15:30 -04:00
Nik Everett ce7ae4a7d1
Remove pipline aggs from agg result tree (backport of #54716) (#54920)
This removes pipeline aggregators from the aggregation result tree
except for a single field used for backwards compatibility with pre-7.8
versions of Elasticsearch. That field isn't populated unless we are
serializing to pre-7.8 Elasticsearch. So, good news! We no longer build
pipeline aggregators on the data node. Most of the time.
2020-04-07 17:22:23 -04:00
Nik Everett 100f7258c7
Improve agg reduce tests (#54910) (#54914)
This allows subclasses of `InternalAggregationTestCase` to make a `List`
of values to reduce so that it can make values that are realistic
*together*. The first use of this is with `InternalTTest` which uses it
to make results that don't cause their `sum` field to wrap. It'd likely
be useful for a ton of other aggs but just one for now.
2020-04-07 17:22:04 -04:00
Nik Everett faa687c0ae Fix InternalTTestTests
`testReduceRandom` was bumping up against the serialization that I added
in #54776. This makes it use random values that reduce in ways that
don't cause the randomized serialization to fail.
2020-04-07 11:51:54 -04:00
Nik Everett 3c56e0de42
Fix scripted metric in ccs (backport of #54776) (#54888)
`scripted_metric` did not work with cross cluster search because it
assumed that you'd never perform a partial reduction, serialize the
results, and then perform a final reduction. That
serialized-after-partial-reduction step was broken.

This is also required to support #54758.
2020-04-07 10:43:00 -04:00
Igor Motov 2794572a35
[7.x] Add Student's t-test aggregation support (#54469) (#54737)
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to #53692
2020-04-06 11:36:47 -04:00
Nik Everett 54ea4f4f50 Begin to drop pipeline aggs from the result tree (backport of #54311) (#54659)
Removes pipeline aggregations from the aggregation result tree as they
are no longer used. This stops us from building the pipeline aggregators
at all on data nodes except for backwards compatibility serialization.
This will save a tiny bit of space in the aggregation tree which is
lovely, but the biggest benefit is that it is a step towards simplifying
pipeline aggregators.

This only does about half of the work to remove the pipeline aggs from
the tree. Removing all of it would, well, double the size of the change
and make it harder to review.
2020-04-02 16:45:12 -04:00
Zachary Tong 20d67720aa
Refactor Percentiles/Ranks aggregation builders and factories (#51887) (#54537)
- Consolidates HDR/TDigest factories into a single factory
- Consolidates most HDR/TDigest builder into an abstract builder
- Deprecates method(), compression(), numSigFig() in favor of a new
unified PercentileConfig object
- Disallows setting algo options that don't apply to current algo

The unified config method carries both the method and algo-specific
setting. This provides a mechanism to reject settings that apply
to the wrong algorithm.  For BWC the old methods are retained
but marked as deprecated, and can be removed in future versions.

Co-authored-by: Mark Tozzi <mark.tozzi@gmail.com>

Co-authored-by: Mark Tozzi <mark.tozzi@gmail.com>
2020-04-02 10:39:41 -04:00
Jason Tedor 63e5f2b765
Rename META_DATA to METADATA
This is a follow up to a previous commit that renamed MetaData to
Metadata in all of the places. In that commit in master, we renamed
META_DATA to METADATA, but lost this on the backport. This commit
addresses that.
2020-03-31 17:30:51 -04:00
Jason Tedor 5fcda57b37
Rename MetaData to Metadata in all of the places (#54519)
This is a simple naming change PR, to fix the fact that "metadata" is a
single English word, and for too long we have not followed general
naming conventions for it. We are also not consistent about it, for
example, METADATA instead of META_DATA if we were trying to be
consistent with MetaData (although METADATA is correct when considered
in the context of "metadata"). This was a simple find and replace across
the code base, only taking a few minutes to fix this naming issue
forever.
2020-03-31 17:24:38 -04:00
Zachary Tong c9db2de41d
[7.x] Comprehensively test supported/unsupported field type:agg combinations (#54451)
* Comprehensively test supported/unsupported field type:agg combinations (#52493)

This adds a test to AggregatorTestCase that allows us to programmatically
verify that an aggregator supports or does not support a particular
field type.  It fetches the list of registered field type parsers,
creates a MappedFieldType from the parser and then attempts to run
a basic agg against the field.

A supplied list of supported VSTypes are then compared against the
output (success or exception) and suceeds or fails the test accordingly.

Co-Authored-By: Mark Tozzi <mark.tozzi@gmail.com>
* Skip fields that are not aggregatable

* Use newIndexSearcher() to avoid incompatible readers (#52723)

Lucene's `newSearcher()` can generate readers like ParallelCompositeReader
which we can't use.  We need to instead use our helper `newIndexSearcher`
2020-03-31 14:35:03 -04:00
Nik Everett e58ad9fed3
Clean up how pipeline aggs check for multi-bucket (backport of #54161) (#54379)
Pipeline aggregations like `stats_bucket`, `sum_bucket`, and
`percentiles_bucket` only operate on buckets that have multiple buckets.
This adds support for those aggregations to `geo_distance`, `ip_range`,
`auto_date_histogram`, and `rare_terms`.

This all happened because we used a marker interface to mark compatible
aggs, `MultiBucketAggregationBuilder` and it was fairly easy to forget
to implement the interface.

This replaces the marker interface with an abstract method in
`AggregationBuilder`, `bucketCardinality` which makes you return `NONE`,
`ONE`, or `MANY`. The `bucket` aggregations can check for `MANY`. At
this point `ONE` and `NONE` amount to about the same thing, but I
suspect that'll be a useful distinction when validating bucket sorts.

Closes #53215
2020-03-30 10:44:55 -04:00
Nik Everett b9bfba2c8b
Move pipeline agg validation to coordinating node (backport of #53669) (#54019)
This moves the pipeline aggregation validation from the data node to the
coordinating node so that we, eventually, can stop sending pipeline
aggregations to the data nodes entirely. In fact, it moves it into the
"request validation" stage so multiple errors can be accumulated and
sent back to the requester for the entire request. We can't always take
advantage of that, but it'll be nice for folks not to have to play
whack-a-mole with validation.

This is implemented by replacing `PipelineAggretionBuilder#validate`
with:
```
protected abstract void validate(ValidationContext context);
```

The `ValidationContext` handles the accumulation of validation failures,
provides access to the aggregation's siblings, and implements a few
validation utility methods.
2020-03-23 17:22:56 -04:00
Alan Woodward 71b703edd1 Rename AtomicFieldData to LeafFieldData (#53554)
This conforms with lucene's LeafReader naming convention, and
matches other per-segment structures in elasticsearch.
2020-03-17 12:30:12 +00:00
Nik Everett 9dcd64c110
Preserve metric types in top_metrics (backport of #53288) (#53440)
This changes the `top_metrics` aggregation to return metrics in their
original type. Since it only supports numerics, that means that dates,
longs, and doubles will come back as stored, with their appropriate
formatter applied.
2020-03-12 17:17:09 -04:00
Nik Everett 28df7ae5ed
Support multiple metrics in `top_metrics` agg (backport of #52965) (#53163)
This adds support for returning multiple metrics to the `top_metrics`
agg. It looks like:
```
POST /test/_search?filter_path=aggregations
{
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": [
          {"field": "v"},
          {"field": "m"}
        ],
        "sort": {"s": "desc"}
      }
    }
  }
}
```
2020-03-05 08:12:01 -05:00
Nik Everett 609c61f75c
Formalize usage stats for analytics (backport of #52966) (#53077)
This moves the usage statistics gathering from the `AnalyticsPlugin`
into an `AnalyicsUsage`, removing the static state. It also checks the
license level when parsing all analytics aggregations. This is how we
were checking them before but we did it in an easy to forget way. This
way is slightly simpler, I think.
2020-03-04 10:29:11 -05:00
Nik Everett 1d1956ee93
Add size support to `top_metrics` (backport of #52662) (#52914)
This adds support for returning the top "n" metrics instead of just the
very top.

Relates to #51813
2020-02-27 16:12:52 -05:00
Nik Everett bfaa487757
Switch pipeline agg parsing to ContextParser (#52776) (#52832)
We've pretty well settled on `ContextParser` for a generic interface to
`ObjectParser`-like-things. This switches the interface used for
building parsing pipeline aggregations to `ContextParser` which saves a
couple of little wrappers around `ObjectParser`.
2020-02-26 12:57:20 -05:00
Nik Everett ed957f35a9
Cover missing case in top_metrics test (#52517)
The top_metrics test assumed that it'd never end up *only* reducing
unmapped results. But, rarely, it does. This handles that case in the
test.

Closes #52462
2020-02-21 09:49:17 -05:00
Nik Everett 8796cdce4b
Modernize boxplot's parser (backport of #52361) (#52372)
Uses a newer way to build `ObjectParser` for in `boxplot` that allows us
to drop a mostly ceremonial method.
2020-02-19 09:20:49 -05:00
Nik Everett 146def8caa
Implement top_metrics agg (#51155) (#52366)
The `top_metrics` agg is kind of like `top_hits` but it only works on
doc values so it *should* be faster.

At this point it is fairly limited in that it only supports a single,
numeric sort and a single, numeric metric. And it only fetches the "very
topest" document worth of metric. We plan to support returning a
configurable number of top metrics, requesting more than one metric and
more than one sort. And, eventually, non-numeric sorts and metrics. The
trick is doing those things fairly efficiently.

Co-Authored by: Zachary Tong <zach@elastic.co>
2020-02-14 11:19:11 -05:00
Igor Motov a66988281f
Add histogram field type support to boxplot aggs (#52265)
Add support for the histogram field type to boxplot aggs.

Closes #52233
Relates to #33112
2020-02-13 18:09:26 -05:00
Nik Everett 2dac36de4d
HLRC support for string_stats (#52163) (#52297)
This adds a builder and parsed results for the `string_stats`
aggregation directly to the high level rest client. Without this the
HLRC can't access the `string_stats` API without the elastic licensed
`analytics` module.

While I'm in there this adds a few of our usual unit tests and
modernizes the parsing.
2020-02-12 19:25:05 -05:00
Hendrik Muhs 098380e483 Percentiles aggregation validation checks for range (#51871)
disallow to specify percentile out of range [0,100]. This also fixes a problem in transform by failing
validation if an invalid percentile configuration is used.
2020-02-11 17:25:39 +01:00
Igor Motov 667e1a5225
Add Boxplot Aggregation (#52174)
Adds a `boxplot` aggregation that calculates min, max, medium and the first
and the third quartiles of the given data set.

Closes #33112
2020-02-11 09:38:17 -05:00
Ignacio Vera ababd730f6
Histogram field: Use #name() instead of #simpleName() when generating doc values (#51920) (#51927) 2020-02-05 12:35:49 +01:00
Ryan Ernst 21224caeaf Remove comparison to true for booleans (#51723)
While we use `== false` as a more visible form of boolean negation
(instead of `!`), the true case is implied and the true value does not
need to explicitly checked. This commit converts cases that have slipped
into the code checking for `== true`.
2020-01-31 16:35:43 -08:00
Nik Everett 4ff314a9d5
Begin moving date_histogram to offset rounding (take two) (#51271) (#51495)
We added a new rounding in #50609 that handles offsets to the start and
end of the rounding so that we could support `offset` in the `composite`
aggregation. This starts moving `date_histogram` to that new offset.

This is a redo of #50873 with more integration tests.

This reverts commit d114c9db3e1d1a766f9f48f846eed0466125ce83.
2020-01-27 13:40:54 -05:00
Nik Everett 788836ea3f
Revert "Begin moving date_histogram to offset rounding (backport of #50873) (#50978)" (#51239)
This reverts commit 9a3d4db840. It was
subtly broken in ways we didn't have tests for.
2020-01-21 08:50:02 -05:00
Nik Everett 9a3d4db840
Begin moving date_histogram to offset rounding (backport of #50873) (#50978)
We added a new rounding in #50609 that handles offsets to the start and
end of the rounding so that we could support `offset` in the `composite`
aggregation. This starts moving `date_histogram` to that new offset.
2020-01-14 16:50:27 -05:00
Nik Everett ae40e22452
Drop "funny" functions building parsers (#50715) (#50814)
Replaces the "funny"
`Function<String, ConstructingObjectParser<T, Void>>` with a much
simpler `ConstructingObjectParser<T, String>`. This makes pretty much
all of our object parsers static.
2020-01-09 15:53:03 -05:00
Adrien Grand 31158ab3d5
Add per-field metadata. (#50333)
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.

In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
 - keys can't be longer than 20 chars,
 - values can only be numbers or strings of no more than 50 chars - no inner
   arrays or objects,
 - the metadata can't have more than 5 keys in total.

Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.

Here is how the meta might look like in mappings:

```json
{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}
```

And then in the field capabilities response:

```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}
```

When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:

```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}
```

Closes #33267
2020-01-08 16:21:18 +01:00
Zachary Tong fec882a457 Decouple pipeline reductions from final agg reduction (#45796)
Historically only two things happened in the final reduction:
empty buckets were filled, and pipeline aggs were reduced (since it
was the final reduction, this was safe).  Usage of the final reduction
is growing however.  Auto-date-histo might need to perform
many reductions on final-reduce to merge down buckets, CCS
may need to side-step the final reduction if sending to a
different cluster, etc

Having pipelines generate their output in the final reduce was
convenient, but is becoming increasingly difficult to manage
as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction by
introducing a new "top level" reduce, which should be called
at the beginning of the reduce cycle (e.g. from the SearchPhaseController).
This will only reduce pipeline aggs on the final reduce after
the non-pipeline agg tree has been fully reduced.

By separating pipeline reduction into their own set of methods,
aggregations are free to use the final reduction for whatever
purpose without worrying about generating pipeline results
which are non-reducible
2019-12-05 16:11:54 -05:00
Ignacio Vera 44e94555ee
Add reusable HistogramValue object (#49799) (#49823)
Adds a reusable implementation of HistogramValue so we do not create
an object per document.
2019-12-04 11:51:53 +01:00
Ignacio Vera d9162c1243
Replace usages of XPackPlugin with the LocalStateCompositeXPackPlugin (#49714) (#49722) 2019-11-29 15:47:23 +01:00
Ignacio Vera 326fe7566e
New Histogram field mapper that supports percentiles aggregations. (#48580) (#49683)
This commit adds  a new histogram field mapper that consists in a pre-aggregated format of numerical data to be used in percentiles aggregations.
2019-11-28 15:06:26 +01:00
Mark Tozzi 17358b5af7
(refactor) Extract Empty/Script/Missing ValuesSource behavior to an interface (#48320) (#49330)
This is a pure code rearrangement refactor.  Logic for what specific ValuesSource instance to use for a given type (e.g. script or field) moved out of ValuesSourceConfig and into CoreValuesSourceType (previously just ValueSourceType; we extract an interface for future extensibility).  ValueSourceConfig still selects which case to use, and then the ValuesSourceType instance knows how to construct the ValuesSource for that case.
2019-11-19 16:44:29 -05:00
Christos Soulios d9f0245b10
[7.x] Implement stats aggregation for string terms (#49097)
Backport of #47468 to 7.x

This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

min_length: The length of the shortest term
max_length: The length of the longest term
avg_length: The average length of all terms
distribution: The probability distribution of all characters appearing in all terms
entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.
2019-11-15 14:36:21 +02:00
Rory Hunter c46a0e8708
Apply 2-space indent to all gradle scripts (#49071)
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
2019-11-14 11:01:23 +00:00
Alpar Torok 0a14bb174f Remove eclipse conditionals (#44075)
* Remove eclipse conditionals

We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.

This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change eclipse can import projects, and the visibility rules
are correct e.x. auto compete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.

Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.

This might cause problems when building since Eclipse will probably not
know the right order to build things in so more wirk might be necesarry.
2019-10-03 11:55:00 +03:00
Jim Ferenczi 23bf310c84 Replace the SearchContext with QueryShardContext when building aggregator factories (#46527)
This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them.
The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`.

Relates #46523
2019-09-11 16:43:30 +02:00
Zachary Tong cf8a4171e1
Rename `data-science` plugin to `analytics` (#46133)
Rename `data-science` plugin to `analytics`. 
Also removes enabled flag. Backport of #46092
2019-08-29 12:45:39 -04:00