The current heuristic to compute a default shard size is pretty aggressive,
it returns `max(10, number_of_shards * size)` as a value for the shard size.
I think making it less aggressive has the benefit that it would reduce the
likelyness of running into OOME when there are many shards (yearly
aggregations with time-based indices can make numbers of shards in the
thousands) and make the use of breadth-first more likely/efficient.
This commit replaces the heuristic with `size * 1.5 + 10`, which is enough
to have good accuracy on zipfian distributions.
* Update gateway.asciidoc
Added a note to clarify that, in cases where nodes in a cluster have different setting, the node that is the elected master takes precedence over anything else.
* Update gateway.asciidoc
Updated as per @bleskes's comments
Performing the bulk request shown in #19267 now results in the following:
```
{"_index":"test","_type":"test","_id":"1","_version":1,"_operation":"create","forced_refresh":false,"_shards":{"total":2,"successful":1,"failed":0},"status":201}
{"_index":"test","_type":"test","_id":"1","_version":1,"_operation":"noop","forced_refresh":false,"_shards":{"total":2,"successful":1,"failed":0},"status":200}
```
This change adds a new special path to the buckets_path syntax
`_bucket_count`. This new option will return the number of buckets for a
multi-bucket aggregation, which can then be used in pipeline
aggregations.
Closes#19553
This adds new circuit breaking with the "request" breaker, which adds
circuit breaks based on the number of buckets created during
aggregations. It consists of incrementing during AggregatorBase creation
This also bumps the REQUEST breaker to 60% of the JVM heap now.
The output when circuit breaking an aggregation looks like:
```json
{
"shard" : 0,
"index" : "i",
"node" : "a5AvjUn_TKeTNYl0FyBW2g",
"reason" : {
"type" : "exception",
"reason" : "java.util.concurrent.ExecutionException: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: CircuitBreakingException[[request] Data too large, data for [<agg [otherthings]>] would be larger than limit of [104857600/100mb]];",
"caused_by" : {
"type" : "execution_exception",
"reason" : "QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: CircuitBreakingException[[request] Data too large, data for [<agg [myagg]>] would be larger than limit of [104857600/100mb]];",
"caused_by" : {
"type" : "circuit_breaking_exception",
"reason" : "[request] Data too large, data for [<agg [otherthings]>] would be larger than limit of [104857600/100mb]",
"bytes_wanted" : 104860781,
"bytes_limit" : 104857600
}
}
}
}
```
Relates to #14046
With #19140 we started persisting the node ID across node restarts. Now that we have a "stable" anchor, we can use it to generate a stable default node name and make it easier to track nodes over a restarts. Sadly, this means we will not have those random fun Marvel characters but we feel this is the right tradeoff.
On the implementation side, this requires a bit of juggling because we now need to read the node id from disk before we can log as the node node is part of each log message. The PR move the initialization of NodeEnvironment as high up in the starting sequence as possible, with only one logging message before it to indicate we are initializing. Things look now like this:
```
[2016-07-15 19:38:39,742][INFO ][node ] [_unset_] initializing ...
[2016-07-15 19:38:39,826][INFO ][node ] [aAmiW40] node name set to [aAmiW40] by default. set the [node.name] settings to change it
[2016-07-15 19:38:39,829][INFO ][env ] [aAmiW40] using [1] data paths, mounts [[ /(/dev/disk1)]], net usable_space [5.5gb], net total_space [232.6gb], spins? [unknown], types [hfs]
[2016-07-15 19:38:39,830][INFO ][env ] [aAmiW40] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-07-15 19:38:39,837][INFO ][node ] [aAmiW40] version[5.0.0-alpha5-SNAPSHOT], pid[46048], build[473d3c0/2016-07-15T17:38:06.771Z], OS[Mac OS X/10.11.5/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_51/25.51-b03]
[2016-07-15 19:38:40,980][INFO ][plugins ] [aAmiW40] modules [percolator, lang-mustache, lang-painless, reindex, aggs-matrix-stats, lang-expression, ingest-common, lang-groovy, transport-netty], plugins []
[2016-07-15 19:38:43,218][INFO ][node ] [aAmiW40] initialized
```
Needless to say, settings `node.name` explicitly still works as before.
The commit also contains some clean ups to the relationship between Environment, Settings and Plugins. The previous code suggested the path related settings could be changed after the initial Environment was changed. This did not have any effect as the security manager already locked things down.
Add parser for anonymous char_filters/tokenizer/token_filters
Using Settings in AnalyzeRequest for anonymous definition
Add breaking changes document
Closed#8878
Remove `ParseField` constants used for names where there are no deprecated
names and just use the `String` version of the registration method instead.
This is step 2 in cleaning up the plugin interface for extending
search time actions. Aggregations are next.
This is breaking for plugins because those that register a new query should
now implement `SearchPlugin` rather than `onModule(SearchModule)`.
Previously if the size of the search request was greater than zero we would not cache the request in the request cache.
This change retains the default behaviour of not caching requests with size > 0 but also allows the `request_cache=true` query parameter
to enable the cache for requests with size > 0
This is a tentative to revive #15939 motivated by elastic/beats#1941.
Half-floats are a pretty bad option for storing percentages. They would likely
require 2 bytes all the time while they don't need more than one byte.
So this PR exposes a new `scaled_float` type that requires a `scaling_factor`
and internally indexes `value*scaling_factor` in a long field. Compared to the
original PR it exposes a lower-level API so that the trade-offs are clearer and
avoids any reference to fixed precision that might imply that this type is more
accurate (actually it is *less* accurate).
In addition to being more space-efficient for some use-cases that beats is
interested in, this is also faster that `half_float` unless we can improve the
efficiency of decoding half-float bits (which is currently done using software)
or until Java gets first-class support for half-floats.
Today the default precision for the cardinality aggregation depends on how many
parent bucket aggregations it had. The reasoning was that the more parent bucket
aggregations, the more buckets the cardinality had to be computed on. And this
number could be huge depending on what the parent aggregations actually are.
However now that we run terms aggregations in breadth-first mode by default when
there are sub aggregations, it is less likely that we have to run the cardinality
aggregation on kagilions of buckets. So we could use a static default, which will
be less confusing to users.
* Removed `Template` class and unified script & template parsing logic. Templates are scripts, so they should be defined as a script. Unless there will be separate template infrastructure, templates should share as much code as possible with scripts.
* Removed ScriptParseException in favour for ElasticsearchParseException
* Moved TemplateQueryBuilder to lang-mustache module because this query is hard coded to work with mustache only
This should make them easier to read and adds them to the test suite
I changed the example from a two node cluster to a single node cluster
because that is what we have running in the integration tests. It is also
what a user just starting out is likely to see so I think that is ok.
Invocation counts can be used to help judge the selectivity of individual query components in the context of the entire query. E.g. a query may not look selective when run by itself (matches most of the index), but when run in context of a full search request, is evaluated only rarely due to execution order
Since this is modifying the base timing class, it'll enrich both query and agg profiles (as well as future profile results)
Today `node.mode` and `node.local` serve almost the same purpose, they
are a shortcut for `discovery.type` and `transport.type`. If `node.local: true`
or `node.mode: local` is set elasticsearch will start in _local_ mode which means
only nodes within the same JVM are discovered and a non-network based transport
is used. The _local_ mode it only really used in tests or if nodes are embedded.
For both, embedding and tests explicit configuration via `discovery.type` and `transport.type`
should be preferred.
This change removes all the usage of these settings and by-default doesn't
configure a default transport implemenation since netty is now a module. Yet, to make
the user expericence flawless, plugins or modules can set a `http.type.default` and
`transport.type.default`. Plugins set this via `PluginService#additionalSettings()`
which enforces _set-once_ which prevents node startup if set multiple times. This means
that our distributions will just startup with netty transport since it's packaged as a
module unless `transport.type` or `http.transport.type` is explicitly set.
This change also found a bunch of bugs since several NamedWriteables were not registered if a
transport client is used. Now that we don't rely on the `node.mode` leniency which is inherited
instead of using explicit settings, `TransportClient` uses `AssertingLocalTransport` which detects these problems since it serializes all messages.
Closes#16234