This adds some new tests to DocumentParserTests to make sure the DocumentParser behaves correctly when dynamically mapping fields. Especially testing that the dynamic setting works when dynamically mapping different field types.
Adds a `fingerprint` token filter which uses Lucene's FingerprintFilter,
and a `fingerprint` analyzer that combines the Fingerprint filter with
lowercasing, stop word removal and asciifolding.
Closes#13325
Don't pass XContentParser to ParseFieldRegistry#lookup
Passing the parser in is not good because we are not parsing anything in the lookup methods, we only need it to retrieve the xcontent location from it so that in case there is an error we emit where we were with the parsing. It is better to pass in the XContentLocation although calling getTokenLocation needs to create a new object at each call. The workaround of passing the Parser in is worse than the original problem.
Cluster health responses have not shown validation errors, which are
retrieved from RoutingTable validations, in any production or testing
instances. The code is unit tested well in this area and any issues are
exposed through the testing infrastructure, so this commit removes
reporting of validation errors in the cluster health response.
Closes#17773Closes#16979
This is what happens when you pull on the "Remove the PROTOTYPEs from
MovAvgModels" string. This removes MovAvgModelStreams in favor of
readNamedWriteable and MovAvgParserMapper in favor of
ParseFieldRegistry<MovAvgModel.AbstractModelParser>.
Relates to #17085
Boolean fields were not handled in `DocumentParser.createBuilderFromFieldType`.
This also improves the logic a bit by falling back to the default mapping of
the given type insteah of hard-coding every case and throws a better exception
than a NPE if no dynamic mappings could be created.
Closes#17879
The Settings class has an enormous amount of methods with variations of
parameters. This change removes all the methods which take multiple
setting names, which were completely unused.
We plan to change every request so that it can support a parentTaskId.
This shrinks EMPTY_TASK_ID, which will be on every request once that change
is made, from 31 bytes to 9 bytes to 1 byte.
This change fixes the lookup during document parsing of whether an
object field is dynamic to handle looking up through parent object
mappers, and also handle when a parent object mapper was created during
document parsing.
closes#17854
Now there aren't any more specialized methods to read or write
NamedWriteables! Now StreamInput and StreamOutput don't need to know
about large chunks of the application!
In Elasticsearch 5.0.0, by default unquoted field names in JSON will be
rejected. This can cause issues, however, for documents that were
already indexed with unquoted field names. To alleviate this, a system
property has been added that can be enabled so migration can occur.
This system property will be removed in Elasticsearch 6.0.0
Resolves#17674
When I pulled on the thread that is "Remove PROTOTYPEs from
SignificanceHeuristics" I ended up removing SignificanceHeuristicStreams
and replacing it with readNamedWriteable. That seems like a lot at once
but it made sense at the time. And it is what we want in the end, I think.
Anyway, this also converts registration of SignificanceHeuristics to
use ParseFieldRegistry to make them consistent with Queries, Aggregations
and lots of other stuff.
Adds a new and wonderous hack to support serialization checking of
NamedWriteables registered by plugins!
Related to #17085
Extracts all the replication logic that is done on the Primary to a separated class called ReplicationOperation. The goal
here is to make unit testing of this logic easier and in the future allow setting up tests that work directly on IndexShards
without the need for networking.
Closes#16492
Removes deprecated registration methods from SearchModule and
NamedWriteableRegistry and removes the "shims" used to migrate
aggregations to the new registration methods.
Relates to #17085
* Added an extra `field` parameter to the `percolator` query to indicate what percolator field should be used. This must be an existing field in the mapping of type `percolator`.
* The `.percolator` type is now forbidden. (just like any type that starts with a `.`)
This only applies for new indices created on 5.0 and later. Indices created on previous versions the .percolator type is still allowed to exist.
The new `percolator` field type isn't active in such indices and the `PercolatorQueryCache` knows how to load queries from these legacy indices.
The `PercolatorQueryBuilder` will not enforce that the `field` parameter is of type `percolator`.
This commit refactors the UUID-generating methods out of Strings into
their own class. The primary motive for this refactoring is to avoid a
chain of class initializers from loading this class earlier than
necessary. This was discovered when it was noticed that starting
Elasticsearch without any active network interfaces leads to some
logging statements being executed before logging had been
initailized. Thus:
- these UUID methods have no place being on Strings
- removing them reduces spooky action-at-distance loading of this class
- removed the troublesome, logging statements from MacAddressProvider,
logging using statically-initialized instances of ESLogger are prone
to this problem
Relates #17837
We have this TransportAddressSerializers that works similarly to
NamedWriteables except it uses shorts instead of streams. I don't know
enough to propose removing it in favor of NamedWriteables to I just ported
it to using Writeable.Reader and left it alone.
Relates to #17085
This test should demonstrate that a single (larger)
request is processed but on of multiple large concurrent requests
is rejected. This test broke too early under some circumstances in
network mode as the limit is quite low.
With this commit we reduce the size of the individual large
requests but issue more concurrent ones thus increasing stability
of this test.
We advertise in our documentation that byte units are like `kb`, `mb`... But we actually only support the simple notation `k` or `m`.
This commit adds support for the documented form and keeps the non documented options to avoid any breaking change.
It also adds support for `micros`, `nanos` and `d` as a time unit in `_cat` API.
Remove the support for `b` as a SizeValue unit. Actually, for numbers, when using raw numbers without unit, there is no text to add/parse after the number. For example, you don't write `10` as `10b`. We support option like `size=` in `_cat` API which means that we want to display raw data without unit (singles).
Documentation updated accordingly.
Add test for the empty size option.
Fix missing TimeValues options for some cat APIs
The queryShardContext we create during setup was sometimes
accessed directly, sometimes by making a copy through
the createShardContext() helper. This should be the default.
Also making sure that strict parsing is switched on via
IndexSettings in the test testup.
This change cleans up a few things in QueryParseContext and QueryShardContext
that make it hard to reason about the state of these context objects and are
thus error prone and should be simplified.
Currently the parser that used in QueryParseContext can be set and reset any
time from the outside, which makes reasoning about it hard. This change makes
the parser a mandatory constructor argument removes ability to later set a
different ParseFieldMatcher. If none is provided at construction time, the
one set inside the parser is used. If a ParseFieldMatcher is specified at
construction time, it is implicitely set on the parser that is beeing used.
The ParseFieldMatcher is only kept inside the parser.
Also currently the QueryShardContext historically holds an inner QueryParseContext
(in the super class QueryRewriteContext), that is mainly used to hold the parser
and parseFieldMatcher. For that reason, the parser can also be reset, which leads
to the same problems as above. This change removes the QueryParseContext from
QueryRewriteContext and removes the ability to reset or retrieve it from the
QueryShardContext. Instead, `QueryRewriteContext#newParseContext(parser)` can be
used to create new parse contexts with the given parser from a shard context
when needed.
Currently we are able to set a ParseFieldMatcher on XContentParsers,
mainly to conveniently carry it around to be available where the
actual parsing happens. This was just recently introduced together
with ObjectParser so that ObjectParser can make use of deprecation
logging and throwing errors while parsing.
This however is trappy because we create parsers in so many places in
the code and it is easy to forget setting the right ParseFieldMatcher.
Instead we should hold the ParseFieldMatcher only in the parse contexts
(e.g. QueryParseContext).
This PR removes the ParseFieldMatcher from XContentParser. ObjectParser
can still make use of it because we can make the otherwise unbounded
`context` type to extend an interface that makes sure contexts used in
ObjectParser can supply a ParseFieldMatcher. Contexts in ObjectParser
are now no longer optional, but it is sufficient to pass in a small
lambda expression in places where no other context is available.
Relates to #17417
and remove its PROTOTYPE. This is the first aggregation builder that
serializes its targetValueType so ValuesSourceAggregatorBuilder had to
grow support for that.
Relates to #17085
The current api allows for choosing which "case" response json keys are
written in. This has the options of camelCase or underscore. camelCase
is going to be deprecated from the query apis. However, with the case
api, it is not necessary to deprecate, as users who were using it in 2.x
can transition completely on 2.x before upgrading by simply using
the underscore option.
This change removes the 'case' option from rest apis.
see #8988
During the bulk action a hierachy of tasks is getting created: bulk->bulk[s] (coord node) -> bulk[s] (primary shard node) -> bulk[s][p] and bulk[s][r]. Due to a bug the first bulk[s] task didn't have bulk's task id is set as a parent id. This commit fixes this bug.
Handling of the current path when parsing a document is very sensitive.
This fixes a subtle bug in array parsing, where the path that was added
by parsing an array would not be cleared. It also adds a hard state
check at the end of parsing to ensure we ended with a clean path.
The doc parser uses a context object to store the state of parsing,
namely the existing mappings, new mappings, and the parsed document.
Currently this uses a threadlocal which is "reset" for each doc parsed.
However, the thread local doesn't actually save anything, since
resetting is constructing new objects. This change removes the thread
local, which also simplifies the mapper service as it now does not need
to be closeable.
In 2.0 we began restricting fields to not contains dots in their names.
This change adds back part of dots in fieldnames support. Specifically,
it allows indexing documents that contain dots in the field names, when
the correct corresponding mappers exist. For example, if mappings
contain an object field `foo`, and a subfield `bar`, then indexing a
document with `foo.bar` will work.
see #15951
This commit really reverts the inadvertent removal of allowing duplicate
calls to Node#start to be a no-op (but was mistakenly restored to
Node#stop in ddfa3a661510f25c2ce431dfd6fb86ac11eb8888).
This makes all numeric fields including `date`, `ip` and `token_count` use
points instead of the inverted index as a lookup structure. This is expected
to perform worse for exact queries, but faster for range queries. It also
requires less storage.
Notes about how the change works:
- Numeric mappers have been split into a legacy version that is essentially
the current mapper, and a new version that uses points, eg.
LegacyDateFieldMapper and DateFieldMapper.
- Since new and old fields have the same names, the decision about which one
to use is made based on the index creation version.
- If you try to force using a legacy field on a new index or a field that uses
points on an old index, you will get an exception.
- IP addresses now support IPv6 via Lucene's InetAddressPoint and store them
in SORTED_SET doc values using the same encoding (fixed length of 16 bytes
and sortable).
- The internal MappedFieldType that is stored by the new mappers does not have
any of the points-related properties set. Instead, it keeps setting the index
options when parsing the `index` property of mappings and does
`if (fieldType.indexOptions() != IndexOptions.NONE) { // add point field }`
when parsing documents.
Known issues that won't fix:
- You can't use numeric fields in significant terms aggregations anymore since
this requires document frequencies, which points do not record.
- Term queries on numeric fields will now return constant scores instead of
giving better scores to the rare values.
Known issues that we could work around (in follow-up PRs, this one is too large
already):
- Range queries on `ip` addresses only work if both the lower and upper bounds
are inclusive (exclusive bounds are not exposed in Lucene). We could either
decide to implement it, or drop range support entirely and tell users to
query subnets using the CIDR notation instead.
- Since IP addresses now use a different representation for doc values,
aggregations will fail when running a terms aggregation on an ip field on a
list of indices that contains both pre-5.0 and 5.0 indices.
- The ip range aggregation does not work on the new ip field. We need to either
implement range aggs for SORTED_SET doc values or drop support for ip ranges
and tell users to use filters instead. #17700Closes#16751Closes#17007Closes#11513
This commit contains the following improvements/fixes:
1. Renaming method names and variables to better reflect the purpose
of the method and the semantics of the variable.
2. For deleting indexes, replace the closed parameter passed to the
delete index/store methods with obtaining the index's state from the
IndexSettings that is already passed in.
3. Added tests to the IndexWithShadowReplicaIT suite, some of which
show issues in the shadow replica delete process that are captured in
Github issue 17695.
Closes#17638
When we pass both XContentParser and QueryParseContext to a method this can be trappy because
we cannot make sure that the parser contained in the context and the parser passed as an argument
are the same.
This removes the parser argument from methods where we currently have both the parser and the parse
context as arguments and instead retrieves the parse from the context inside the method.
The change adds a new option to the geo_* queries: ignore_unmapped. If this option is set to false, the toQuery method on the QueryBuilder will throw an exception if the field specified in the query is unmapped. If the option is set to true, the toQuery method on the QueryBuilder will return a MatchNoDocsQuery. The default value is false so the queries work how they do today (throwing an exception on unmapped field)
It's use tempted the creation of PROTOTYPEs. The only classes that
legitimately implement a readFrom method are those extending from
Diffable - such behavior is part of cluster state management and
out of scope for the PROTOTYPE cleanup.
When we pass down both parser and QueryParseContext to a method, we cannot
make sure that the parser contained in the context and the parser that is
parsed as an argument have the same state. This removes the parser argument
from methods where we currently have both the parser and the parse context
as arguments and instead retrieves the parse from the context inside the
method.
Since OpenJDK virtual machines have G1 GC but do not have a java.vm.name
that contains HotSpot, this test fails on OpenJDK. Instead, the
java.vm.name condition should be expanded to include OpenJDK virtual
machines.
* Inner hits can now only be provided and prepared via setter in the nested, has_child and has_parent query.
* Also made `score_mode` a required constructor parameter.
* Moved has_child's min_child/max_children validation from doToQuery(...) to a setter.
This commit adds a simple test that JvmInfo is correctly able to
determine whether or not G1 GC is running. As the JvmInfo G1 GC logic is
only applies to HotSpot, the test is constructed to do the same. The
test determines whether or not G1 GC is running by inspecting the test
JVM argument line.
The change adds a new option to the `nested`, `has_parent`, `has_children` and `parent_id` queries: `ignore_unmapped`. If this option is set to false, the `toQuery` method on the QueryBuilder will throw an exception if the type/path specified in the query is unmapped. If the option is set to true, the `toQuery` method on the QueryBuilder will return a MatchNoDocsQuery. The default value is `false`so the queries work how they do today (throwing an exception on unmapped paths/types)
With the previous breaker limit spurious failures on transport level
can happen in the test. With the new limit we ensure that the breaker
breaks due to a too large HTTP request instead.
Previously the sigma variable in the `extended_stats_bucket` pipeline aggregation was not being serialised in `ExtendedStatsBucketPipelineAggregator`. This PR fixes that.
It also corrects the initial value of sumOfSquares to be 0.
Closes#17701
#14259 added a check to honor rebalancing policies (i.e., rebalance only on green state) when moving shards due to changes in allocation filtering rules. The rebalancing policy is there to make sure that we don't try to even out the number of shards per node when we are still missing shards. However, it should not interfere with explicit user commands (allocation filtering) or things like the disk threshold wanting to move shards because of a node hitting the high water mark.
#14259 was done to address #14057 where people reported that using allocation filtering caused many shards to be moved at once. This is however a none issue - with 1.7 (where the issue was reported) and 2.x, we protect recovery source nodes by limitting the number of concurrent data streams they can open (i.e., we can have many recoveries, but they will be throttled). In 5.0 we came up with a simpler and more understandable approach where we have a hard limit on the number of outgoing recoveries per node (on top of the incoming recoveries we already had).
This commit removes the last remaining uses of JAVA_OPTS. Now searching
the codebase for the regex '(?<!ES_)JAVA_OPTS' only shows the uses
warning of its removal and the note about it in the migration docs.
* Tokens in the same position are grouped into a SynonymQuery..
* The default operator is applied on tokens in different positions.
* The wildcard is applied to the terms in the last position only.
Fixes#2183
This commit modifies the boostrap checks to output all failing checks
instead of early-failing on the first check and then possibly failing
again after the user resolves the first failing check.
Closes#17474
This commit moves the execution of the bootstrap checks to after network
services are started. This gives us the flexibility to not merely check
if any of the network settings are set, but instead be smarter about it
and check if the network settings are set in a way that means that the
node will be communicating over an external network (either directly, or
via a proxy). As an bonanza, executing the bootstrap checks in this way
enables us to have the node name in the logs!
Closes#17570
If ts=0, cat health disable epoch and timestamp
Be Constant String timestamp and epoch
Move timestamp and epoch to Table
Add rest-api test and test
Closes#10109
With this commit we limit the size of all in-flight requests on
HTTP level. The size is guarded by the same circuit breaker that
is also used on transport level. Similarly, the size that is used
is HTTP content length.
Relates #16011
With this commit we limit the size of all in-flight requests on
transport level. The size is guarded by a circuit breaker and is
based on the content size of each request.
By default we use 100% of available heap meaning that the parent
circuit breaker will limit the maximum available size. This value
can be changed by adjusting the setting
network.breaker.inflight_requests.limit
Relates #16011
The cluster reroute API had a copy of NamedWriteableRegistry's behavior
inside it in the form of AllocationCommands#registerFactory and
AllocationCommands#lookupFactorySafe. There isn't a reason to duplicate
that effort. So this replaces all of AllocationCommand#Factory with
query-like registration in NetworkModule. Why NetworkModule? Because
the transport client needs it.
Makes Writeable not depend on StreamableReader. Keeps the default readFrom
implementation for backwards compatibility during the PROTOTYPE removal
but that'll go when those are gone.
Makes Diffable not extend StreamableReader. Instead Diffable has a readFrom
method. The PROTOTYPE removal will not get to cluster state for a long time
so that method will stay.
Now only a few things implement StreamableReader. They will be addressed
individually and then we'll remove StreamableReader.
When an index is recovered from disk it's metadata is imported first and the master reaches out to the nodes looking for shards of that index. Sometimes those requests reach other nodes before the cluster state is processed by them. At the moment, that situation disables the checking of the store, which requires the meta data (indices with custom path need to know where the data is). When corruption hits this means we may assign a shard to node with corrupted store, which will be caught later on but causes confusion. Instead we can try loading the meta data from disk in those cases.
Relates to #17630
When it comes to query parsing, either a field is tokenized and it would go
through analysis with its search_analyzer. Or it is not tokenized and the
raw string should be passed to termQuery(). Since numeric fields are not
tokenized and also declare a search analyzer, values would currently go through
analysis twice...
We have a couple places in the code base that assume that search is always done
on the inverted index. However with the new points API in Lucene 6, this is not
true anymore. This commit makes MappedFieldType.indexedValueForSearch protected
and fixes call sites to keep working for field types that use the inverted
index and either work differently ar throw an exception otherwise. For instance,
it will still be possible to run cross_fields multi match queries on numeric
fields, but the score contributions will not be blended as well as before, and
significant terms aggregations on long terms will not be possible anymore since
points do not record document frequencies.
This commit removes `MappedFieldType.value` and simplifies
`MappedFieldType.valueforSearch`. `valueforSearch` was used to post-process
values that come for stored fields (eg. to convert a long back to a string
representation of a date in the case of a date field) and also values that
are extracted from the source but only in the case of GET calls: it would
not be called when performing source filtering on search requests.
`valueforSearch` is now only called for stored fields, since values that are
extracted from the source should already be formatted as expected.
In both cases, what elasticsearch is really interested in is whether the field
is an analyzed string field. So it can just check `tokenized()` instead.
* upgrades numerics to new Point format
* updates geo api changes
* adds GeoPointDistanceRangeQuery as XGeoPointDistanceRangeQuery
* cuts over to ES GeoHashUtils
TL;DR This commit should not have any impact on terms aggs, it will just make
supporting ipv6 easier.
Currently only the numeric terms aggs propagate the DocValueFormat instance since
we use numerics to represent also dates or ip addresses. Since string terms aggs
are only used for text/keyword/string fields, they do not use the format and just
call toUt8String(). However when we support ipv6, ip addresses as well will be
encoded in sorted doc values (just like strings) so we will need to use the
DocValueFormat to format the keys.
CBOR is natively supported in Elasticsearch and allows for byte arrays.
This means, that by using CBOR the user can prevent base64 conversions
for the data being sent back and forth.
This PR adds support to extract data from a byte array in addition to
a string. This also required to add a ByteArrayValueSource class.
In #17133 we introduce request size limit handling and need a custom
channel implementation. In order to ensure we delegate all methods
it is better to have this channel implement an interface instead of
an abstract base class (so changes on the interface turn into
compile errors).
Relates #17133
Sometimes we get a test failure caused by search contexts left open.
The tests include a stack trace of the call that opened the context
but nothing else about the context. This adds more information about
the context that has been left open like what query it was running,
what shard it targeted, and whether or not it was a scroll.
Relates to #17582
When you implement an ingest factory which implements `Closeable`:
```java
public static final class Factory extends AbstractProcessorFactory<MyProcessor> implements Closeable {
@Override
public void close() throws IOException {
logger.debug("closing my processor factory");
}
}
```
The `close()` method is never called which could lead to some leak threads when we close a node.
The `ProcessorsRegistry#close()` method exists though and seems to do the right job:
```java
@Override
public void close() throws IOException {
List<Closeable> closeables = new ArrayList<>();
for (Processor.Factory factory : processorFactories.values()) {
if (factory instanceof Closeable) {
closeables.add((Closeable) factory);
}
}
IOUtils.close(closeables);
}
```
But apparently this method is never called in `Node#stop()`.
Closes#17625.
We have both `Settings.settingsBuilder` and `Settings.builder` that do exactly
the same thing, so we should keep only one. I kept `Settings.builder` since it
has my preference but also it is the one that we use in examples of the Java API.
* Create one AllField field per field eligible for _all.
* Add a positionIncrementGap (with a size of 100, not configurable) between
each entry in order to distinguish fields when doing phrase query on _all.
This removes the inconsistent output of IP addresses. The format was parsing-unfriendly and it makes it hard
to reason about API responses, such as to _nodes.
With this change in place, it will never print the hostname as part of the default format, which has the
added benefit that it can be used consistently for URIs, which was not the case when the hostname might
appear at the front with "hostname/ip:port".
CORS headers and methods config parameters must be read as arrays. This
commit fixes the issue. It affects http.cors.allow-methods and
http.cors.allow-headers.
Fixes#17483
Aggregations need to perform instanceof calls on MappedFieldType instances in
order to know how they should be parsed or formatted. Instead, we should let
the field types provide a formatter/parser that can can be used.
This change makes the root (/) rest api delegate to a transport action to get the
data for the response. This aligns this rest api with all of the other apis, which
delegate to one or more actions.
In doing this, unit tests were added to provide coverage of the RestMainAction
and the associated classes.
Because sigma is also used at reduce time, it should be passed to empty aggs.
Otherwise it causes bugs when an empty aggregation is used to perform reduction
is it would assume a sigma of zero.
Closes#17362
Today we fail if you try to add a field and an object from another type already
has the same name. However, we do NOT fail if you insert the field first and the
object afterwards. This leads to bad bugs since mappings are not necessarily
parsed in the same order at recovery time, so a mapping update could succeed and
then you would fail to reopen the index.
This commit removes a superfluous check when validing incoming cluster
states. The check in question prevents out-of-order cluster states from
the same master from entering the queue. However, such out-of-order
cluster states will be cleaned from the queue when a commit message for
that cluster state arrives or a commit message for any higher-versioned
cluster state arrives.
This removes PROTOTYPEs from ScoreFunctionsBuilders. To do so we rework
registration so it doesn't need PROTOTYPEs and lines up with the recent
changes to query registration.
This PR fixes a bug where a NPE was thrown when parsing a moving average pipeline aggregation request which did not specify a window size.
Closes#17516
This adds shuffling of xContent similar to #17521 to the aggregation and pipeline aggregation base test.
The additional shuffling uncovered that some aggregation builders internally store some properties in a
way that made the equals() testing fail when the xContent is shuffled.
For TopHitsAggregatorBuilder, the internal scriptFields parameter was changed to a set because the order
they appear in the xContent should not matter. For FiltersAggregatorBuilder, the internal list of KeyedFilters
is sorted by key now. As a side effect, the keys in the aggregation response are now not always in the same
order as the filters in the query, but sorted by key as well (unless they are anonymous).
DiscoveryNode is immutable yet we rebuild DiscoveryNode#toString on
every invocation. Most importantly, this just leads to unnecessary
allocations. This is most germane to ZenDiscovery and the processing of
cluster states where DiscoveryNode#toString is invoked when submitting
update tasks and processing cluster state updates.
Closes#17543