Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency matrix.
The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results.
Closes#22169
This PR removes all leniency in the conversion of Strings to booleans: "true"
is converted to the boolean value `true`, "false" is converted to the boolean
value `false`. Everything else raises an error.
Relates to #22024
On top of documentation, the PR adds deprecation loggers and deals with the resulting warning headers.
The yaml test is set exclude versions up to 6.0. This is need to make sure bwc tests pass until this is backported to 5.2.0 . Once that's done, I will change the yaml test version limits
This change makes it possible for custom routing values to go to a subset of shards rather than
just a single shard. This enables the ability to utilize the spatial locality that custom routing can
provide while mitigating the likelihood of ending up with an imbalanced cluster or suffering
from a hot shard.
This is ideal for large multi-tenant indices with custom routing that suffer from one or both of
the following:
- The big tenants cannot fit into a single shard or there is so many of them that they will likely
end up on the same shard
- Tenants often have a surge in write traffic and a single shard cannot process it fast enough
Beyond that, this should also be useful for use cases where most queries are done under the context
of a specific field (e.g. a category) since it gives a hint at how the data can be stored to minimize
the number of shards to check per query. While a similar solution can be achieved with multiple
concrete indices or aliases per value today, those approaches breakdown for high cardinality fields.
A partitioned index enforces that mappings have routing required, that the partition size does not
change when shrinking an index (the partitions will shrink proportionally), and rejects mappings
that have parent/child relationships.
Closes#21585
This commit corrects the grammar in a list in the how-to docs; namely,
the word "and" was missing preceding the final element in a list.
Relates #22663
* Grammatical correction
* Add note for legacy string mapping type
* Update truncate token filter to not mention the keyword tokenizer
The advice predates the existence of the keyword field
Closes#22650
Currently both ProfileResult and CollectorResult print the time field in a human readable string format
(e.g. "time": "55.20315000ms"). When trying to parse this back to a long value, for example to use in
the planned high level java rest client, we can lose precision because of conversion and rounding issues.
This change adds a new additional field (`time_in_nanos`) to the profile response to be able to get the
original time value in nanoseconds back.
The old `time` field is only printed when the `?`human=true` flag in the url is set. This follow the behaviour for
all other stats-related apis. Also the format of the `time` field is slightly changed. Instead of always formatting
the output as a 10-digit ms value, by using the `XContentBuilder#timeValueField()` method we now print
the largest time unit present is used (e.g. "s", "ms", "micros").
For certain situations, end-users need the base path for Elasticsearch
logs. Exposing this as a property is better than hard-coding the path
into the logging configuration file as otherwise the logging
configuration file could easily diverge from the Elasticsearch
configuration file. Additionally, Elasticsearch will only have
permissions to write to the log directory configured in the
Elasticsearch configuration file. This commit adds a property that
exposes this base path.
One use-case for this is configuring a rollover strategy to retain logs
for a certain period of time. As such, we add an example of this to the
documentation.
Additionally, we expose the property es.logs.cluster_name as this is
used as the name of the log files in the default configuration.
Finally, we expose es.logs.node_name in cases where node.name is
explicitly set in case users want to include the node name as part of
the name of the log files.
Relates #22625
This change disables the _all meta field by default.
Now that we have the "all-fields" method of query execution, we can save both
indexing time and disk space by disabling it.
_all can no longer be configured for indices created after 6.0.
Relates to #20925 and #21341Resolves#19784
Today when you do not specify a port for an entry in
discovery.zen.ping.unicast.hosts, the default port is the value of the
setting transport.profiles.default.port and falls back to the value of
transport.tcp.port if this is not set. For a node that is explicitly
bound to a different port than the default port, this means that the
default port will be equal to this explicitly bound port. Yet, the docs
say that we fall back to 9300 here. This commit corrects the docs.
Relates #22568
Instead of `search.remote.seeds.${clustername}` we now specify the seeds as:
`search.remote.${clustername}.seeds` which is a real list setting compared to an unvalidated
group setting before.
Fix the `Dockerfile` example in the `Customizing image` third configuration
method by adding the missing RUN instruction.
Originally reported by Shankar Vasudevan (@vshank77).
Relates #21973
#9261 added a warning about the use of `add-apt-repository` which is becoming obsolete over time as new distribution releases include later versions of `add-apt-repository` which don't automatically add the `deb-src` line. This change updates the documentation to make the block a note rather than a warning and adds two other reasons for avoiding `add-apt-repository` which are still relevant: avoiding edits to a system shared file and not requiring a large number of non-default packages to add one line of text to a file.
The example
```
/<logstash-{now/d-2d}>,<logstash-{now/d-1d}>,<logstash-{now/d}>/_search
```
shows escaped URL where `, => %2C`, so I assume it should be escaped and be present in the table
This commit updates the cluster allocation explain API documentation to
explain the new request parameters and response formats, and gives
examples of the explain API responses under various scenarios.
This commit adds a document describing our data replication model in high level terms. The goal is give people basic insight into how things work in order to better understand how read and writes interact, both during normal operations and under failures.
* Promote longs to doubles when a terms agg mixes decimal and non-decimal number
This change makes the terms aggregation work when the buckets coming from different indices are a mix of decimal numbers and non-decimal numbers. In this case non-decimal number (longs) are promoted to decimal (double) which can result in a loss of precision for big numbers.
Fixes#22232
Provides an example of using is and an example return description
and explains that we've added descriptions for some tasks but not
even close to all of them. And that we expect to change the
descriptions as we learn more.
Closes#22407
* Fix example
Getting a single task is always detailed, no need to specify.
* Rewrite like imotov wants it
After deprecating getters and setters and the query DSL parameter in 5.x,
support for `minimum_number_should_match` can be removed entirely. Also
consolidated comments with the ones on 5.x branch and added an entry to the
migration docs.
The `Script source settings` section currently states that `false` means scripting is ENABLED.
The other sections seem to indicate that `false` means scripting is DISABLED.
If the current documentation is correct, that would imply that `inline` and `stored` scripting are ENABLED by default, which seems to conflict with all the other sections in the document.
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.
Closes#18064
Today we try to pull stats from index writer but we do not get a
consistent view of stats. Under heavy indexing, this inconsistency can
be very skewed indeed. In particular, it can lead to the number of
deleted docs being reported as negative and this leads to serialization
issues. Instead, we should provide a consistent view of the stats by
using an index reader.
Relates #22317
Today we only expose `value_type` in scriptable aggregations, however it is
also useful with unmapped fields. I suspect we never noticed because
`value_type` was not documented (fixed) and most aggregations are scriptable.
Closes#20163
* Repeated language analyzers
The `catalan` analyzer was repeated on the supported list :)
* Reordered the languages to have alphabetic order
* Added space for format
* Reordered the languages and removed repeated
Added a new section detailing how to use the attachment processor
within an array.
This reverts commit #22296 and instead links to the foreach processor.
Our `float`/`double` fields generally assume that `-0` compares less than `+0`,
except when bounds are exclusive: an exclusive lower bound on `-0` excludes
`+0` and an exclusive upper bound on `+0` excludes `-0`.
Closes#22167
Problem: So far all rank eval requests are being executed in parallel. If there
are more than the search thread pool can handle, or if there are other search
requests executed in parallel rank eval can fail.
Solution: Make number of max_concurrent_searches configurable.
Name of configuration parameter is analogous to msearch. Default
max_concurrent_searches set to 10: Rank_eval isn't particularly time critical so
trying to avoid being more clever than probably needed here. Can set this value
through the API to a higher value anytime.
Fixes#21403
Problem: We introduced the ability to shorten the rank eval request by using a
template in #20231. When playing with the API it turned out that there might be
use cases where - e.g. due to various heuristics - folks might want to translate
the original user query into more than just one type of Elasticsearch query.
Solution: Give each template an id that can later be referenced in the
actual requests.
Closes#21257
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).
We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.
Relates elastic/elasticsearch#19614
Relates elastic/elasticsearch#22073
With this commit we change the data type of the 'TIMESTAMP'
meta-data field from a formatted date string to a plain
`java.util.Date` instance. The main reason for this change is
that our benchmarks have indicated that this contributes
significantly to the time spent in the ingest pipeline.
The overhead in terms of indexing throughput of the ingest
pipeline is about 15% and breaks down roughly as follows:
* 5% overhead caused by the conversion from `XContent` -> `Map`
* 5% overhead caused by the timestamp formatting
* 5% overhead caused by the conversion `Map` -> `XContent`
Relates #22074
We try to install a system call filter on various operating systems
(Linux, macOS, BSD, Solaris, and Windows) but the setting
(bootstrap.seccomp) to control this is named after the Linux
implementation (seccomp). This commit replaces this setting with
bootstrap.system_call_filter. For backwards compatibility reasons, we
fallback to bootstrap.seccomp and log a deprecation message if
bootstrap.seccomp is set. We intend to remove this fallback in
6.0.0. Note that now is the time to make this change it's likely that
most users are not making this setting anyway as prior to version 5.2.0
(currently unreleased) it was not necessary to configure anything to
enable a node to start up if the system call filter failed to install
(we marched on anyway) but starting in 5.2.0 it will be necessary in
this case.
Relates #22226
The JSON processor has an optional field called "target_field".
If you don't specify target_field then target_field becomes what you specified as "field".
There isn't anyway to add the fields to the root of a document. By
setting `add_to_root`, now serialized fields will be inserted into the
top-level fields of the ingest document.
Closes#21898.
This commit fixes a silly doc bug where the field that represents the
total CPU time consumed by all tasks in the same cgroup was mistakenly
reported as "usage" instead of "usage_nanos".
Relates #21029
Sends the `error_trace` parameter with all requests sent by the
yaml test framework, including the doc snippet tests. This can be
overridden by settings `error_trace: false`. While this drift's
core's handling of the yaml tests from the client's slightly this
should only be a problem for tests that rely on the default value,
both of which I've fixed by setting the value explicitly.
This also escapes `\n` and `\t` in the `Stash dump on failure` so
the `stack_trace` is more readable.
Also fixes `RestUpdateSettingsAction` to not think of the `error_trace`
parameter as a setting.
* Replace _suggest endpoint to _search in docs
In 5.0, the _suggest endpoint is just sugar for _search
with suggestions specified. Users should move away from
using the _suggest endpoint, as it is marked as deprecated in 5.x and
will be removed in 6.0
* update docs to use _search endpoint instead of _suggest
* Add deprecation logging to RestSuggestAction
* Use search endpoint instead of suggest endpoint in rest tests
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.
As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.
Closes#19614
When using dynamic templates, ES will now throw an exception if a
`match_mapping_type` is used that doesn't correspond to an actual type.
Relates to #17285
Our query DSL supports empty queries (`{}`), which have a different meaning depending on the query that holds it, either ignored, match_all or match_none. We deprecated the support for empty queries in 5.0, where we log a deprecation warning wherever they are used.
The way we supported it once we moved query parsing to the coordinating node was having an Optional<QueryBuilder> return type in all of our parse methods (called fromXContent). See #17624. The central place for this was QueryParseContext#parseInnerQueryBuilder. We can now remove all the optional return types and simply throw an exception whenever an empty query is found.
When we decided to deprecate and remove fuzzy query in #15760, we didn't realize we would take away the possibililty for uses to use a fuzzy query as part of a span query, which is not possible using match query. This means we have to go back and un-deprecate fuzzy query, which will not be removed.
Closes#15760
This change allows specifying alias/wildcard expression in indices_boost.
And added another format for specifying indices_boost. It accepts array of index name and boost pair.
If an index is included in multiple aliases/wildcard expressions, the first match will be used.
With new format, old format is marked as deprecated.
Closes#4756
The documentation reads:
> You can disable this behavior by setting "detect_noop": false like this:
Followed by a code example, that originally set `"detect_noop": true`.
Please correct me if I got the change backwards (i.e. the paragraph should be changed to `true`), but this seems like it makes the most sense.
In 5.0, the search slow log switched to the multi-line format with no option to get back to the origin single-line format that was used prior to 5.0 by default. This commit removes the reformat option from the search slow log and returns the search slow log back to the single-line format.
Closes#21711
When overriding a systemd configuration via a drop-in file, the
[Service] header is required. This commit adds this to an example
drop-in override in the configuring docs.
Relates #22038
The use of the avg aggregation for sorting the terms aggregation is not encouraged since it has unbounded error. This changes the examples to use the max aggregation which does not suffer the same issues
* Include unindexed field in FieldStats response
This change adds non-searchable fields to the FieldStats response. These fields do not have min/max informations but they can be aggregatable. Fields that are only stored in _source (store:no, index:no, doc_values:no) will still be missing since they do not have any useful information to show. Indices and clients must be at least on V_5_2_0 to see this change.
Changes the default socket and connection timeouts for the rest
client from 10 seconds to the more generous 30 seconds.
Defaults reindex-from-remote to those timeouts and make the
timeouts configurable like so:
```
POST _reindex
{
"source": {
"remote": {
"host": "http://otherhost:9200",
"socket_timeout": "1m",
"connect_timeout": "10s"
},
"index": "source",
"query": {
"match": {
"test": "data"
}
}
},
"dest": {
"index": "dest"
}
}
```
Closes#21707
Today if system call filters fail to install on startup, we log a
message but otherwise march on. This might leave users without system
call filters installed not knowing that they have implicitly accepted
the additional risk. We should not be lenient like this, instead clearly
informing the user that they have to either fix their configuration or
accept the risk of not having system call filters installed. This commit
adds a bootstrap check that if system call filters are enabled, they
must successfully install.
Relates #21940
The reference for the jvm.options docs recently changed from
es-java-opts to jvm-options. This commit fixes a broken reference that
arose as a result of this change.
Elasticsearch can be run in a few different ways:
- from the command line on Linux and Windows
- as a service on Linux and Windows
on both 32-bit client and 64-bit server VMs. We strive for a great
out-of-the-box experience any of these combinations but today it is
lacking on 32-bit client JVMs and on the Windows service. There are two
deficiencies that arise:
- on any 32-bit client JVM we fail to start out of the box because we
force the server JVM in jvm.options
- when installing the Windows service, the thread stack size must be
specified in jvm.options
This commit attempts to address these deficiencies.
We should continue to force the server JVM because there are systems
where the server JVM is not active by default (e.g., the 32-bit JDK on
Windows). This does mean that if a user tries to run with a client JVM
they will see a failure message at startup but this is the best that we
can do if we want to continue to force the server JVM. Thus, this commit
at least documents this situation.
To improve the situation with installing the Windows service, this
commit adds a default setting for the thread stack size. This default is
chosen based on the default thread stack size across all 64-bit server
JVMs. This means that if a user tries to run with a 32-bit JVM they
could otherwise see significantly higher memory usage (this situation is
complicated, it's really only on Windows where the extra memory usage is
egregious, but cutting into the 32-bit address space on any system is
bad). So this commit makes it so that the out-of-the-box experience is
improved for the Windows service on 64-bit server JVMs and we document
the need to adjust this setting on 32-bit JVMs.
Again, we are focusing on the out-of-the-box experience here and this
means optimizing for the best experience on any 64-bit server JVM as
this covers the vast majority of the user base. The users that are on
32-bit JVMs will suffer a little bit but at least now any user on any
64-bit server JVM can start Elasticsearch out of the box.
Finally, we fix some references to the jvm.options documentation.
Relates #21920
During package install on systemd-based systems, we try to set
vm.max_map_count. On some systems (e.g., containers), users do not have
the ability to tune these parameters from within the container. This
commit provides an option for these users to skip setting such kernel
parameters.
Relates #21899
It used to be a hybrid store between `niofs` and `mmapfs`, which we removed when
we switched to `fs` by default (which is `mmapfs` on 64-bits systems).
Currently, the `terms` query is just syctactic sugar for a `bool` query when
used in a query context. This change proposes to always generate the same query
in query and filter contexts, which is less confusing.
For the record, I also had to remove the geo-hash cell and geo-distance range
queries to make the code compile. These queries already throw an exception in
all cases with 5.x indices, so that does not hurt any more.
I also had to rename all 2.x bwc indices from `index-${version}` to
`unsupported-${version}` to make `OldIndexBackwardCompatibilityIT`
happy.
These query names were all deprecated in 5.0.0:
- in is removed in favour of terms
- geo_bbox is removed in favour of geo_bounding_box
- mlt is removed in favour of more_like_this
- fuzzy_match and match_fuzzy are removed in favour of match
Set lucene version to 6.4.0-snapshot-ec38570 and update all the sha1s/license
Fix invalid combo after upgrade in query_string query. split_on_whitespace=false is disallowed if auto_generate_phrase_queries=true
Adapt the expectations of some tests to the new format of the Lucene explain output
Lucene 6.2 added index and query support for numeric ranges. This commit adds a new RangeFieldMapper for indexing numeric (int, long, float, double) and date ranges and creating appropriate range and term queries. The design is similar to NumericFieldMapper in that it uses a RangeType enumerator for implementing the logic specific to each type. The following range types are supported by this field mapper: int_range, float_range, long_range, double_range, date_range.
Lucene does not provide a DocValue field specific to RangeField types so the RangeFieldMapper implements a CustomRangeDocValuesField for handling doc value support.
When executing a Range query over a Range field, the RangeQueryBuilder has been enhanced to accept a new relation parameter for defining the type of query as one of: WITHIN, CONTAINS, INTERSECTS. This provides support for finding all ranges that are related to a specific range in a desired way. As with other spatial queries, DISJOINT can be achieved as a MUST_NOT of an INTERSECTS query.
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.
This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.
NOTE: The result of `?.` and `?:` can't be assigned to primitives. So
`int[] someArray = null; int l = someArray?.length` and
`int s = params.size ?: 100` don't work. Do
`def someArray = null; def l = someArray?.length` and
`def s = params.size ?: 100` instead.
Relates to #21748
* Scripting: Remove groovy scripting language
Groovy was deprecated in 5.0. This change removes it, along with the
legacy default language infrastructure in scripting.
The `error_trace` parameter turns on the `stack_trace` field
in errors which returns stack traces.
Removes documentation for `camelCase` because it hasn't worked
in a while....
Documents the internal parameters used to render stack traces as
internal only.
Closes#21708
Add indices and filter information to search shards api output
The search shards api returns info about which shards are going to be hit by executing a search with provided parameters: indices, routing, preference. Indices can also be aliases, which can also hold filters. The output includes an array of shards and a summary of all the nodes the shards are allocated on. This commit adds a new indices section to the search shards output that includes one entry per index, where each index can be associated with an optional filter in case the index was hit through a filtered alias.
This is relevant since we have moved parsing of alias filters to the coordinating node.
Relates to #20916
Today we eagerly resolve unicast hosts. This means that if DNS changes,
we will never find the host at the new address. Moreover, a single host
failng to resolve causes startup to abort. This commit introduces lazy
resolution of unicast hosts. If a DNS entry changes, there is an
opportunity for the host to be discovered. Note that under the Java
security manager, there is a default positive cache of infinity for
resolved hosts; this means that if a user does want to operate in an
environment where DNS can change, they must adjust
networkaddress.cache.ttl in their security policy. And if a host fails
to resolve, we warn log the hostname but continue pinging other
configured hosts.
When doing DNS resolutions for unicast hostnames, we wait until the DNS
lookups timeout. This appears to be forty-five seconds on modern JVMs,
and it is not configurable. If we do these serially, the cluster can be
blocked during ping for a lengthy period of time. This commit introduces
doing the DNS lookups in parallel, and adds a user-configurable timeout
for these lookups.
Relates #21630
You can use `Debug.explain(someObject)` in painless to throw an
`Error` that can't be caught by painless code and contains an
object's class. This is useful because painless's sandbox doesn't
allow you to call `someObject.getClass()`.
Closes#20263
The `type` parameter has always been accepted by the search_shards api, probably to make the api and its urls the same as search. Truth is that the type never had any effect, it's been ignored from day one while accepting it may make users think that we actually do something with it.
This commit removes support for the type parameter from the REST layer and the Java API. Backwards compatibility is maintained on the transport layer though.
The new added serialization test also uncovered a bug in the java API where the `ClusterSearchShardsRequest` could be created with no arguments, but the indices were required to be not null otherwise the request couldn't be serialized as `writeTo` would throw NPE. Fixed by setting a default value (empty array) for indices.
As part of #20925 and #21341 we added an "all-fields" mode to the
`query_string` and `simple_query_string`. This would expand the query to
all fields and automatically set `lenient` to true.
However, we should still allow a user to override the `lenient` flag to
whichever value they desire, should they add it in the request. This
commit does that.
Implements a null coalescing operator in painless that looks like `?:`. This form was chosen to emulate Groovy's `?:` operator. It is different in that it only coalesces null values, instead of Groovy's `?:` operator which coalesces all falsy values. I believe that makes it the same as Kotlin's `?:` operator. In other languages this operator looks like `??` (C#) and `COALESCE` (SQL) and `:-` (bash).
This operator is lazy, meaning the right hand side is only evaluated at all if the left hand side is null.
The first changed referred to an example of the 2.4 documentation. I removed the no longer relevant parts. We should consider adding a little more here.
The second change was just then->than in the suggest_mode popular section
* Reference documentation for rank evaluation API
This adds a first page of reference documentation to the current state of the
rank evaluation API.
Closes to #21402
* Add default values for precision metric.
Add information on default relevant_rating_threshold and ignore_unlabeled
settings.
Relates to #21304
* Move under search request docs, fix formatting
Also removes some detail where it seemed unneeded for reference docs
* master: (22 commits)
Add proper toString() method to UpdateTask (#21582)
Fix `InternalEngine#isThrottled` to not always return `false`. (#21592)
add `ignore_missing` option to SplitProcessor (#20982)
fix trace_match behavior for when there is only one grok pattern (#21413)
Remove dead code from GetResponse.java
Fixes date range query using epoch with timezone (#21542)
Do not cache term queries. (#21566)
Updated dynamic mapper section
Docs: Clarify date_histogram bucket sizes for DST time zones
Handle release of 5.0.1
Fix skip reason for stats API parameters test
Reduce skip version for stats API parameter tests
Strict level parsing for indices stats
Remove cluster update task when task times out (#21578)
[DOCS] Mention "all-fields" mode doesn't search across nested documents
InternalTestCluster: when restarting a node we should validate the cluster is formed via the node we just restarted
Fixed bad asciidoc in boolean mapping docs
Fixed bad asciidoc ID in node stats
Be strict when parsing values searching for booleans (#21555)
Fix time zone rounding edge case for DST overlaps
...
There is an issue in the Grok Processor, where trace_match: true does not inject the _ingest._grok_match_index into the ingest-document when there is just one pattern provided. This is due to an optimization in the regex construction. This commit adds a check for when this is the case, and injects a static index value of "0", since there is only one pattern matched (at the first index into the patterns).
To make this clearer, more documentation was added to the grok-processor docs.
Fixes#21371.
This changes only the query parsing behavior to be strict when searching on
boolean values. We continue to accept the variety of values during index time,
but searches will only be parsed using `"true"` or `"false"`.
Resolves#21545
Today when parsing a stats request, Elasticsearch silently ignores
incorrect metrics. This commit removes lenient parsing of stats requests
for the nodes stats and indices stats APIs.
Relates #21417
We log deprecation events at "WARN", so setting it to `info` means the events
are still logged. It must be set to `error` in order to disable the logging.
This failure is due to the fact that we sort on store size, which is cached. So
it might happen that the store size that is taken into account is not the right
one, which makes the indices sorted in the wrong order. This changes the doc
example to sort on the number of docs instead.
Closes#21062
* master:
Set vm.max_map_count on systemd package install
[TEST] reduce the number of snapshotted shards to 1 in testSnapshotSucceedsAfterSnapshotFailure() so that we are more likely to trigger I/O exceptions on writing the control files during the finalize phase of snapshotting (with the aim of triggering an I/O failure when writing pending-index-*).
Add documentation for Logger with Transport Client
Enable appender exceptions in UpdateSettingsIT
[TEST] remove AwaitsFix from testSnapshotSucceedsAfterSnapshotFailure, turns out the issue is specific to Java 9 v143
Cleanup formatting in UpdateSettingsIT.java
[TEST] mute the testSnapshotSucceedsAfterSnapshotFailure() test until its clear what is going wrong.
Mark SearchQueryIT test as awaits fix
Makes snapshot throttling test go much faster (#21485)
Breaking changes docs for template index_patterns
[TEST] adds randomness between atomic and non-atomic move operations in MockRepository
Cache successful shard deletion checks (#21438)
Task cancellation command should wait for all child nodes to receive cancellation request before returning
* master:
ShardActiveResponseHandler shouldn't hold to an entire cluster state
Ensures cleanup of temporary index-* generational blobs during snapshotting (#21469)
Remove (again) test uses of onModule (#21414)
[TEST] Add assertBusy when checking for pending operation counter after tests
Revert "Add trace logging when aquiring and releasing operation locks for replication requests"
Allows multiple patterns to be specified for index templates (#21009)
[TEST] fixes rebalance single shard check as it isn't guaranteed that a rebalance makes sense and the method only tests if rebalance is allowed
Document _reindex with random_score
0219a211d3 added support for templates
to have multiple patterns and renamed `template` to `index_patterns`.
This adds the breaking changes docs for that.
* master: (516 commits)
Avoid angering Log4j in TransportNodesActionTests
Add trace logging when aquiring and releasing operation locks for replication requests
Fix handler name on message not fully read
Remove accidental import.
Improve log message in TransportNodesAction
Clean up of Script.
Update Joda Time to version 2.9.5 (#21468)
Remove unused ClusterService dependency from SearchPhaseController (#21421)
Remove max_local_storage_nodes from elasticsearch.yml (#21467)
Wait for all reindex subtasks before rethrottling
Correcting a typo-Maan to Man-in README.textile (#21466)
Fix InternalSearchHit#hasSource to return the proper boolean value (#21441)
Replace all index date-math examples with the URI encoded form
Fix typos (#21456)
Adapt ES_JVM_OPTIONS packaging test to ubuntu-1204
Add null check in InternalSearchHit#sourceRef to prevent NPE (#21431)
Add VirtualBox version check (#21370)
Export ES_JVM_OPTIONS for SysV init
Skip reindex rethrottle tests with workers
Make forbidden APIs be quieter about classpath warnings (#21443)
...
This change was reverted after it caused random test failures. This was
due to a copy/paste error in the original PR which caused the mock
version of ClusterInfoService to be used whenever the mock *ZenPing* was
used, and the real ClusterInfoService to be used when MockZenPing was
not used.
* Allows for an array of index template patterns to be provided to an
index template, and rename the field from 'template' to 'index_pattern'.
Closes#20690
You can use `_reindex` and `random_score` to extract a random
subset of an index but you have to be careful to sort by `_score`
or it won't work.
Closes#21432
This commit introduces a new execution mode for the
`simple_query_string` query, which is intended down the road to be a
replacement for the current _all field.
It now does auto-field-expansion and auto-leniency when the following criteria
are ALL met:
The _all field is disabled
No default_field has been set in the index settings
No fields are specified in the request
Additionally, a user can force the "all-like" execution by setting the
all_fields parameter to true.
When executing in all field mode, the `simple_query_string` query will
look at all the fields in the mapping that are not metafields and can be
searched, and automatically expand the list of fields that are going to
be queried.
Relates to #20925, which is the `query_string` version of this work.
This is basically the same behavior, but for the `simple_query_string`
query.
Relates to #19784
Null safe dereferences make handling null or missing values shorter.
Compare without:
```
if (ctx._source.missing != null && ctx._source.missing.foo != null) {
ctx._source.foo_length = ctx.source.missing.foo.length()
}
```
To with:
```
Integer length = ctx._source.missing?.foo?.length();
if (length != null) {
ctx._source.foo_length = length
}
```
Combining this with the as of yet unimplemented elvis operator allows
for very concise defaults for nulls:
```
ctx._source.foo_length = ctx._source.missing?.foo?.length() ?: 0;
```
Since you have to start somewhere, we started with null safe dereferenes.
Anyway, this is a feature borrowed from groovy. Groovy allows writing to
null values like:
```
def v = null
v?.field = 'cat'
```
And the writes are simply ignored. Painless doesn't support this at this
point because it'd be complex to implement and maybe not all that useful.
There is no runtime cost for this feature if it is not used. When it is
used we implement it fairly efficiently, adding a jump rather than a
temporary variable.
This should also work fairly well with doc values.
This commit removes some references to 5.x that were picked up when the
migration docs for the cat API were migrated from 5.x to master.
Relates #21342
This commit adds migration docs for the cat API, including a note
regarding the change in response in the cat thread pool API for
unbounded queue sizes.
Relates #21342
We plan to deprecate `_suggest` during 5.0 so it isn't worth fixing
it to support the `_source` parameter for `_source` filtering. But we
should fix the docs so they are accurate.
Since this removes the last non-`// CONSOLE` line in
`completion-suggest.asciidoc` this also removes it from the list of
files that have non-`// CONSOLE` docs.
Closes#20482
Adds support for `?slices=N` to reindex which automatically
parallelizes the process using parallel scrolls on `_uid`. Performance
testing sees a 3x performance improvement for simple docs
on decent hardware, maybe 30% performance improvement
for more complex docs. Still compelling, especially because
clusters should be able to get closer to the 3x than the 30%
number.
Closes#20624
Exist requests are supposed to never throw an exception, but rather return true or false depending on whether some resource exists or not. Indices exists does that for indices and accepts wildcard expressions too. The way the api works internally is by resolving indices and catching IndexNotFoundException: if an exception is thrown the index does not exist hence it returns false, otherwise it returns true. That works ok only if ignore_unavailable and allow_no_indices indices options are both set to false, meaning that they are strict and any missing index or wildcard expressions that resolves to no indices will lead to an exception that can be thrown and cause false to be returned.
Unfortunately the indices options have been configurable up until now for this request, meaning that one can set ignore_unavailable or allow_no_indices to true and have the indices exist request return true for indices that really don't exist, which makes very little sense in the context of this api.
This commit removes the indicesOptions setter from the IndicesExistsRequest and makes settable only expandWildcardsOpen and expandWildcardsClosed, hence a subset of the available indices options. This way we can guarantee more consistent behaviour of the indices exists api. We can then remove the ignore_unavailable and allow_no_indices option from indices exists api spec
Moves the `_flush` in the `_cat/indices` snippets testing framework
to the very first test. We need to flush super early because index
size is cached for a few seconds so we really need to read a
consistent size on the first read so we can sort by it properly.
Closes#21062
This commit introduces a new execution mode for the query_string query, which
is intended down the road to be a replacement for the current _all field.
It now does auto-field-expansion and auto-leniency when the following criteria
are ALL met:
The _all field is disabled
No default_field has been set in the index settings
No default_field has been set in the request
No fields are specified in the request
Additionally, a user can force the "all-like" execution by setting the
all_fields parameter to true.
When executing in all field mode, the query_string query will look at all the
fields in the mapping that are not metafields and can be searched, and
automatically expand the list of fields that are going to be queried.
Relates to #19784
The important settings docs previously referred to a section regarding
the node.max_local_storage_nodes setting. This section was removed, but
the link was not. This commit removes that link.
Previously node.max_local_storage_nodes defaulted to fifty, and this
permitted users to start multiple instances of Elasticsearch sharing the
same data folder. This can be dangerous, and usually it does not make
sense to run more than one instance of Elasticsearch on a single
server. Because of this, we had a note in the important settings docs
advising users to set this setting to one. However, we have since
changed the default value of this setting to one so this advise is no
longer needed.
Relates #21305
This query is deprecated from 5.0 on. Similar to IndicesQueryBuilder we should
log a deprecation warning whenever this query is used.
Relates to #15760
Lucene 6.2 introduces the new `Analyzer.normalize` API, which allows to apply
only character-level normalization such as lowercasing or accent folding, which
is exactly what is needed to process queries that operate on partial terms such
as `prefix`, `wildcard` or `fuzzy` queries. As a consequence, the
`lowercase_expanded_terms` option is not necessary anymore. Furthermore, the
`locale` option was only needed in order to know how to perform the lowercasing,
so this one can be removed as well.
Closes#9978
This change adds an option called `split_on_whitespace` which prevents the query parser to split free text part on whitespace prior to analysis. Instead the queryparser would parse around only real 'operators'. Default to true.
For instance the query `"foo bar"` would let the analyzer of the targeted field decide how the tokens should be splitted.
Some options are missing in this change but I'd like to add them in a follow up PR in order to be able to simplify the backport in 5.x. The missing options (changes) are:
* A `type` option which similarly to the `multi_match` query defines how the free text should be parsed when multi fields are defined.
* Simple range query with additional tokens like ">100 50" are broken when `split_on_whitespace` is set to false. It should be possible to preserve this syntax and make the parser aware of this special syntax even when `split_on_whitespace` is set to false.
* Since all this options would make the `query_string_query` very similar to a match (multi_match) query we should be able to share the code that produce the final Lucene query.
* Add docs with up to date instructions on updating default similarity
The default similarity can no longer be set in the configuration file
(you will get an error on startup). Update the docs with the method
that works.
* Add instructions for changing similarity on index creation
It was 10mb and that was causing trouble when folks reindex-from-remoted
with large documents.
We also improve the error reporting so it tells folks to use a smaller
batch size if they hit a buffer size exception. Finally, adds some docs
to reindex-from-remote mentioning the buffer and giving an example of
lowering the size.
Closes#21185
Refactored ScriptType to clean up some of the variable and method names. Added more documentation. Deprecated the 'in' ParseField in favor of 'stored' to match the indexed scripts being replaced by stored scripts.
Adds support for indexing into lists and arrays with negative
indexes meaning "counting from the back". So for if
`x = ["cat", "dog", "chicken"]` then `x[-1] == "chicken"`.
This adds an extra branch to every array and list access but
some performance testing makes it look like the branch predictor
successfully predicts the branch every time so there isn't a
in execution time for this feature when the index is positive.
When the index is negative performance testing showed the runtime
is the same as writing `x[x.length - 1]`, again, presumably thanks
to the branch predictor.
Those performance metrics were calculated for lists and arrays but
`def`s get roughly the same treatment though instead of inlining
the test they need to make a invoke dynamic so we don't screw up
maps.
Closes#20870
Lucene 6.3 is expected to be released in the next weeks so it'd be good to give
it some integration testing. I had to upgrade randomized-testing too so that
both Lucene and Elasticsearch are on the same version.
Converts docs for `_cat/segments`, `_cat/plugins` and `_cat/repositories`
from `curl` to `// CONSOLE` so they are tested as part of the build and
are cleaner to use in Console. They should work fine with `curl` with
the `COPY AS CURL` link.
Also swaps the `source` type of the response from `js` to `txt` because
that is more correct. The syntax highlighter doesn't care. It looks at
the text to figure out the language. So it looks a little funny for `_cat`
responses regardless.
Relates to #18160
On some systems, cgroups will be available but not configured. And in
some cases, cgroups will be configured, but not for the subsystems that
we are expecting (e.g., cpu and cpuacct). This commit strengthens the
handling of cgroup stats on such systems.
Relates #21094
This change adds a TypesQuery that checks if the disjunction of types should be rewritten to a MatchAllDocs query. The check is done only if the number of terms is below a threshold (16 by default and configurable via max_boolean_clause).
This allows you to whitelist `localhost:*` or `127.0.10.*:9200`.
It explicitly checks for patterns like `*` in the whitelist and
refuses to start if the whitelist would match everything. Beyond
that the user is on their own designing a secure whitelist.
This commit fixes two issues with the slow log docs:
- clarifies that these settings are per index
- updates index slow log configuration for Log4j 2
Relates #20976
It is important that folks understand that snapshot/restore isn't
for archiving. It is appropriate for backup and disaster recovery
but not for archival over long periods of time because of version
incompatibility.
Closes#20866
Relates to #18160
Uses the new sorting (#20658) in the `_cat` API to support all use
cases natively. We can still resort to piping things through `sort`
if we need to, but we don't have to for basic stuff like sorting!
* Adding built-in sorting capability to _cat apis.
Closes#16975
* addressing pr comments
* changing value types back to original implementation and fixing cosmetic issues
* Changing compareTo, hashCode of value types to a better implementation
* Changed value compareTos to use Double.compare instead of if statements + fixed some failed unit tests
The shards preference on a search request enables specifying a list of
shards to hit, and then a secondary preference (e.g., "_primary") can be
added. Today, the separator between the shards list and the secondary
preference is ';'. Unfortunately, this is also a valid separtor for URL
query parameters. This means that a preference like "_shards:0;_primary"
will be parsed into two URL parameters: "_shards:0" and "_primary". With
the recent change to strict URL parsing, the second parameter will be
rejected, "_primary" is not a valid URL parameter on a search
request. This means that this feature has never worked (unless the ';'
is escaped, but no one does that because our docs do not that, and there
was no indication from Elasticsearch that this did not work). This
commit changes the separator to '|'.
Relates #20786
This change proposes the removal of all non-tcp transport implementations. The
mock transport can be used by default to run tests instead of local transport that has
roughly the same performance compared to TCP or at least not noticeably slower.
This is a master only change, deprecation notice in 5.x will be committed as a
separate change.
Previously, this doc was using a field called "content". This is
confusing, especially when the doc starts talking about the content of
the content field. This change makes the field name "comment" which
is less ambiguous and also changes some related field names in the doc
to make a consistent example theme of editing docs around blog posts.
This causes the snippets to be tested during the build and gives
helpful links to the reader to open the docs in console or copy them
as curl commands.
Relates to #18160
Today when parsing a request, Elasticsearch silently ignores incorrect
(including parameters with typos) or unused parameters. This is bad as
it leads to requests having unintended behavior (e.g., if a user hits
the _analyze API and misspell the "tokenizer" then Elasticsearch will
just use the standard analyzer, completely against intentions).
This commit removes lenient URL parameter parsing. The strategy is
simple: when a request is handled and a parameter is touched, we mark it
as such. Before the request is actually executed, we check to ensure
that all parameters have been consumed. If there are remaining
parameters yet to be consumed, we fail the request with a list of the
unconsumed parameters. An exception has to be made for parameters that
format the response (as opposed to controlling the request); for this
case, handlers are able to provide a list of parameters that should be
excluded from tripping the unconsumed parameters check because those
parameters will be used in formatting the response.
Additionally, some inconsistencies between the parameters in the code
and in the docs are corrected.
Relates #20722
On Windows the JDK uses `CreateFileW` which has a stupidly high
limit for the number of `Handle`s it can make - `16 * 1024 * 1024`.
So this isn't really a problem on Windows at all.
Closes#20732
today it's not possible to use date-math efficiently with the `_rollover`
API. This change adds support for date-math in the target index as well as
support for preserving the math logic when an existing index that was created with
a date math expression all subsequent indices are created with the same expression.
this change adds a hard limit to `index.number_of_shard` that prevents
indices from being created that have more than 1024 shards. This is still
a huge limit and can only be changed via settings a system property.
* master: (1199 commits)
[DOCS] Remove non-valid link to mapping migration document
Revert "Default `include_in_all` for numeric-like types to false"
test: add a test with ipv6 address
docs: clearify that both ip4 and ip6 addresses are supported
Include complex settings in settings requests
Add production warning for pre-release builds
Clean up confusing error message on unhandled endpoint
[TEST] Increase logging level in testDelayShards()
change health from string to enum (#20661)
Provide error message when plugin id is missing
Document that sliced scroll works for reindex
Make reindex-from-remote ignore unknown fields
Remove NoopGatewayAllocator in favor of a more realistic mock (#20637)
Remove Marvel character reference from guide
Fix documentation for setting Java I/O temp dir
Update client benchmarks to log4j2
Changes the API of GatewayAllocator#applyStartedShards and (#20642)
Removes FailedRerouteAllocation and StartedRerouteAllocation
IndexRoutingTable.initializeEmpty shouldn't override supplied primary RecoverySource (#20638)
Smoke tester: Adjust to latest changes (#20611)
...
Surprise! You can use sliced scroll to easily parallelize reindex
and friend. They support it because they use the same infrastructure
as a regular search to parse the search request. While we would like
to make an "automatic" option for parallelizing reindex, this manual
option works right now and is pretty convenient!
This commit fixes the documentation for configuring the Java I/O temp
dir which incorrectly suggested using the -D flag as a parameter on the
command line; these flags have been removed and should now be specified
as arguments to the JVM using either the ES_JAVA_OPTS environment
variable or using the jvm.options configuration file.
Closes#20652
* plugins/discovery-azure-class.asciidoc
* reference/cluster.asciidoc
* reference/modules/cluster/misc.asciidoc
* reference/modules/indices/request_cache.asciidoc
After this is merged there will be no unconvereted snippets outside
of `reference`.
Related to #18160
Adds a cat api endpoint: /_cat/templates and its more specific version, /_cat/templates/{name}.
It looks something like:
$ curl "localhost:9200/_cat/templates?v"
name template order version
sushi_california_roll *avocado* 1 1
pizza_hawaiian *pineapples* 1
pizza_pepperoni *pepperoni* 1
The specified version (only allows * globs) looks like:
$ curl "localhost:9200/_cat/templates/pizza*"
name template order version
pizza_hawaiian *pineapples* 1
pizza_pepperoni *pepperoni* 1
Partially specified columns:
$ curl "localhost:9200/_cat/templates/pizza*?v=true&h=name,template"
name template
pizza_hawaiian *pineapples*
pizza_pepperoni *pepperoni*
The help text:
$ curl "localhost:9200/_cat/templates/pizza*?help"
name | n | template name
template | t | template pattern string
order | o | template application order number
version | v | version
Closes#20467
With the unified release process across the elastic stack, download
links for all products are changing. This change updates docs referring
to the old download and packages urls.
Note that this change also updates the plugin installation command as
the url for downloads is being changed to be consistent with that for
packages (both plural).
The serial collector is not suitable for running with a server
application like Elasticsearch and can decimate performance and lead to
cluster instability. This commit adds a bootstrap check to prevent usage
of the serial collector when Elasticsearch is running in production
mode.
Relates #20558
This tracks the snippets that probably should be converted to
`// CONSOLE` or `// TESTRESPONSE` and fails the build if the list
of files with such snippets doesn't match the list in `docs/build.gradle`.
Setting the file looks like
```
/* List of files that have snippets that probably should be converted to
* `// CONSOLE` and `// TESTRESPONSE` but have yet to be converted. Try and
* only remove entries from this list. When it is empty we'll remove it
* entirely and have a party! There will be cake and everything.... */
buildRestTests.expectedUnconvertedCandidates = [
'plugins/discovery-azure-classic.asciidoc',
...
'reference/search/suggesters/completion-suggest.asciidoc',
]
```
This list is in `build.gradle` because we expect it to be fairly
temporary. In a few months we'll have converted all of the docs and won't
ned it any more.
From now on if you add now docs that contain a snippet that shows an
interaction with elasticsearch you have three choices:
1. Stick `// CONSOLE` on the interactions and `// TESTRESPONSE` on the
responses. The build (specifically (`gradle docs:check`) will test that
these interactions "work". If there isn't a `// TESTRESPONSE` snippet
then "work" just means "Elasticsearch responds with a 200-level response
code and no `WARNING` headers. This is way better than nothing.
2. Add `// NOTCONSOLE` if the snippet isn't actually interacting with
Elasticsearch. This should only be required for stuff like javascript
source code or `curl` against an external service like AWS or GCE. The
snippet will not get "OPEN IN CONSOLE" or "COPY AS CURL" buttons or be
tested.
3. Add `// TEST[skip:reason]` under the snippet. This will just skip the
snippet in the test phase. This should really be reserved for snippets
where we can't test them because they require an external service that
we don't have at testing time.
Please, please, please, please don't add more things to the list. After
all, it sais there'll be cake when we remove it entirely!
Relates to #18160
We can now run templates using `explain` and/or `profile` parameters.
Which is interesting when you have defined a complicated profile but want to debug it in an easier way than running the full query again.
You can use `explain` parameter when running a template:
```js
GET /_search/template
{
"file": "my_template",
"params": {
"status": [ "pending", "published" ]
},
"explain": true
}
```
You can use `profile` parameter when running a template:
```js
GET /_search/template
{
"file": "my_template",
"params": {
"status": [ "pending", "published" ]
},
"profile": true
}
```
Funny node names have been removed in #19456 and replaced by UUID. This commit removes these obsolete node names and replace them by real UUIDs in the documentation.
closes#20065
`gender.keyword` should be used instead of just `gender` or we
get an error `Fielddata is disabled on text fields by default. Set
fielddata=true on [gender] in order to load fielddata in memory by
uninverting the inverted index. Note that this can however use
Closes#20535
significant memory.`
`cluster.routing.allocation.cluster_concurrent_rebalance` setting,
clarifying in which shard allocation situations the rebalance limit
takes effect.
Closes#20529
`profile.asciidoc` now runs all of its command but it doesn't validate
all of the results. Writing the validation is time consuming so I only
did some of it.
* Update rescoring docs in respect to sort
If sort is present in a query the rescore query is not executed. As long as this feature is neither implemented (see discussion in #6788) nor the combination of sort and rescoring raises an error, we should warn the user in the documentation about this.
* Missed a dot
In 7560101ec7, the Elasticsearch logger
names were modified to be their fully-qualified class name (with some
exceptions for special loggers like the slow logs and the transport
tracer). This commit updates the docs accordingly.
Relates #20475
With the cut over to LatLonPoint the geohash, geohash_precision, lat_lon, and geohash_prefix parameters have been removed. This commit fixes the doc build by removing the remaining dangling references to these removed parameters.
This change replaces the fields parameter with stored_fields when it makes sense.
This is dictated by the renaming we made in #18943 for the search API.
The following list of endpoint has been changed to use `stored_fields` instead of `fields`:
* get
* mget
* explain
The documentation and the rest API spec has been updated to cope with the changes for the following APIs:
* delete_by_query
* get
* mget
* explain
The `fields` parameter has been deprecated for the following APIs (it is replaced by _source filtering):
* update: the fields are extracted from the _source directly.
* bulk: the fields parameter is used but fields are extracted from the source directly so it is allowed to have non-stored fields.
Some APIs still have the `fields` parameter for various reasons:
* cat.fielddata: the fields paramaters relates to the fielddata fields that should be printed.
* indices.clear_cache: used to indicate which fielddata fields should be cleared.
* indices.get_field_mapping: used to filter fields in the mapping.
* indices.stats: get stats on fields (stored or not stored).
* termvectors: fields are retrieved from the stored fields if possible and extracted from the _source otherwise.
* mtermvectors:
* nodes.stats: the fields parameter is used to concatenate completion_fields and fielddata_fields so it's not related to stored_fields at all.
Fixes#20155
This commit adds a -q/--quiet option to Elasticsearch so that it does not log anything in the console and closes stdout & stderr streams. This is useful for SystemD to avoid duplicate logs in both journalctl and /var/log/elasticsearch/elasticsearch.log while still allows the JVM to print error messages in stdout/stderr if needed.
closes#17220
This commit adds a health status parameter to the cat indices API for
filtering on indices that match the specified status (green|yellow|red).
Relates #20393
Add docs to template support for _msearch
Relates to #10885
Relates to #15674
* Reference those docs from the rest api spec for _msearch/template support.
In 5.x we allowed this with a deprecation warning. This removes the code
added for that deprecation, requiring the cluster name to not be in the
data path.
Resolves#20391
This was an error-prone version type that allowed overriding previous
version semantics. It could cause primaries and replicas to be out of
sync however, so it has been removed.
Resolves#19769
Previous versions of Elasticsearch permitted unquoted JSON field names even though this is against the JSON spec. This leniency was disabled by default in the 5.x series of Elasticsearch but a backwards compatibility layer was added via a system property with the intention of removing this layer in 6.0.0. This commit removes this backwards compatibility layer.
Relates #20388
This includes:
- All regular numeric types such as int, long, scaled-float, double, etc
- IP addresses
- Dates
- Geopoints and Geoshapes
Relates to #19784
The collect_payloads parameter of the span_near query was previously
deprecated with the intention to be removed. This commit removes this
parameter.
Relates #20385
This was an error-prone version type that allowed overriding previous
version semantics. It could cause primaries and replicas to be out of
sync however, so it has been removed.
Resolves#19769
Hi all,
I was trying to run the percolate examples, but I figured that because of the "type":"keyword" , the code wasn't working.
In the saerch query the "message" : "A new bonsai tree in the office" is a pure string.
I changed it to "text".
Exposing lucene 6.x minhash tokenfilter
Generate min hash tokens from an incoming stream of tokens that can
be used to estimate document similarity.
Closes#20149
** The default script language is now maintained in `Script` class.
* Added `script.legacy.default_lang` setting that controls the default language for scripts that are stored inside documents (for example percolator queries). This defaults to groovy.
** Added `QueryParseContext#getDefaultScriptLanguage()` that manages the default scripting language. Returns always `painless`, unless loading query/search request in legacy mode then the returns what is configured in `script.legacy.default_lang` setting.
** In the aggregation parsing code added `ParserContext` that also holds the default scripting language like `QueryParseContext`. Most parser don't have access to `QueryParseContext`. This is for scripts in aggregations.
* The `lang` script field is always serialized (toXContent).
Closes#20122
and be much more stingy about what we consider a console candidate.
* Add `// CONSOLE` to check-running
* Fix version in some snippets
* Mark groovy snippets as groovy
* Fix versions in plugins
* Fix language marker errors
* Fix language parsing in snippets
This adds support for snippets who's language is written like
`[source, txt]` and `["source","js",subs="attributes,callouts"]`.
This also makes language required for snippets which is nice because
then we can be sure we can grep for snippets in a particular language.
- Using log() to indicate natural log can add some confusion when trying to further adjust/tweak scores. Other parts of the API (field_value_factor on this same page) use 'ln' and 'log', so this change should be more consistent
- Fixes#20027
- I generated the images using http://latex2png.com/ at a resolution of 150 which seemed to be about the same size as before
This commit configures the deprecation logs to be size-limited to 1 GB,
and compress these logs when they roll. The default configuration will
preserve up to four rolled logs.
Relates #20287
The mem section was buggy in cluster stats and removed. It is now added back with the same structure as in node stats, containing total memory, available memory, used memory and percentages. All the values are the sum of all the nodes across the cluster (or at least the ones that we were able to get the values from).
If elasticsearch controls the ID values as well as the documents
version we can optimize the code that adds / appends the documents
to the index. Essentially we an skip the version lookup for all
documents unless the same document is delivered more than once.
On the lucene level we can simply call IndexWriter#addDocument instead
of #updateDocument but on the Engine level we need to ensure that we deoptimize
the case once we see the same document more than once.
This is done as follows:
1. Mark every request with a timestamp. This is done once on the first node that
receives a request and is fixed for this request. This can be even the
machine local time (see why later). The important part is that retry
requests will have the same value as the original one.
2. In the engine we make sure we keep the highest seen time stamp of "retry" requests.
This is updated while the retry request has its doc id lock. Call this `maxUnsafeAutoIdTimestamp`
3. When the engine runs an "optimized" request comes, it compares it's timestamp with the
current `maxUnsafeAutoIdTimestamp` (but doesn't update it). If the the request
timestamp is higher it is safe to execute it as optimized (no retry request with the same
timestamp has been run before). If not we fall back to "non-optimzed" mode and run the request as a retry one
and update the `maxUnsafeAutoIdTimestamp` unless it's been updated already to a higher value
Relates to #19813
* master:
Avoid NPE in LoggingListener
Randomly use Netty 3 plugin in some tests
Skip smoke test client on JDK 9
Revert "Don't allow XContentBuilder#writeValue(TimeValue)"
[docs] Remove coming in 2.0.0
Don't allow XContentBuilder#writeValue(TimeValue)
[doc] Remove leftover from CONSOLE conversion
Parameter improvements to Cluster Health API wait for shards (#20223)
Add 2.4.0 to packaging tests list
Docs: clarify scale is applied at origin+offest (#20242)
* Params improvements to Cluster Health API wait for shards
Previously, the cluster health API used a strictly numeric value
for `wait_for_active_shards`. However, with the introduction of
ActiveShardCount and the removal of write consistency level for
replication operations, `wait_for_active_shards` is used for
write operations to represent values for ActiveShardCount. This
commit moves the cluster health API's usage of `wait_for_active_shards`
to be consistent with its usage in the write operation APIs.
This commit also changes `wait_for_relocating_shards` from a
numeric value to a simple boolean value `wait_for_no_relocating_shards`
to set whether the cluster health operation should wait for
all relocating shards to complete relocation.
* Addresses code review comments
* Don't be lenient if `wait_for_relocating_shards` is set
* master:
Increase visibility of deprecation logger
Skip transport client plugin installed on JDK 9
Explicitly disable Netty key set replacement
percolator: Fail indexing percolator queries containing either a has_child or has_parent query.
Make it possible for Ingest Processors to access AnalysisRegistry
Allow RestClient to send array-based headers
Silence rest util tests until the bogusness can be simplified
Remove unknown HttpContext-based test as it fails unpredictably on different JVMs
Tests: Improve rest suite names and generated test names for docs tests
Add support for a RestClient base path
The deprecation logger is an important way to make visible features of
Elasticsearch that are deprecated. Yet, the default logging makes the
log messages for the deprecation logger invisible. We want these log
messages to be visible, so the default logging for the deprecation
logger should enable these log messages. This commit changes the log
level of deprecation log message to warn, and configures the deprecation
logger so that these log messages are visible out of the box.
Relates #20254
While removing an index isn't actually an alias action, if we add
an alias action that deletes an index then we can delete and index
and add an alias with the same name as the index atomically, in
the same cluster state update.
Closes#20064
This commit adds the support for exclusion filter to the response filtering (filter_path) feature. It changes the XContentBuilder APIs so that it now accepts two types of filters: inclusive and exclusive. Filters are no more String arrays but sets of String instead.
This changes Elasticsearch to automatically downgrade `text` and
`keyword` fields into appropriate `string` fields when changing the
mapping of indexes imported from 2.x. This allows users to use the
modern, documented syntax against 2.x indexes. It also makes it clear
that reindexing in order to recreate the index in 5.0 is required for
any long lived indexes. This change is useful for the times when you
can't (cluster is just starting, not stable enough for reindex) or
shouldn't (index will only live 90 days or something).
This change adds a special field named _none_ that allows to disable the retrieval of the stored fields in a search request or in a TopHitsAggregation.
To completely disable stored fields retrieval (including disabling metadata fields retrieval such as _id or _type) use _none_ like this:
````
POST _search
{
"stored_fields": "_none_"
}
````
Today we do a lot of accounting inside the engine to maintain locations
of documents inside the transaction log. This is only needed to ensure
we can return the documents source from the engine if it hasn't been refreshed.
Aside of the added complexity to be able to read from the currently writing translog,
maintainance of pointers into the translog this also caused inconsistencies like different values
of the `_ttl` field if it was read from the tlog or not. TermVectors are totally different if
the document is fetched from the tranlog since copy fields are ignored etc.
This chance will simply call `refresh` if the documents latest version is not in the index. This
streamlines the semantics of the `_get` API and allows for more optimizations inside the engine
and on the transaction log. Note: `_refresh` is only called iff the requested document is not refreshed
yet but has recently been updated or added.
#Relates to #19787
Deprecates the optimize_bbox parameter on geodistance queries. This has no longer been needed since version 2.2 because lucene geo distance queries (postings and LatLonPoint) already optimize by bounding box.
Fix field examples to make documents actually visible
This commit adds refresh calls to field examples an removes not working
`_routing` and `_field_names` script access.
Closes#20118
This includes:
- All regular numeric types such as int, long, scaled-float, double, etc
- IP addresses
- Dates
- Geopoints and Geoshapes
Relates to #19784
Previously this was possible, which was problematic when issuing a
request like `DELETE /-myindex`, which was interpretted as "delete
everything except for myindex".
Resolves#19800
Most of the examples in the pipeline aggregation docs use a small
"sales" test data set and I converted all of the examples that use
it to `// CONSOLE`. There are still a bunch of snippets in the pipeline
aggregation docs that aren't `// CONSOLE` so they aren't tested. Most
of them are "this is the most basic form of this aggregation" so they
are more immune to errors and bit rot then the examples that I converted.
I'd like to do something with them as well but I'm not sure what.
Also, the moving average docs and serial diff docs didn't get a lot of
love from this pass because they don't use the test data set or follow
the same general layout.
Relates to #18160
Currently both `PUT` and `POST` can be used to create indices. This commit
removes support for `POST index_name` so that we can use it to index documents
with auto-generated ids once types are removed.
Relates #15613
In the example there was a alias removed and then a different alias created for the same index, but I think actually swapping a index by another one for the same alias would make more sense as an example here.
This commit defaults the max local storage nodes to one. The motivation
for this change is that a default value greather than one is dangerous
as users sometimes end up unknowingly starting a second node and start
thinking that they have encountered data loss.
Relates #19964
This commit rewords the expect header bug notice to provide the precise
details for the bug arising. In particular, the bug does not impact any
request over 1024 bytes, but instead impacts any request with a body
that is sent in two requests, the first with an Expect: 100-continue
header. The size is irrelevant, and requests with bodies larger than
1024 bytes are okay as long as the Expect: 100-continue header is not
also sent.
Relates #19911
>However, the version of the new cluster should be the same or newer than the cluster that was
Afaik, you can't restore a snapshot to a newer cluster that is not consecutively newer (i.e. can't restore 1.x snapshot to a 5.x cluster). This is to clarify the statement above moving forward.
When compiling many dynamically changing scripts, parameterized
scripts (<https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-using.html#prefer-params>)
should be preferred. This enforces a limit to the number of scripts that
can be compiled within a minute. A new dynamic setting is added -
`script.max_compilations_per_minute`, which defaults to 15.
If more dynamic scripts are sent, a user will get the following
exception:
```json
{
"error" : {
"root_cause" : [
{
"type" : "circuit_breaking_exception",
"reason" : "[script] Too many dynamic script compilations within one minute, max: [15/min]; please use on-disk, indexed, or scripts with parameters instead",
"bytes_wanted" : 0,
"bytes_limit" : 0
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "i",
"node" : "a5V1eXcZRYiIk8lecjZ4Jw",
"reason" : {
"type" : "general_script_exception",
"reason" : "Failed to compile inline script [\"aaaaaaaaaaaaaaaa\"] using lang [painless]",
"caused_by" : {
"type" : "circuit_breaking_exception",
"reason" : "[script] Too many dynamic script compilations within one minute, max: [15/min]; please use on-disk, indexed, or scripts with parameters instead",
"bytes_wanted" : 0,
"bytes_limit" : 0
}
}
}
],
"caused_by" : {
"type" : "general_script_exception",
"reason" : "Failed to compile inline script [\"aaaaaaaaaaaaaaaa\"] using lang [painless]",
"caused_by" : {
"type" : "circuit_breaking_exception",
"reason" : "[script] Too many dynamic script compilations within one minute, max: [15/min]; please use on-disk, indexed, or scripts with parameters instead",
"bytes_wanted" : 0,
"bytes_limit" : 0
}
}
},
"status" : 500
}
```
This also fixes a bug in `ScriptService` where requests being executed
concurrently on a single node could cause a script to be compiled
multiple times (many in the case of a powerful node with many shards)
due to no synchronization between checking the cache and compiling the
script. There is now synchronization so that a script being compiled
will only be compiled once regardless of the number of concurrent
searches on a node.
Relates to #19396
The payload option was introduced with the new completion
suggester implementation in v5, as a stop gap solution
to return additional metadata with suggestions.
Now we can return associated documents with suggestions
(#19536) through fetch phase using stored field (_source).
The additional fetch phase ensures that we only fetch
the _source for the global top-N suggestions instead of
fetching _source of top results for each shard.
This note in the delete api about broadcasting to all shards is a leftover that should have been removed when the broadcasting feature was removed
Relates to #10136
GeoDistance is implemented using a crazy enum that causes issues with the scripting modules. This commit moves all distance calculations to arcDistance and planeDistance static methods in GeoUtils. It also removes unnecessary distance helper methods from ScriptDocValues.GeoPoints.
This commit enables completion suggester to return documents
associated with suggestions. Now the document source is returned
with every suggestion, which respects source filtering options.
In case of suggest queries spanning more than one shard, the
suggest is executed in two phases, where the last phase fetches
the relevant documents from shards, implying executing suggest
requests against a single shard is more performant due to the
document fetch overhead when the suggest spans multiple shards.
Adds `warnings` syntax to the yaml test that allows you to expect
a `Warning` header that looks like:
```
- do:
warnings:
- '[index] is deprecated'
- quotes are not required because yaml
- but this argument is always a list, never a single string
- no matter how many warnings you expect
get:
index: test
type: test
id: 1
```
These are accessible from the docs with:
```
// TEST[warning:some warning]
```
This should help to force you to update the docs if you deprecate
something. You *must* add the warnings marker to the docs or the build
will fail. While you are there you *should* update the docs to add
deprecation warnings visible in the rendered results.
Today, when listing thread pools via the cat thread pool API, thread
pools are listed in a column-delimited format. This is unfriendly to
command-line tools, and inconsistent with other cat APIs. Instead,
thread pools should be listed in a row-delimited format.
Additionally, the cat thread pool API is limited to a fixed list of
thread pools that excludes certain built-in thread pools as well as all
custom thread pools. These thread pools should be available via the cat
thread pool API.
This commit improves the cat thread pool API by listing all thread pools
(built-in or custom), and by listing them in a row-delimited
format. Finally, for each node, the output thread pools are sorted by
thread pool name.
Relates #19721
Currently both aggregations really share the same implementation. This commit
splits the implementations so that regular histograms can support decimal
intervals/offsets and compute correct buckets for negative decimal values.
However the response API is still the same. So for intance both regular
histograms and date histograms will produce an
`org.elasticsearch.search.aggregations.bucket.histogram.Histogram`
aggregation.
The optimization to compute an identifier of the rounded value and the
rounded value itself has been removed since it was only used by regular
histograms, which now do the rounding themselves instead of relying on the
Rounding abstraction.
Closes#8082Closes#4847
* Rename operation to result and reworking responses
* Rename DocWriteResponse.Operation enum to DocWriteResponse.Result
These are just easier to interpret names.
Closes#19664
The current heuristic to compute a default shard size is pretty aggressive,
it returns `max(10, number_of_shards * size)` as a value for the shard size.
I think making it less aggressive has the benefit that it would reduce the
likelyness of running into OOME when there are many shards (yearly
aggregations with time-based indices can make numbers of shards in the
thousands) and make the use of breadth-first more likely/efficient.
This commit replaces the heuristic with `size * 1.5 + 10`, which is enough
to have good accuracy on zipfian distributions.
* Update gateway.asciidoc
Added a note to clarify that, in cases where nodes in a cluster have different setting, the node that is the elected master takes precedence over anything else.
* Update gateway.asciidoc
Updated as per @bleskes's comments
Performing the bulk request shown in #19267 now results in the following:
```
{"_index":"test","_type":"test","_id":"1","_version":1,"_operation":"create","forced_refresh":false,"_shards":{"total":2,"successful":1,"failed":0},"status":201}
{"_index":"test","_type":"test","_id":"1","_version":1,"_operation":"noop","forced_refresh":false,"_shards":{"total":2,"successful":1,"failed":0},"status":200}
```
This change adds a new special path to the buckets_path syntax
`_bucket_count`. This new option will return the number of buckets for a
multi-bucket aggregation, which can then be used in pipeline
aggregations.
Closes#19553
This adds new circuit breaking with the "request" breaker, which adds
circuit breaks based on the number of buckets created during
aggregations. It consists of incrementing during AggregatorBase creation
This also bumps the REQUEST breaker to 60% of the JVM heap now.
The output when circuit breaking an aggregation looks like:
```json
{
"shard" : 0,
"index" : "i",
"node" : "a5AvjUn_TKeTNYl0FyBW2g",
"reason" : {
"type" : "exception",
"reason" : "java.util.concurrent.ExecutionException: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: CircuitBreakingException[[request] Data too large, data for [<agg [otherthings]>] would be larger than limit of [104857600/100mb]];",
"caused_by" : {
"type" : "execution_exception",
"reason" : "QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: CircuitBreakingException[[request] Data too large, data for [<agg [myagg]>] would be larger than limit of [104857600/100mb]];",
"caused_by" : {
"type" : "circuit_breaking_exception",
"reason" : "[request] Data too large, data for [<agg [otherthings]>] would be larger than limit of [104857600/100mb]",
"bytes_wanted" : 104860781,
"bytes_limit" : 104857600
}
}
}
}
```
Relates to #14046
With #19140 we started persisting the node ID across node restarts. Now that we have a "stable" anchor, we can use it to generate a stable default node name and make it easier to track nodes over a restarts. Sadly, this means we will not have those random fun Marvel characters but we feel this is the right tradeoff.
On the implementation side, this requires a bit of juggling because we now need to read the node id from disk before we can log as the node node is part of each log message. The PR move the initialization of NodeEnvironment as high up in the starting sequence as possible, with only one logging message before it to indicate we are initializing. Things look now like this:
```
[2016-07-15 19:38:39,742][INFO ][node ] [_unset_] initializing ...
[2016-07-15 19:38:39,826][INFO ][node ] [aAmiW40] node name set to [aAmiW40] by default. set the [node.name] settings to change it
[2016-07-15 19:38:39,829][INFO ][env ] [aAmiW40] using [1] data paths, mounts [[ /(/dev/disk1)]], net usable_space [5.5gb], net total_space [232.6gb], spins? [unknown], types [hfs]
[2016-07-15 19:38:39,830][INFO ][env ] [aAmiW40] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-07-15 19:38:39,837][INFO ][node ] [aAmiW40] version[5.0.0-alpha5-SNAPSHOT], pid[46048], build[473d3c0/2016-07-15T17:38:06.771Z], OS[Mac OS X/10.11.5/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_51/25.51-b03]
[2016-07-15 19:38:40,980][INFO ][plugins ] [aAmiW40] modules [percolator, lang-mustache, lang-painless, reindex, aggs-matrix-stats, lang-expression, ingest-common, lang-groovy, transport-netty], plugins []
[2016-07-15 19:38:43,218][INFO ][node ] [aAmiW40] initialized
```
Needless to say, settings `node.name` explicitly still works as before.
The commit also contains some clean ups to the relationship between Environment, Settings and Plugins. The previous code suggested the path related settings could be changed after the initial Environment was changed. This did not have any effect as the security manager already locked things down.
Add parser for anonymous char_filters/tokenizer/token_filters
Using Settings in AnalyzeRequest for anonymous definition
Add breaking changes document
Closed#8878
Remove `ParseField` constants used for names where there are no deprecated
names and just use the `String` version of the registration method instead.
This is step 2 in cleaning up the plugin interface for extending
search time actions. Aggregations are next.
This is breaking for plugins because those that register a new query should
now implement `SearchPlugin` rather than `onModule(SearchModule)`.
Previously if the size of the search request was greater than zero we would not cache the request in the request cache.
This change retains the default behaviour of not caching requests with size > 0 but also allows the `request_cache=true` query parameter
to enable the cache for requests with size > 0
This is a tentative to revive #15939 motivated by elastic/beats#1941.
Half-floats are a pretty bad option for storing percentages. They would likely
require 2 bytes all the time while they don't need more than one byte.
So this PR exposes a new `scaled_float` type that requires a `scaling_factor`
and internally indexes `value*scaling_factor` in a long field. Compared to the
original PR it exposes a lower-level API so that the trade-offs are clearer and
avoids any reference to fixed precision that might imply that this type is more
accurate (actually it is *less* accurate).
In addition to being more space-efficient for some use-cases that beats is
interested in, this is also faster that `half_float` unless we can improve the
efficiency of decoding half-float bits (which is currently done using software)
or until Java gets first-class support for half-floats.
Today the default precision for the cardinality aggregation depends on how many
parent bucket aggregations it had. The reasoning was that the more parent bucket
aggregations, the more buckets the cardinality had to be computed on. And this
number could be huge depending on what the parent aggregations actually are.
However now that we run terms aggregations in breadth-first mode by default when
there are sub aggregations, it is less likely that we have to run the cardinality
aggregation on kagilions of buckets. So we could use a static default, which will
be less confusing to users.
* Removed `Template` class and unified script & template parsing logic. Templates are scripts, so they should be defined as a script. Unless there will be separate template infrastructure, templates should share as much code as possible with scripts.
* Removed ScriptParseException in favour for ElasticsearchParseException
* Moved TemplateQueryBuilder to lang-mustache module because this query is hard coded to work with mustache only
This should make them easier to read and adds them to the test suite
I changed the example from a two node cluster to a single node cluster
because that is what we have running in the integration tests. It is also
what a user just starting out is likely to see so I think that is ok.
Invocation counts can be used to help judge the selectivity of individual query components in the context of the entire query. E.g. a query may not look selective when run by itself (matches most of the index), but when run in context of a full search request, is evaluated only rarely due to execution order
Since this is modifying the base timing class, it'll enrich both query and agg profiles (as well as future profile results)
Today `node.mode` and `node.local` serve almost the same purpose, they
are a shortcut for `discovery.type` and `transport.type`. If `node.local: true`
or `node.mode: local` is set elasticsearch will start in _local_ mode which means
only nodes within the same JVM are discovered and a non-network based transport
is used. The _local_ mode it only really used in tests or if nodes are embedded.
For both, embedding and tests explicit configuration via `discovery.type` and `transport.type`
should be preferred.
This change removes all the usage of these settings and by-default doesn't
configure a default transport implemenation since netty is now a module. Yet, to make
the user expericence flawless, plugins or modules can set a `http.type.default` and
`transport.type.default`. Plugins set this via `PluginService#additionalSettings()`
which enforces _set-once_ which prevents node startup if set multiple times. This means
that our distributions will just startup with netty transport since it's packaged as a
module unless `transport.type` or `http.transport.type` is explicitly set.
This change also found a bunch of bugs since several NamedWriteables were not registered if a
transport client is used. Now that we don't rely on the `node.mode` leniency which is inherited
instead of using explicit settings, `TransportClient` uses `AssertingLocalTransport` which detects these problems since it serializes all messages.
Closes#16234
This commit removes support for properties syntax and config files:
- removed support for elasticsearch.properties
- removed support for logging.properties
- removed support for properties content detection in REST APIs
- removed support for properties content detection in Java API
Relates #19398
Switches most search behavior extensions from push (`onModule(SearchModule)`)
to pull (`implements SearchPlugin`). This effort in general gives plugin
authors a much cleaner view of how to extend Elasticsearch and starts to
set up portions of Elasticsearch as "the plugin API". This commit in
particular does that for search-time behavior like customized suggesters,
highlighters, score functions, and significance heuristics.
It also switches most such customization to being done at search module
construction time which is much, much easier to reason about from a testing
perspective. It also helps significantly in the process of de-guice-ing
Elasticsearch's startup.
There are at least two major search time extensions that aren't covered in
this commit that will simply have to wait for the next commit on the topic
because this one has already grown large: custom aggregations and custom
queries. These will likely live in the same SearchPlugin interface as well.
If there are percolator queries containing `range` queries with ranges based on the current time then this can lead to incorrect results if the `percolate` query gets cached. These ranges are changing each time the `percolate` query gets executed and if this query gets cached then the results will be based on how the range was at the time when the `percolate` query got cached.
The ExtractQueryTermsService has been renamed `QueryAnalyzer` and now only deals with analyzing the query (extracting terms and deciding if the entire query is a verified match) . The `PercolatorFieldMapper` is responsible for adding the right fields based on the analysis the `QueryAnalyzer` has performed, because this is highly dependent on the field mappings. Also the `PercolatorFieldMapper` is responsible for creating the percolate query.
Today when a thread encounters a fatal unrecoverable error that
threatens the stability of the JVM, Elasticsearch marches on. This
includes out of memory errors, stack overflow errors and other errors
that leave the JVM in a questionable state. Instead, the Elasticsearch
JVM should die when these errors are encountered. This commit causes
this to be the case.
Relates #19272
* master: (192 commits)
[TEST] Fix rare OBOE in AbstractBytesReferenceTestCase
Reindex from remote
Rename writeThrowable to writeException
Start transport client round-robin randomly
Reword Refresh API reference (#19270)
Update fielddata.asciidoc
Fix stored_fields message
Add missing footer notes in mapper size docs
Remote BucketStreams
Add doc values support to the _size field in the mapper-size plugin
Bump version to 5.0.0-alpha5.
Update refresh.asciidoc
Update shrink-index.asciidoc
Change Debian repository for Vagrant debian-8 box
[TEST] fix test to account for internal empyt reference optimization
Upgrade to netty 3.10.6.Final (#19235)
[TEST] fix histogram test when extended bounds overlaps data
Remove redundant modifier
Simplify TcpTransport interface by reducing send code to a single send method (#19223)
Fix style violation in InstallPluginCommand.java
...
This adds a remote option to reindex that looks like
```
curl -POST 'localhost:9200/_reindex?pretty' -d'{
"source": {
"remote": {
"host": "http://otherhost:9200"
},
"index": "target",
"query": {
"match": {
"foo": "bar"
}
}
},
"dest": {
"index": "target"
}
}'
```
This reindex has all of the features of local reindex:
* Using queries to filter what is copied
* Retry on rejection
* Throttle/rethottle
The big advantage of this version is that it goes over the HTTP API
which can be made backwards compatible.
Some things are different:
The query field is sent directly to the other node rather than parsed
on the coordinating node. This should allow it to support constructs
that are invalid on the coordinating node but are valid on the target
node. Mostly, that means old syntax.
This change activates the doc_values on the _size field for indices created after 5.0.0-alpha4.
It also adds a note in the breaking changes that explain the situation and how to get around it.
Closes#18334
Node IDs are currently randomly generated during node startup. That means they change every time the node is restarted. While this doesn't matter for ES proper, it makes it hard for external services to track nodes. Another, more minor, side effect is that indexing the output of, say, the node stats API results in creating new fields due to node ID being used as keys.
The first approach I considered was to use the node's published address as the base for the id. We already [treat nodes with the same address as the same](https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/discovery/zen/NodeJoinController.java#L387) so this is a simple change (see [here](https://github.com/elastic/elasticsearch/compare/master...bleskes:node_persistent_id_based_on_address)). While this is simple and it works for probably most cases, it is not perfect. For example, if after a node restart, the node is not able to bind to the same port (because it's not yet freed by the OS), it will cause the node to still change identity. Also in environments where the host IP can change due to a host restart, identity will not be the same.
Due to those limitation, I opted to go with a different approach where the node id will be persisted in the node's data folder. This has the upside of connecting the id to the nodes data. It also means that the host can be adapted in any way (replace network cards, attach storage to a new VM). I
It does however also have downsides - we now run the risk of two nodes having the same id, if someone copies clones a data folder from one node to another. To mitigate this I changed the semantics of the protection against multiple nodes with the same address to be stricter - it will now reject the incoming join if a node exists with the same id but a different address. Note that if the existing node doesn't respond to pings (i.e., it's not alive) it will be removed and the new node will be accepted when it tries another join.
Last, and most importantly, this change requires that *all* nodes persist data to disk. This is a change from current behavior where only data & master nodes store local files. This is the main reason for marking this PR as breaking.
Other less important notes:
- DummyTransportAddress is removed as we need a unique network address per node. Use `LocalTransportAddress.buildUnique()` instead.
- I renamed `node.add_lid_to_custom_path` to `node.add_lock_id_to_custom_path` to avoid confusion with the node ID which is now part of the `NodeEnvironment` logic.
- I removed the `version` paramater from `MetaDataStateFormat#write` , it wasn't really used and was just in the way :)
- TribeNodes are special in the sense that they do start multiple sub-nodes (previously known as client nodes). Those sub-nodes do not store local files but derive their ID from the parent node id, so they are generated consistently.
Rename `fields` to `stored_fields` and add `docvalue_fields`
`stored_fields` parameter will no longer try to retrieve fields from the _source but will only return stored fields.
`fields` will throw an exception if the user uses it.
Add `docvalue_fields` as an adjunct to `fielddata_fields` which is deprecated. `docvalue_fields` will try to load the value from the docvalue and fallback to fielddata cache if docvalues are not enabled on that field.
Closes#18943
We introduced a special response_body assertion to test our docs snippets. The match assertion does the same job though and can be reused and adapted where needed. ResponseBodyAssertion contains provides much better and accurate errors though, which can be now utilized in MatchAssertion so that many more REST tests can benefit from readable error messages.
Each response body gets always stashed and can be retrieved for later evaluations already. Instead of providing the response body as strings that get parsed to json objects separately, then converted to maps as ResponseBodyAssertion did, we parse everything once, the json is part of the yaml test, which is supported. The only downside is that json comments cannot be used, rather yaml comments should be used (// C style vs # ). There were only two docs tests that were using comments in ingest-node.asciidoc where I went ahead and remove the comments which didn't seem that useful anyways.
Update-By-Query and Delete-By-Query use internal versioning to update/delete documents. But documents can have a version number equal to zero using the external versioning... making the UBQ/DBQ request fail because zero is not a valid version number and they only support internal versioning for now. Sequence numbers might help to solve this issue in the future.
As discussed at https://github.com/elastic/elasticsearch-cloud-azure/issues/91#issuecomment-229113595, we know that the current `discovery-azure` plugin only works with Azure Classic VMs / Services (which is somehow Legacy now).
The proposal here is to rename `discovery-azure` to `discovery-azure-classic` in case some users are using it.
And deprecate it for 5.0.
Closes#19144.
`RestHandler`s are highly tied to actions so registering them in the
same place makes sense.
Removes the need to for plugins to check if they are in transport client
mode before registering a RestHandler - `getRestHandlers` isn't called
at all in transport client mode.
This caused guice to throw a massive fit about the circular dependency
between NodeClient and the allocation deciders. I broke the circular
dependency by registering the actions map with the node client after
instantiation.
This pull request adds two util functions to the Mustache templating engine:
- {{#toJson}}my_map{{/toJson}} to render a Map parameter as a JSON string
- {{#join}}my_iterable{{/join}} to render any iterable (including arrays) as a comma separated list of values like `1, 2, 3`. It's also possible de change the default delimiter (comma) to something else.
closes#18970
These are useful methods in groovy that give you control over
the replacements used:
```
'the quick brown fox'.replaceAll(/[aeiou]/,
m -> m.group().toUpperCase(Locale.ROOT))
```
Instead of implementing onModule(ActionModule) to register actions,
this has plugins implement ActionPlugin to declare actions. This is
yet another step in cleaning up the plugin infrastructure.
While I was in there I switched AutoCreateIndex and DestructiveOperations
to be eagerly constructed which makes them easier to use when
de-guice-ing the code base.
This commit modifies TimeValue parsing to keep the input time unit. This
enables round-trip parsing from instances of String to instances of
TimeValue and vice-versa. With this, this commit removes support for the
unit "w" representing weeks, and also removes support for fractional
values of units (e.g., 0.5s).
Relates #19102
This commit fixes several NPEs caused by implicitly performing a get request for a document that exists with its _source disabled and then trying to access the source. Instead of causing an NPE the following queries will throw an exception with a "source disabled" message (similar behavior as if the document does not exist).:
- GeoShape query for pre-indexed shape (throws IllegalArgumentException)
- Percolate query for an existing document (throws IllegalArgumentException)
A Terms query with a lookup will ignore the document if the source does not exist (same as if the document does not exist).
GET and HEAD requests for the document _source will return a 404 if the source is disabled (even if the document exists).
Instead of plugins calling `registerTokenizer` to extend the analyzer
they now instead have to implement `AnalysisPlugin` and override
`getTokenizer`. This lines up extending plugins in with extending
scripts. This allows `AnalysisModule` to construct the `AnalysisRegistry`
immediately as part of its constructor which makes testing anslysis
much simpler.
This also moves the default analysis configuration into `AnalysisModule`
which is how search is setup.
Like `ScriptModule`, `AnalysisModule` no longer extends `AbstractModule`.
Instead it is only responsible for building `AnslysisRegistry`. We still
bind `AnalysisRegistry` but we only do so in `Node`. This is means it
is available at module construction time so we slowly remove the need to
bind it in guice.
This commit fixes several NPEs caused by implicitly performing a get request for a document that exists with its _source disabled and then trying to access the source. Instead of causing an NPE the following queries will throw an exception with a "source disabled" message (similar behavior as if the document does not exist).:
- GeoShape query for pre-indexed shape (throws IllegalArgumentException)
- Percolate query for an existing document (throws IllegalArgumentException)
A Terms query with a lookup will ignore the document if the source does not exist (same as if the document does not exist).
GET and HEAD requests for the document _source will return a 404 if the source is disabled (even if the document exists).
This moves the "Performance Considerations for Elasticsearch Indexing" blog post
to the reference guide and adds similar recommendations for tuning disk usage
and search speed.
* master: (416 commits)
docs: removed obsolete information, percolator queries are not longer loaded into jvm heap memory.
Upgrade JNA to 4.2.2 and remove optionality
[TEST] Increase timeouts for Rest test client (#19042)
Update migrate_5_0.asciidoc
Add ThreadLeakLingering option to Rest client tests
Add a MultiTermAwareComponent marker interface to analysis factories. #19028
Attempt at fixing IndexStatsIT.testFilterCacheStats.
Fix docs build.
Move templates out of the Search API, into lang-mustache module
revert - Inline reroute with process of node join/master election (#18938)
Build valid slices in SearchSourceBuilderTests
Docs: Convert aggs/misc to CONSOLE
Docs: migration notes for _timestamp and _ttl
Group client projects under :client
[TEST] Add client-test module and make client tests use randomized runner directly
Move upgrade test to upgrade from version 2.3.3
Tasks: Add completed to the mapping
Fail to start if plugin tries broken onModule
Remove duplicated read byte array methods
Rename `fields` to `stored_fields` and add `docvalue_fields`
...
This commit moves template support out of the Search API to its own dedicated Search Template API in the lang-mustache module. It provides a new SearchTemplateAction that can be used to render templates before it gets delegated to the usual Search API. The current REST endpoint are identical, but the Render Search Template endpoint now uses the same Search Template API with a new "simulate" option. When this option is enabled, the Search Template API only renders template and returns immediatly, without executing the search.
Closes#17906
We aren't able to actually create an index with _timestamp enabled
to test the migration, or, at least, we won't be able to after #18980
is re-merged. But the docs are still ok.
Closes#19007
If a plugin declares `onModule(SomethingThatIsntAModule)` then refuse
to start. Before this commit we just logged a warning that flies by in
the console and is easy to miss. You can't miss refusing to start!
`stored_fields` parameter will no longer try to retrieve fields from the _source but will only return stored fields.
`fields` will throw an exception if the user uses it.
Add `docvalue_fields` as an adjunct to `fielddata_fields` which is deprecated. `docvalue_fields` will try to load the value from the docvalue and fallback to fielddata cache if docvalues are not enabled on that field.
Closes#18943
The default similarity was set to `classic` which refers to TFIDF and has not been moved after the upgrade to Lucene 6.
Though moving to BM25 could have some downside for queries that relies on coordination factor (match_query, multi_match_query) ?
relates #18944
We wrote that the document is:
```json
{
"value" : ["foo", "bar", "baz"]
}
```
But the processor is using a `values` field:
```json
{
"foreach" : {
"field" : "values",
"processors" : [
// ...
]
}
}
```
It should be `values`.
ES only sends a non-200 response all shards fail but we should
fail the tests generated by docs if any of them fail.
Depending on the outcome of #18978 this might be a temporary
workaround.
The MMapDirectory has a switch that allows the content of files to be loaded
into the filesystem cache upon opening. This commit exposes it with the new
`index.store.pre_load` setting.
This commit removes the search preference _only_node as the same
functionality can be obtained by using the search preference
_only_nodes. This commit also adds a test that ensures that _only_nodes
will continue to support specifying node IDs.
Relates #18875
This commit adds a note to the breaking changes docs that since commit
da74323141, thread pool settings are no
longer cluster-level settings and thus not dynamically updatable.
Painless: Add support for //m
Painless: Add support for //s
Painless: Add support for //i
Painless: Add support for //u
Painless: Add support for //U
Painless: Add support for //l
This means "literal" and is exposed for completeness sake with
the java api.
Painless: Add support for //c
c enables Java's CANON_EQ (canonical equivalence) flag which makes
unicode characters that are canonically equal match. Java's javadoc
gives "a\u030A" being equal to "\u00E5". That is that the "a" code
point followed by the "combining ring above" code point is equal to
the "a with combining ring above" code point.
Update docs and add multi-flag test
Whitelist most of the Pattern class.
Adds support for the find operator (=~) and the match operator (==~)
to painless's regexes. Also whitelists most of the Matcher class and
documents regex support in painless.
The find operator (=~) returns a boolean that is the result of building
a matcher on the lhs with the Pattern on the RHS and calling `find` on
it. Use it like this:
```
if (ctx._source.last =~ /b/)
```
The match operator (==~) returns boolean like find but instead of calling
`find` on the Matcher it calls `matches`.
```
if (ctx._source.last ==~ /[^aeiou].*[aeiou]/)
```
Finally, if you want the actual matcher you do:
```
Matcher m = /[aeiou]/.matcher(ctx._source.last)
```
They have been implemented in https://issues.apache.org/jira/browse/LUCENE-7289.
Ranges are implemented so that the accuracy loss only occurs at index time,
which means that if you are searching for values between A and B, the query will
match exactly all documents whose value rounded to the closest half-float point
is between A and B.
Thread pool settings are no longer dynamically updatable since
da74323141. This commit removes a leftover
note from the thread pool module docs that incorrectly states that
thread pool settings are dynamically updatable.
The search preference _prefer_node allows specifying a single node to
prefer when routing a request. This functionality can be enhanced by
permitting multiple nodes to be preferred. This commit replaces the
search preference _prefer_node with the search preference _prefer_nodes
which supplants the former by specifying a single node and otherwise
adds functionality.
Relates #18872
This adds a get task API that supports GET /_tasks/${taskId} and
removes that responsibility from the list tasks API. The get task
API supports wait_for_complation just as the list tasks API does
but doesn't support any of the list task API's filters. In exchange,
it supports falling back to the .results index when the task isn't
running any more. Like any good GET API it 404s when it doesn't
find the task.
Then we change reindex, update-by-query, and delete-by-query to
persist the task result when wait_for_completion=false. The leads
to the neat behavior that, once you start a reindex with
wait_for_completion=false, you can fetch the result of the task by
using the get task API and see the result when it has finished.
Also rename the .results index to .tasks.
By default the number of searches msearch executes is capped by the number of
nodes multiplied with the default size of the search threadpool. This default can be
overwritten by using the newly added `max_concurrent_searches` parameter.
Before the msearch api would concurrently execute all searches concurrently. If many large
msearch requests would be executed this could lead to some searches being rejected
while other searches in the msearch request would succeed.
The goal of this change is to avoid this exhausting of the search TP.
Closes#17926
* master: (51 commits)
Switch QueryBuilders to new MatchPhraseQueryBuilder
Added method to allow creation of new methods on-the-fly.
more cleanups
Remove cluster name from data path
Remove explicit parallel new GC flag
rehash the docvalues in DocValuesSliceQuery using BitMixer.mix instead of the naive Long.hashCode.
switch FunctionRef over to methodhandles
ingest: Move processors from core to ingest-common module.
Fix some typos (#18746)
Fix ut
convert FunctionRef/Def usage to methodhandles.
Add the ability to partition a scroll in multiple slices. API:
use painless types in FunctionRef
Update ingest-node.asciidoc
compute functional interface stuff in Definition
Use method name in bootstrap check might fork test
Make checkstyle happy (add Lookup import, line length)
Don't hide LambdaConversionException and behave like real javac compiled code when a conversion fails. This works anyways, because fallback is allowed to throw any Throwable
Pass through the lookup given by invokedynamic to the LambdaMetaFactory. Without it real lambdas won't work, as their implementations are private to script class
checkstyle have your upper L
...
Previously Elasticsearch used $DATA_DIR/$CLUSTER_NAME/nodes for the path
where data is stored, this commit changes that to be $DATA_DIR/nodes.
On startup, if the old folder structure is detected it will be used.
This behavior will be removed in Elasticsearch 6.0
Resolves#17810
API:
```
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '{
"slice": {
"field": "_uid", <1>
"id": 0, <2>
"max": 10 <3>
},
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
```
<1> (optional) The field name used to do the slicing (_uid by default)
<2> The id of the slice
By default the splitting is done on the shards first and then locally on each shard using the _uid field
with the following formula:
`slice(doc) = floorMod(hashCode(doc._uid), max)`
For instance if the number of shards is equal to 2 and the user requested 4 slices then the slices 0 and 2 are assigned
to the first shard and the slices 1 and 3 are assigned to the second shard.
Each scroll is independent and can be processed in parallel like any scroll request.
Closes#13494
Today we allow to shrink to 1 shard but that might not be possible due to
too many document or a single shard doesn't meet the requirements for the index.
The logic can be expanded to N shards if the source index shards is a multiple of N.
This guarantees that there are not hotspots created due to different number of shards
being shrunk into one.