When nested objects are present in the mappings, many queries get deoptimized
due to the need to exclude documents that are not in the right space. For
instance, a filter is applied to all queries that prevents them from matching
non-root documents (`+*:* -_type:__*`). Moreover, a filter is applied to all
child queries of `nested` queries in order to make sure that the child query
only matches child documents (`_type:__nested_path`), which is required by
`ToParentBlockJoinQuery` (the Lucene query behind Elasticsearch's `nested`
queries).
These additional filters slow down `nested` queries. In 1.7 and earlier, the cost was
somewhat amortized by the fact that we cached filters very aggressively. However,
this has proven to be a significant source of slowdowns since 2.0 for users
of `nested` mappings and queries; see #20797.
This change makes the filtering a bit smarter. For instance, if the query is a
`match_all` query, then we need to exclude nested docs. However, if the query
is `foo: bar` then it may only match root documents since `foo` is a top-level
field, so no additional filtering is required.
Another improvement is to use a `FILTER` clause on all types rather than a
`MUST_NOT` clause on all nested paths when possible since `FILTER` clauses
are more efficient.
Here are some examples of queries and how they get rewritten:
```
"match_all": {}
```
This query gets rewritten to `ConstantScore(+*:* -_type:__*)` on master and
`ConstantScore(_type:AutomatonQuery {\norg.apache.lucene.util.automaton.Automaton@4371da44})`
with this change. The automaton is the complement of `_type:__*` so it matches
the same documents, but is faster since it is now a positive clause. Simplistic
performance testing on a 10M index where each root document has 5 nested
documents on average gave a latency of 420ms on master and 90ms with this change
applied.
```
"term": {
"foo": {
"value": "0"
}
}
```
This query is rewritten to `+foo:0 #(ConstantScore(+*:* -_type:__*))^0.0` on
master and `foo:0` with this change: we do not need to filter nested docs out
since the query cannot match nested docs. While doing performance testing in
the same conditions as above, response times went from 250ms to 50ms.
```
"nested": {
"path": "nested",
"query": {
"term": {
"nested.foo": {
"value": "0"
}
}
}
}
```
This query is rewritten to
`+ToParentBlockJoinQuery (+nested.foo:0 #_type:__nested) #(ConstantScore(+*:* -_type:__*))^0.0`
on master and `ToParentBlockJoinQuery (nested.foo:0)` with this change. Both the
top-level filter (`-_type:__*`) and the child filter (`#_type:__nested`) could be
removed: the former because `nested` queries only match documents of the parent
space, and the latter because the child query may only match nested docs, given
that the `nested` object has both `include_in_parent` and `include_in_root` set to
`false`. While doing performance testing in the same conditions as above,
response times went from 850ms to 270ms.
This pull request reuses the typed_keys parameter added in #22965, but this time it applies it to suggesters. When set to true, the suggester names in the search response will be prefixed with a prefix that reflects their type.
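For instance (an illustrative sketch; the suggester name and field are made up, and the separator is assumed to mirror the aggregation example below):
```
GET /_search?typed_keys
{
  "suggest": {
    "my-suggestion": {
      "text": "some text to correct",
      "term": {
        "field": "message"
      }
    }
  }
}
```
The suggester would then come back under a prefixed key such as `term:my-suggestion` instead of `my-suggestion`.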
It looked like it was telling you to add both `// TEST[continued]`
and `// TESTRESPONSE` to the same snippet which doesn't do anything.
Instead it was trying to say that you can have many snippets included
into the same test by doing:
```
request1
```
// CONSOLE
```
response1
```
// TESTRESPONSE
```
request2
```
// CONSOLE
// TEST[continued]
```
response2
```
// TESTRESPONSE
This change removes the SearchResponseListener that was used by the ExpandCollapseSearchResponseListener to expand collapsed hits.
The removal of SearchResponseListener is not a breaking change because it was never released.
This change also replaces the blocking call in ExpandCollapseSearchResponseListener with a single asynchronous multi search request. The parallelism of the expand request can be set via CollapseBuilder#max_concurrent_group_searches.
Closes #23048
This pull request adds a new parameter to the REST Search API named `typed_keys`. When set to true, the aggregation names in the search response will be prefixed with a prefix that reflects the internal type of the aggregation.
Here is a simple example:
```
GET /_search?typed_keys
{
  "aggs": {
    "tweets_per_user": {
      "terms": {
        "field": "user"
      }
    }
  },
  "size": 0
}
```
And the response:
```
{
  "aggs": {
    "sterms:tweets_per_user": {
      ...
    }
  }
}
```
This parameter is intended to make life easier for REST clients that can parse the prefix back in order to detect the type of aggregation to parse. It could also be implemented for suggesters.
This commit removes support for the `application/x-ldjson` Content-Type header as this was only used in the first draft
of the spec and had very little uptake. Additionally, the docs for bulk and msearch have been updated to specifically
call out ndjson and mention that the newline character may be preceded by a carriage return.
Finally, the bulk request handling of the carriage return has been improved to remove this character from the source.
Closes #23025
Elasticsearch v5.0.0 uses allocation IDs to safely allocate primary shards whereas prior versions of ES used a version-based mode instead. Elasticsearch v5 still has support for version-based primary shard allocation as it needs to be able to load 2.x shards. ES v6 can drop the legacy support.
Since `_all` is now deprecated and cannot be set for new indices, we should also
disallow any field that has the `include_in_all` parameter set.
Resolves #22923
These need to be CONSOLEified *now* because we're starting to
require Content-Type headers and they didn't have any.
* cluster/reroute: Marked as CONSOLE but skipped because the docs
build runs with a single node.
* docs/bulk: Marked as NOTCONSOLE because the snippets describe
either examples or `curl` commands. Fixed the `curl` command to
include the `Content-Type` header.
* query-dsl/terms-query: Marked as CONSOLE.
* search/request/rescore: Marked as CONSOLE. Fixed deprecated
syntax.
Relates #23001
Relates #18160
This adds the `COPY AS CURL` and `VIEW IN CONSOLE` links to the docs
and causes the snippets to be tested during Elasticsearch's build.
Relates to #18160
This adds the `COPY AS CURL` and `VIEW IN CONSOLE` buttons to the
docs and makes the build execute the snippets as part of `docs:check`.
Relates to #18160
Painless uses Ruby-like method dispatch (receiver type, method name,
and arity) rather than Java-like (receiver type, method name, and
argument compile-time types) or Groovy-like method dispatch (receiver
type, method name, and argument run-time types). We do this for
mostly good reasons but we never documented it.
Relates to #22720
Today either all nodes in the cluster connect to remote clusters or only nodes
that have remote clusters configured in their node config. To allow global remote
cluster configuration but restrict connections to a subset of nodes in the cluster,
this change adds a new setting `search.remote.connect` (defaults to `true`) that
allows disabling remote cluster connections on a per-node basis.
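A minimal sketch of opting a single node out in its `elasticsearch.yml`:
```yml
# this node will not open connections to configured remote clusters
search.remote.connect: false
```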
This changes the way that replica failures are handled such that not all
failures will cause the replica shard to be failed or marked as stale.
In some cases such as refresh operations, or global checkpoint syncs, it is
"okay" for the operation to fail without the shard being failed (because no data
is out of sync). In these cases, instead of failing the shard we should simply
fail the operation, and, in the event it is a user-facing operation, return a
5xx response code including the shard-specific failures.
This was accomplished by having two forms of the `Replicas` proxy, one that is
for non-write operations that does not fail the shard, and one that is for write
operations that will fail the shard when an operation fails.
Relates to #10708
It was accidentally renamed `enabled_position_increment` in the cleanups
for 5.0. This adds `enable_position_increment` as a deprecated alias
so it will continue to work.
This commit removes the following queries and parameters (which were deprecated in 5.0):
* GeoPointDistanceRangeQuery
* coerce, and ignore_malformed for GeoBoundingBoxQuery, GeoDistanceQuery, GeoPolygonQuery, and GeoDistanceSort
This change also removes the reference to the difference between full name and index name.
They are always the same since 2.x and `name` no longer automatically refers to `author.name`.
A simple pattern must be used instead.
Remove redundant code that checks the field name twice.
This change adds a strict mode for xcontent parsing on the rest layer. The strict mode will be off by default for 5.x and in a separate commit will be enabled by default for 6.0. The strict mode, which can be enabled by setting `http.content_type.required: true` in 5.x, will require that all incoming rest requests have a valid and supported content type header before the request is dispatched. In the non-strict mode, the Content-Type header will be inspected and if it is not present or not valid, we will continue with auto detection of content like we have done previously.
The content type header is parsed to the matching XContentType value with the only exception being for plain text requests. This value is then passed on with the content bytes so that we can reduce the number of places where we need to auto-detect the content type.
As part of this, many transport requests and builders were updated to provide methods that
accepted the XContentType along with the bytes and the methods that would rely on auto-detection have been deprecated.
In the non-strict mode, deprecation warnings are issued whenever a request with body doesn't provide the Content-Type header.
See #19388
GeoDistance query, sort, and scripts make use of a crazy GeoDistance enum for handling 4 different ways of computing geo distance: SLOPPY_ARC, ARC, FACTOR, and PLANE. Only two of these are necessary: ARC, PLANE. This commit removes SLOPPY_ARC, and FACTOR and cleans up the way Geo distance is computed.
Implemented by wrapping an array of reused `MutableDateTime`s that
we grow when needed. The `MutableDateTime`s are reused when we
move to the next document.
Also improves the error message returned when attempting to modify
the `ScriptDocValues`, removes a couple of allocations, and documents
that the date functions are available in Painless.
Relates to #22162
Currently, stored scripts use a namespace of (lang, id) to be put, get, deleted, and executed. This is not necessary since the lang is stored with the stored script. A user should only have to specify an id to use a stored script. This change makes that possible while keeping backwards compatibility with the previous namespace of (lang, id). Anywhere the previous namespace is used will log deprecation warnings.
The new behavior is the following:
When a user specifies a stored script, that script will be stored under both the new namespace and old namespace.
Take for example script 'A' with lang 'L0' and data 'D0'. If we add script 'A' to the empty set, the scripts map will be {"A" -> D0, "A#L0" -> D0}. If a script 'A' with lang 'L1' and data 'D1' is then added, the scripts map will be {"A" -> D1, "A#L1" -> D1, "A#L0" -> D0}.
When a user deletes a stored script, that script will be deleted from both the new namespace (if it exists) and the old namespace.
Take for example a scripts map of {"A" -> D1, "A#L1" -> D1, "A#L0" -> D0}. If a script is removed by specifying id 'A' and lang null, then the scripts map will be {"A#L0" -> D0}. To remove the final script, the deprecated namespace must be used, so an id 'A' and lang 'L0' would need to be specified.
When a user gets/executes a stored script, if the new namespace is used then the script will be retrieved/executed using only 'id', and if the old namespace is used then the script will be retrieved/executed using 'id' and 'lang'.
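For example, a search request can now reference a script by id alone; a minimal sketch, assuming the 5.x stored-script reference syntax and a hypothetical script id 'A':
```
GET _search
{
  "query": {
    "script": {
      "script": {
        "stored": "A"
      }
    }
  }
}
```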
* Integrate UnifiedHighlighter
This change integrates the Lucene highlighter called "unified" into the list of supported highlighters for ES.
This highlighter can extract offsets from postings, term vectors, or by re-analyzing text.
The best strategy is picked automatically at query time and depends on the field and the query to highlight.
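A minimal sketch of selecting it at search time (the field name is illustrative):
```
GET _search
{
  "query": {
    "match": { "body": "quick brown fox" }
  },
  "highlight": {
    "fields": {
      "body": { "type": "unified" }
    }
  }
}
```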
Added missing CONSOLE scripts to documentation for sampler and diversified_sampler aggs.
Includes new StackOverflow index setup in build.gradle
Closes #22746
* Formatting tweaks
This change removes the ability to set region for s3 repositories.
Endpoint should be used instead if a custom s3 location needs to be
used.
closes #22758
Follow-up of #22857, where we deprecated automatic creation of azure containers.
BTW I found that the `AzureSnapshotRestoreServiceIntegTests` does not bring any value because it basically runs a snapshot/restore operation on local files, which we already test in core.
So instead of trying to fix it to make it pass with this PR, I simply removed it.
This PR adds a new option for `host_type`: `tag:TAGNAME` where `TAGNAME` is the tag field you defined for your ec2 instance.
For example, if you defined a tag `my-elasticsearch-host` in ec2 and set it to `myhostname1.mydomain.com`, then
setting `host_type: tag:my-elasticsearch-host` will tell the EC2 discovery plugin to read the host name from the
`my-elasticsearch-host` tag. In this case, it will be resolved to `myhostname1.mydomain.com`.
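In `elasticsearch.yml` this looks like (assuming the fully-qualified setting name `discovery.ec2.host_type`, with the tag name from the example above):
```yml
# read the host name from the my-elasticsearch-host ec2 tag
discovery.ec2.host_type: "tag:my-elasticsearch-host"
```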
Closes #22566.
Adds "Appending B. Painless API Reference", a reference of all classes
and methods available from Painless. Removes links to java packages
because they contain methods that we don't expose and don't contain
methods that we do expose (the ones in Augmentation). Instead this
generates a list of every class and every exposed method using the same
type information available to the
interpreter/compiler/whatever-we-call-it. From there you can jump to
the relevant docs.
Right now you build all the asciidoc files by running
```
gradle generatePainlessApi
```
These files are expected to be committed because we build the docs
without running `gradle`.
Also changes the output of `Debug.explain` so that it is easy to
search for the class in the generated reference documentation.
You can also run it in an IDE safely if you pass the path to the
directory in which to generate the docs as the first parameter. It'll
blow away the entire directory and recreate it from scratch, so be careful.
And then you can build the docs by running something like:
```
../docs/build_docs.pl --out ../built_docs/ --doc docs/reference/index.asciidoc --open
```
That is, if you have checked out https://github.com/elastic/docs in
`../docs`. Wait a minute or two and your browser will pop open with
all of Elasticsearch's reference documentation. If you go to
`http://localhost:8000/painless-api-reference.html` you can see this
list. Or you can get there by following the links to `Modules` and
`Scripting` and `Painless` and then clicking the link in the paragraphs
below titled `Appendix B. Painless API Reference`.
I like having these in asciidoc because we can deep link to them from the
rest of the guide with constructs like
`<<painless-api-reference-Object-hashCode-0>>` and
`<<painless-api-reference->>` and we get link checking. Then the only
brittle link maintenance bit is the link generation for javadoc. Which
sucks. But I think it is important that we link to the methods directly
so they are easy to find.
Relates to #22720
Reported at: https://discuss.elastic.co/t/combine-elasticsearch-5-1-1-and-repository-hdfs/69659
If you define, as described [in our docs](https://www.elastic.co/guide/en/elasticsearch/plugins/current/repository-hdfs-config.html), the following `elasticsearch.yml` settings:
```yml
repositories:
  hdfs:
    uri: "hdfs://es-master:9000/" # optional - Hadoop file-system URI
    path: "some/path" # required - path within the file-system where data is stored/loaded
```
It fails at startup because we don't register the global setting `repositories.hdfs.path` in `HdfsPlugin`.
This PR removes that from our docs so people must provide those settings only when registering the repository with:
```
PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/repositories/my_hdfs_repository",
    "conf.dfs.client.read.shortcircuit": "true"
  }
}
```
Based on issue #22800.
Closes #22301
Add unit tests for `TopHitsAggregator` and convert some snippets in
docs for `top_hits` aggregation to `// CONSOLE`.
Relates to #22278
Relates to #18160
This adds the necessary `AuthCache` needed to support preemptive authorization. By adding every host to the cache, the automatically added `RequestAuthCache` interceptor will add credentials on the first pass rather than waiting to do it after _each_ anonymous request is rejected (thus always sending everything twice when basic auth is required).
There was a typo in the `ParseField` declaration. I know
we want to port these parsers to `ObjectParser` eventually
but I don't have the energy for that today and want to get
this fixed.
Closes #22722
When users need to specify a custom location for configuration files,
they also need to specify a custom location for the jvm.options file yet
our docs are absent in this regard. This commit adds a note to the
rolling upgrade docs explaining this situation.
Relates #22747
* Add top hits collapsing to search request
The field collapsing is done with a custom top docs collector that "collapses" search hits with the same field value.
The distributed aspect is resolved using the same two passes that the regular search uses. The first pass "collapses" the top hits, then the coordinating node merges/collapses the top hits from each shard.
```
GET _search
{
  "collapse": {
    "field": "category"
  }
}
```
This change also adds an ExpandCollapseSearchResponseListener that intercepts the search response and expands collapsed hits using the CollapseBuilder#innerHit options.
The retrieval of each inner_hits is done by sending a query to all shards filtered by the collapse key.
```
GET _search
{
  "collapse": {
    "field": "category",
    "inner_hits": {
      "size": 2
    }
  }
}
```
Adds the `VIEW IN CONSOLE` and `COPY AS CURL` links to the snippets
in the `value_count` docs and causes the build to execute the snippets
for testing.
Relates #18160
This adds the `VIEW IN CONSOLE` and `COPY AS CURL` links to the
snippets in the docs for the `date_range` aggregation and tests
those snippets as part of the build.
Relates to #18160
This adds the `VIEW IN SENSE` and `COPY AS CURL` links and has
the build automatically execute the snippets and verify that they
work.
Relates to #18160
Adds the `VIEW IN CONSOLE` and `COPY AS CURL` links to the example
`global` aggregation. Also improves the example by adding a
non-`global` aggregation to compare it to.
Relates to #18160
Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency matrix.
The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results.
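A minimal sketch of the request (the index, filter names, and field are illustrative):
```
GET emails/_search
{
  "size": 0,
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "filters": {
          "grpA": { "terms": { "accounts": ["alice", "bob"] } },
          "grpB": { "terms": { "accounts": ["carol", "dan"] } }
        }
      }
    }
  }
}
```
This yields buckets named `grpA`, `grpB`, and, when non-empty, `grpA&grpB`.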
Closes #22169
This PR removes all leniency in the conversion of Strings to booleans: "true"
is converted to the boolean value `true`, "false" is converted to the boolean
value `false`. Everything else raises an error.
Relates to #22024
On top of documentation, the PR adds deprecation loggers and deals with the resulting warning headers.
The yaml test is set to exclude versions up to 6.0. This is needed to make sure bwc tests pass until this is backported to 5.2.0. Once that's done, I will change the yaml test version limits.
This change makes it possible for custom routing values to go to a subset of shards rather than
just a single shard. This enables the ability to utilize the spatial locality that custom routing can
provide while mitigating the likelihood of ending up with an imbalanced cluster or suffering
from a hot shard.
This is ideal for large multi-tenant indices with custom routing that suffer from one or both of
the following:
- The big tenants cannot fit into a single shard or there are so many of them that they will likely
end up on the same shard
- Tenants often have a surge in write traffic and a single shard cannot process it fast enough
Beyond that, this should also be useful for use cases where most queries are done under the context
of a specific field (e.g. a category) since it gives a hint at how the data can be stored to minimize
the number of shards to check per query. While a similar solution can be achieved with multiple
concrete indices or aliases per value today, those approaches break down for high cardinality fields.
A partitioned index enforces that mappings have routing required, that the partition size does not
change when shrinking an index (the partitions will shrink proportionally), and rejects mappings
that have parent/child relationships.
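A minimal sketch of creating a partitioned index (assuming the setting name `index.routing_partition_size`; all values are illustrative):
```
PUT my_index
{
  "settings": {
    "index.number_of_shards": 8,
    "index.routing_partition_size": 2
  },
  "mappings": {
    "my_type": {
      "_routing": { "required": true }
    }
  }
}
```
Each routing value then maps to a partition of 2 of the 8 shards rather than a single shard.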
Closes #21585
This commit corrects the grammar in a list in the how-to docs; namely,
the word "and" was missing preceding the final element in a list.
Relates #22663
* Grammatical correction
* Add note for legacy string mapping type
* Update truncate token filter to not mention the keyword tokenizer
The advice predates the existence of the keyword field
Closes #22650
All the language clients support a special ignore parameter that doesn't get passed to elasticsearch with the request, but is used to indicate which error codes should not lead to an exception if returned for a specific request.
Moving this to the low level REST client will allow the high level REST client to make use of it too, for instance so that it doesn't have to intercept ResponseExceptions when the get api returns a 404.
Currently both ProfileResult and CollectorResult print the time field in a human readable string format
(e.g. "time": "55.20315000ms"). When trying to parse this back to a long value, for example to use in
the planned high level java rest client, we can lose precision because of conversion and rounding issues.
This change adds a new additional field (`time_in_nanos`) to the profile response to be able to get the
original time value in nanoseconds back.
The old `time` field is only printed when the `?human=true` flag in the url is set. This follows the behaviour of
all other stats-related apis. Also the format of the `time` field is slightly changed: instead of always formatting
the output as a 10-digit ms value, by using the `XContentBuilder#timeValueField()` method we now print
the value using the largest time unit present (e.g. "s", "ms", "micros").
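An illustrative fragment of a profile entry with `?human=true` set (values made up):
```
"time": "55.2ms",
"time_in_nanos": 55203150
```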
For certain situations, end-users need the base path for Elasticsearch
logs. Exposing this as a property is better than hard-coding the path
into the logging configuration file as otherwise the logging
configuration file could easily diverge from the Elasticsearch
configuration file. Additionally, Elasticsearch will only have
permissions to write to the log directory configured in the
Elasticsearch configuration file. This commit adds a property that
exposes this base path.
One use-case for this is configuring a rollover strategy to retain logs
for a certain period of time. As such, we add an example of this to the
documentation.
Additionally, we expose the property es.logs.cluster_name as this is
used as the name of the log files in the default configuration.
Finally, we expose es.logs.node_name in cases where node.name is
explicitly set in case users want to include the node name as part of
the name of the log files.
Relates #22625
This change disables the _all meta field by default.
Now that we have the "all-fields" method of query execution, we can save both
indexing time and disk space by disabling it.
_all can no longer be configured for indices created after 6.0.
Relates to #20925 and #21341. Resolves #19784
Today when you do not specify a port for an entry in
discovery.zen.ping.unicast.hosts, the default port is the value of the
setting transport.profiles.default.port and falls back to the value of
transport.tcp.port if this is not set. For a node that is explicitly
bound to a different port than the default port, this means that the
default port will be equal to this explicitly bound port. Yet, the docs
say that we fall back to 9300 here. This commit corrects the docs.
Relates #22568
Instead of `search.remote.seeds.${clustername}` we now specify the seeds as:
`search.remote.${clustername}.seeds` which is a real list setting compared to an unvalidated
group setting before.
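For example (the cluster alias and addresses are illustrative):
```yml
# seed nodes for the remote cluster aliased "cluster_one"
search.remote.cluster_one.seeds: ["127.0.0.1:9300", "127.0.0.1:9301"]
```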
Fix the `Dockerfile` example in the `Customizing image` third configuration
method by adding the missing RUN instruction.
Originally reported by Shankar Vasudevan (@vshank77).
Relates #21973
#9261 added a warning about the use of `add-apt-repository` which is becoming obsolete over time as new distribution releases include later versions of `add-apt-repository` which don't automatically add the `deb-src` line. This change updates the documentation to make the block a note rather than a warning and adds two other reasons for avoiding `add-apt-repository` which are still relevant: avoiding edits to a system shared file and not requiring a large number of non-default packages to add one line of text to a file.
The example
```
/<logstash-{now/d-2d}>,<logstash-{now/d-1d}>,<logstash-{now/d}>/_search
```
shows an escaped URL where `,` becomes `%2C`, so I assume it should be escaped and be present in the table.
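For reference, the fully escaped form of the example above is (derived by standard percent-encoding, where `<` = `%3C`, `>` = `%3E`, `{` = `%7B`, `}` = `%7D`, `/` = `%2F`, and `,` = `%2C`):
```
/%3Clogstash-%7Bnow%2Fd-2d%7D%3E%2C%3Clogstash-%7Bnow%2Fd-1d%7D%3E%2C%3Clogstash-%7Bnow%2Fd%7D%3E/_search
```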
This commit updates the cluster allocation explain API documentation to
explain the new request parameters and response formats, and gives
examples of the explain API responses under various scenarios.
This commit adds a document describing our data replication model in high level terms. The goal is to give people basic insight into how things work in order to better understand how reads and writes interact, both during normal operations and under failures.
* Promote longs to doubles when a terms agg mixes decimal and non-decimal number
This change makes the terms aggregation work when the buckets coming from different indices are a mix of decimal numbers and non-decimal numbers. In this case, non-decimal numbers (longs) are promoted to decimal (double), which can result in a loss of precision for big numbers.
Fixes #22232
Provides an example of using it and an example return description,
and explains that we've added descriptions for some tasks but not
even close to all of them, and that we expect to change the
descriptions as we learn more.
Closes #22407
* Fix example
Getting a single task is always detailed, no need to specify.
* Rewrite like imotov wants it
After deprecating getters and setters and the query DSL parameter in 5.x,
support for `minimum_number_should_match` can be removed entirely. Also
consolidated comments with the ones on 5.x branch and added an entry to the
migration docs.
The `Script source settings` section currently states that `false` means scripting is ENABLED.
The other sections seem to indicate that `false` means scripting is DISABLED.
If the current documentation is correct, that would imply that `inline` and `stored` scripting are ENABLED by default, which seems to conflict with all the other sections in the document.
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.
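A minimal sketch (the index, field, and filter choices are illustrative):
```
PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}
```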
Closes #18064
Today we try to pull stats from index writer but we do not get a
consistent view of stats. Under heavy indexing, this inconsistency can
be very skewed indeed. In particular, it can lead to the number of
deleted docs being reported as negative and this leads to serialization
issues. Instead, we should provide a consistent view of the stats by
using an index reader.
Relates #22317
This change is the first towards providing the ability to store
sensitive settings in elasticsearch. It adds the
`elasticsearch-keystore` tool, which allows managing a java keystore.
The keystore is loaded upon node startup in Elasticsearch, and used by
the Setting infrastructure when a setting is configured as secure.
There are a lot of caveats to this PR. The most important is that it only
provides the tool and setting infrastructure for secure strings. It does
not yet provide for keystore passwords, keypairs, certificates, or even
convert any existing string settings to secure string settings. Those
will all come in follow up PRs. But this PR was already too big, so this
at least gets a basic version of the infrastructure in.
There are two main things to look at. The first is the `SecureSetting` class,
which extends `Setting` but removes the assumption that the raw value of the
setting is a string. SecureSetting provides, for now, a single
helper, `stringSetting()` to create a SecureSetting which will return a
SecureString (which is like String, but is closeable, so that the
underlying character array can be cleared). The second is the
`KeyStoreWrapper` class, which wraps the java `KeyStore` to provide a
simpler api (we do not need the entire keystore api) and also extend
the serialized format to add metadata needed for loading the keystore
with no assumptions about keystore type (so that we can change this in
the future) as well as whether the keystore has a password (so that we
can know whether prompting is necessary when we add support for keystore
passwords).
* Remove a checked exception, replacing it with `ParsingException`.
* Remove all Parser classes for the yaml sections, replacing them with static methods.
* Remove `ClientYamlTestFragmentParser`. Isn't used any more.
* Remove `ClientYamlTestSuiteParseContext`, replacing it with some static utility methods.
I did not rewrite the parsers using `ObjectParser` because I don't think it is worth it right now.
Today we only expose `value_type` in scriptable aggregations, however it is
also useful with unmapped fields. I suspect we never noticed because
`value_type` was not documented (fixed) and most aggregations are scriptable.
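A sketch of using `value_type` on an unmapped field (the names are illustrative):
```
GET _search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": {
        "field": "not_yet_mapped_field",
        "value_type": "long"
      }
    }
  }
}
```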
Closes #20163
* Repeated language analyzers
The `catalan` analyzer was repeated in the supported list :)
* Reordered the languages into alphabetical order
* Added space for format
* Reordered the languages and removed the repeated one
Added a new section detailing how to use the attachment processor
within an array.
This reverts commit #22296 and instead links to the foreach processor.
Our `float`/`double` fields generally assume that `-0` compares less than `+0`,
except when bounds are exclusive: an exclusive lower bound on `-0` excludes
`+0` and an exclusive upper bound on `+0` excludes `-0`.
Closes #22167
With this commit, we introduce a cache to the geoip ingest processor.
The cache is enabled by default and caches the 1000 most recent items.
The cache size is controlled by the setting `ingest.geoip.cache_size`.
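For example, in `elasticsearch.yml`:
```yml
# cache up to 2000 geoip lookups instead of the default 1000
ingest.geoip.cache_size: 2000
```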
Closes #22074
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).
We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.
Relates elastic/elasticsearch#19614
Relates elastic/elasticsearch#22073
With this commit we change the data type of the 'TIMESTAMP'
meta-data field from a formatted date string to a plain
`java.util.Date` instance. The main reason for this change is
that our benchmarks have indicated that this contributes
significantly to the time spent in the ingest pipeline.
The overhead in terms of indexing throughput of the ingest
pipeline is about 15% and breaks down roughly as follows:
* 5% overhead caused by the conversion from `XContent` -> `Map`
* 5% overhead caused by the timestamp formatting
* 5% overhead caused by the conversion `Map` -> `XContent`
Relates #22074
We try to install a system call filter on various operating systems
(Linux, macOS, BSD, Solaris, and Windows) but the setting
(bootstrap.seccomp) to control this is named after the Linux
implementation (seccomp). This commit replaces this setting with
bootstrap.system_call_filter. For backwards compatibility reasons, we
fallback to bootstrap.seccomp and log a deprecation message if
bootstrap.seccomp is set. We intend to remove this fallback in
6.0.0. Note that now is the time to make this change as it's likely that
most users are not setting this anyway: prior to version 5.2.0
(currently unreleased) it was not necessary to configure anything to
enable a node to start up if the system call filter failed to install
(we marched on anyway), but starting in 5.2.0 it will be necessary in
this case.
Relates #22226
The JSON processor has an optional field called "target_field".
If you don't specify target_field then target_field becomes what you specified as "field".
There isn't any way to add the fields to the root of a document. By
setting `add_to_root`, serialized fields will now be inserted into the
top-level fields of the ingest document.
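A minimal pipeline sketch (the pipeline id and source field are illustrative):
```
PUT _ingest/pipeline/parse-json
{
  "processors": [
    {
      "json": {
        "field": "payload",
        "add_to_root": true
      }
    }
  ]
}
```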
Closes #21898.
When using a bulk processor in test, you might write something like:
```java
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    @Override public void beforeBulk(long executionId, BulkRequest request) {}
    @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {}
    @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) {}
})
    .setBulkActions(10000)
    .setFlushInterval(TimeValue.timeValueSeconds(10))
    .build();

for (int i = 0; i < 10000; i++) {
    bulkProcessor.add(new IndexRequest("foo", "bar", "doc_" + i)
        .source(jsonBuilder().startObject().field("foo", "bar").endObject()));
}

bulkProcessor.flush();
client.admin().indices().prepareRefresh("foo").get();
SearchResponse response = client.prepareSearch("foo").get();
// response does not contain any hit
```
The problem is that by default bulkProcessor sets the number of concurrent requests to 1, which behind the scenes uses an async BulkRequestHandler.
When you call `flush()` in a test, you expect it to flush all the content of the bulk so you can search for your docs.
But because of the async handling, there is a great chance that none of the documents has been indexed yet when you call the `refresh` method.
We should advise in our Java guide to explicitly set concurrent requests to `0` so that users get the sync BulkRequestHandler behind the scenes.
```java
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    @Override public void beforeBulk(long executionId, BulkRequest request) {}
    @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {}
    @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) {}
})
    .setBulkActions(5000)
    .setFlushInterval(TimeValue.timeValueSeconds(10))
    .setConcurrentRequests(0)
    .build();
```
Closes #22158.
This commit fixes a silly doc bug where the field that represents the
total CPU time consumed by all tasks in the same cgroup was mistakenly
reported as "usage" instead of "usage_nanos".
Relates #21029
Sends the `error_trace` parameter with all requests sent by the
yaml test framework, including the doc snippet tests. This can be
overridden by setting `error_trace: false`. While this drifts
core's handling of the yaml tests from the clients' slightly, this
should only be a problem for tests that rely on the default value,
both of which I've fixed by setting the value explicitly.
This also escapes `\n` and `\t` in the `Stash dump on failure` so
the `stack_trace` is more readable.
Also fixes `RestUpdateSettingsAction` to not think of the `error_trace`
parameter as a setting.
* Replace _suggest endpoint to _search in docs
In 5.0, the _suggest endpoint is just sugar for _search
with suggestions specified. Users should move away from
using the _suggest endpoint, as it is marked as deprecated in 5.x and
will be removed in 6.0
* update docs to use _search endpoint instead of _suggest
* Add deprecation logging to RestSuggestAction
* Use search endpoint instead of suggest endpoint in rest tests
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.
As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.
Closes #19614
When using dynamic templates, ES will now throw an exception if a
`match_mapping_type` is used that doesn't correspond to an actual type.
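For example, the following template remains valid, while a typo such as `"match_mapping_type": "strings"` would now be rejected (the template name and mapping are illustrative):
```
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}
```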
Relates to #17285
Our query DSL supports empty queries (`{}`), which have a different meaning depending on the query that holds it, either ignored, match_all or match_none. We deprecated the support for empty queries in 5.0, where we log a deprecation warning wherever they are used.
The way we supported it once we moved query parsing to the coordinating node was having an Optional<QueryBuilder> return type in all of our parse methods (called fromXContent). See #17624. The central place for this was QueryParseContext#parseInnerQueryBuilder. We can now remove all the optional return types and simply throw an exception whenever an empty query is found.
When we decided to deprecate and remove the fuzzy query in #15760, we didn't realize we would take away the possibility for users to use a fuzzy query as part of a span query, which is not possible using a match query. This means we have to go back and un-deprecate the fuzzy query, which will not be removed.
Closes #15760
This change allows specifying alias/wildcard expressions in indices_boost.
It also adds another format for specifying indices_boost: an array of index name and boost pairs.
If an index is included in multiple aliases/wildcard expressions, the first match will be used.
With the new format, the old format is marked as deprecated.
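A sketch of the new array format (index names and boosts are illustrative):
```
GET _search
{
  "indices_boost": [
    { "alias1": 1.4 },
    { "index*": 1.3 }
  ]
}
```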
Closes #4756
The documentation reads:
> You can disable this behavior by setting "detect_noop": false like this:
Followed by a code example, that originally set `"detect_noop": true`.
Please correct me if I got the change backwards (i.e. the paragraph should be changed to `true`), but this seems like it makes the most sense.
In 5.0, the search slow log switched to the multi-line format with no option to get back to the original single-line format that was used prior to 5.0 by default. This commit removes the reformat option from the search slow log and returns the search slow log to the single-line format.
Closes #21711
When overriding a systemd configuration via a drop-in file, the
[Service] header is required. This commit adds this to an example
drop-in override in the configuring docs.
Relates #22038
The use of the avg aggregation for sorting the terms aggregation is not encouraged since it has unbounded error. This changes the examples to use the max aggregation, which does not suffer the same issues.
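For example (field names are illustrative):
```
GET _search
{
  "size": 0,
  "aggs": {
    "genres": {
      "terms": {
        "field": "genre",
        "order": { "max_play_count": "desc" }
      },
      "aggs": {
        "max_play_count": { "max": { "field": "play_count" } }
      }
    }
  }
}
```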
* Include unindexed field in FieldStats response
This change adds non-searchable fields to the FieldStats response. These fields do not have min/max information but they can be aggregatable. Fields that are only stored in _source (store:no, index:no, doc_values:no) will still be missing since they do not have any useful information to show. Indices and clients must be at least on V_5_2_0 to see this change.
Changes the default socket and connection timeouts for the rest
client from 10 seconds to the more generous 30 seconds.
Defaults reindex-from-remote to those timeouts and make the
timeouts configurable like so:
```
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "socket_timeout": "1m",
      "connect_timeout": "10s"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
```
Closes #21707
Today if system call filters fail to install on startup, we log a
message but otherwise march on. This might leave users without system
call filters installed not knowing that they have implicitly accepted
the additional risk. We should not be lenient like this, instead clearly
informing the user that they have to either fix their configuration or
accept the risk of not having system call filters installed. This commit
adds a bootstrap check that if system call filters are enabled, they
must successfully install.
Relates #21940
The reference for the jvm.options docs recently changed from
es-java-opts to jvm-options. This commit fixes a broken reference that
arose as a result of this change.
Elasticsearch can be run in a few different ways:
- from the command line on Linux and Windows
- as a service on Linux and Windows
on both 32-bit client and 64-bit server VMs. We strive for a great
out-of-the-box experience with any of these combinations but today it is
lacking on 32-bit client JVMs and on the Windows service. There are two
deficiencies that arise:
- on any 32-bit client JVM we fail to start out of the box because we
force the server JVM in jvm.options
- when installing the Windows service, the thread stack size must be
specified in jvm.options
This commit attempts to address these deficiencies.
We should continue to force the server JVM because there are systems
where the server JVM is not active by default (e.g., the 32-bit JDK on
Windows). This does mean that if a user tries to run with a client JVM
they will see a failure message at startup but this is the best that we
can do if we want to continue to force the server JVM. Thus, this commit
at least documents this situation.
To improve the situation with installing the Windows service, this
commit adds a default setting for the thread stack size. This default is
chosen based on the default thread stack size across all 64-bit server
JVMs. This means that if a user tries to run with a 32-bit JVM they
could otherwise see significantly higher memory usage (this situation is
complicated, it's really only on Windows where the extra memory usage is
egregious, but cutting into the 32-bit address space on any system is
bad). So this commit makes it so that the out-of-the-box experience is
improved for the Windows service on 64-bit server JVMs and we document
the need to adjust this setting on 32-bit JVMs.
Again, we are focusing on the out-of-the-box experience here and this
means optimizing for the best experience on any 64-bit server JVM as
this covers the vast majority of the user base. The users that are on
32-bit JVMs will suffer a little bit but at least now any user on any
64-bit server JVM can start Elasticsearch out of the box.
Finally, we fix some references to the jvm.options documentation.
Relates #21920
During package install on systemd-based systems, we try to set
vm.max_map_count. On some systems (e.g., containers), users do not have
the ability to tune these parameters from within the container. This
commit provides an option for these users to skip setting such kernel
parameters.
Relates #21899
It used to be a hybrid store between `niofs` and `mmapfs`, which we removed when
we switched to `fs` by default (which is `mmapfs` on 64-bit systems).
Currently, the `terms` query is just syntactic sugar for a `bool` query when
used in a query context. This change proposes to always generate the same query
in query and filter contexts, which is less confusing.
For the record, I also had to remove the geo-hash cell and geo-distance range
queries to make the code compile. These queries already throw an exception in
all cases with 5.x indices, so that does not hurt any more.
I also had to rename all 2.x bwc indices from `index-${version}` to
`unsupported-${version}` to make `OldIndexBackwardCompatibilityIT`
happy.
These query names were all deprecated in 5.0.0:
- `in` is removed in favour of `terms`
- `geo_bbox` is removed in favour of `geo_bounding_box`
- `mlt` is removed in favour of `more_like_this`
- `fuzzy_match` and `match_fuzzy` are removed in favour of `match`
Set lucene version to 6.4.0-snapshot-ec38570 and update all the sha1s/license
Fix invalid combo after upgrade in query_string query. `split_on_whitespace=false` is disallowed if `auto_generate_phrase_queries=true`
Adapt the expectations of some tests to the new format of the Lucene explain output
Lucene 6.2 added index and query support for numeric ranges. This commit adds a new RangeFieldMapper for indexing numeric (int, long, float, double) and date ranges and creating appropriate range and term queries. The design is similar to NumericFieldMapper in that it uses a RangeType enumerator for implementing the logic specific to each type. The following range types are supported by this field mapper: int_range, float_range, long_range, double_range, date_range.
Lucene does not provide a DocValue field specific to RangeField types so the RangeFieldMapper implements a CustomRangeDocValuesField for handling doc value support.
When executing a Range query over a Range field, the RangeQueryBuilder has been enhanced to accept a new relation parameter for defining the type of query as one of: WITHIN, CONTAINS, INTERSECTS. This provides support for finding all ranges that are related to a specific range in a desired way. As with other spatial queries, DISJOINT can be achieved as a MUST_NOT of an INTERSECTS query.
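A minimal sketch of mapping and querying a range field (names are illustrative; the type is listed above as `int_range`, though the released mapper may spell it `integer_range`):
```
PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": { "type": "integer_range" }
      }
    }
  }
}

GET range_index/_search
{
  "query": {
    "range": {
      "expected_attendees": {
        "gte": 9,
        "lte": 10,
        "relation": "within"
      }
    }
  }
}
```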
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.
This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.