ingest: Introduction of a bytes processor
This processor allows for human readable byte values (e.g. 1kb) to be converted to value in bytes (e.g. 1024). Internally this processor re-uses "ByteSizeValue.parseBytesSizeValue" which supports conversions up to Long.MAX_VALUE and the following units: "b", "kb", "mb", "gb", "tb", pb".
This change also introduces a generic return type for the AbstractStringProcessor to allow for code reuse while supporting a String -> T conversion. (String -> Long in this case).
Significantly improve the example snippets in the documentation.
The examples are part of the test suite and checked nightly.
To help readability, the existing dataset was extended (test_emp renamed
to emp plus library).
Improve output of JDBC tests to be consistent with the CLI
Add lenient flag to JDBC asserts to allow type widening (a long is
equivalent to a integer as long as the value is the same).
So far the in-flight request circuit breaker has only accounted for the
on-the-wire representation of a request. However, we convert the raw
request into XContent internally which increases the overhead.
Therefore, we increase the value of the corresponding setting
`network.breaker.inflight_requests.overhead` from one to two. While this
value is still rather conservative (we assume that the representation as
structured objects has no overhead compared to the byte[]), it is closer
to reality than the current value.
Relates #31613
The range docs had an introductory section that described how to set up
and index *and* a test setup section in `docs/build.gradle` that
duplicated that section. This is bad because these section can (and do)
drift from one another. This change removes the setup in build.gradle
and marks the introductor snippet with `// TESTSETUP` so it is used on
all the snippets.
* Migrate scripted metric aggregation scripts to ScriptContext design #29328
* Rename new script context container class and add clarifying comments to remaining references to params._agg(s)
* Misc cleanup: make mock metric agg script inner classes static
* Move _score to an accessor rather than an arg for scripted metric agg scripts
This causes the score to be evaluated only when it's used.
* Documentation changes for params._agg -> agg
* Migration doc addition for scripted metric aggs _agg object change
* Rename "agg" Scripted Metric Aggregation script context variable to "state"
* Rename a private base class from ...Agg to ...State that I missed in my last commit
* Clean up imports after merge
A link to the ip_range datatype page provides a way for newer users to know
it exists if they land directly on the ip datatype page first via a search.
The `multiplexer` filter emits multiple tokens at the same position, each
version of the token haivng been passed through a different filter chain.
Identical tokens at the same position are removed.
This allows users to, for example, index lowercase and original-case tokens,
or stemmed and unstemmed versions, in the same field, so that they can search
for a stemmed term within x positions of an unstemmed term.
Folks tend to want to be able to make a single `_reindex` call to
migrate many indices. You *can* do that and we even have an example of
how to do that in the docs but it isn't always a good idea. This change
adds some advice to the docs: generally you want to make one reindex
call per index.
Closes#22920
This switches the docs tests from the `oss-zip` distribution to the
`zip` distribution so they have xpack installed and configured with the
default basic license. The goal is to be able to merge the
`x-pack/docs` directory into the `docs` directory, marking the x-pack
docs with some kind of marker. This is the first step in that process.
This also enables `-Dtests.distribution` support for the `docs`
directory so you can run the tests against the `oss-zip` distribution
with something like
```
./gradlew -p docs check -Dtests.distribution=oss-zip
```
We can set up Jenkins to run both.
Relates to #30665
The other metric aggregations (min/max/etc) return `null` as their XContent value and string when nothing was computed (due to empty/missing fields). Percentiles and Percentile Ranks, however, return `NaN `which is inconsistent and confusing for the user. This fixes the inconsistency by making the aggs return `null`. This applies to both the numeric value and the "as string" value.
Note: like the metric aggs, this does not change the value if fetched directly from the percentiles object, which will return as `NaN`/`"NaN"`. This only changes the XContent output.
While this is a bugfix, it still breaks bwc in a minor way as the response changes from prior version.
Closes#29066
This commit adds the is-write-index flag for aliases.
It allows requests to set the flag, and responses to display the flag.
It does not validate and/or affect any indexing/getting/updating behavior
of Elasticsearch -- this will be done in a follow-up PR.
With #29331 we added support for the cluster health API to the
high-level REST client. The transport client does not support the level
parameter, and it always returns all the info needed for shards level
rendering. We have maintained that behaviour when adding support for
cluster health to the high-level REST client, to ease migration, but the
correct thing to do is to default the high-level REST client to
`cluster` level, which is the same default as when going through the
Elasticsearch REST layer.
This commit removes all the API methods that accept a `Header` varargs
argument, in favour of the newly introduced API methods that accept a
`RequestOptions` argument.
Relates to #31069
This adds a thread interrupter that allows us to encapsulate calls to org.joni.Matcher#search()
This method can hang forever if the regex expression is too complex.
The thread interrupter in the background checks every 3 seconds whether there are threads
execution the org.joni.Matcher#search() method for longer than 5 seconds and
if so interrupts these threads.
Joni has checks that that for every 30k iterations it checks if the current thread is interrupted and
if so returns org.joni.Matcher#INTERRUPTED
Closes#28731
With `max_concurrent_shard_requests` we used to throttle / limit
the number of concurrent shard requests a high level search request
can execute per node. This had several problems since it limited the
number on a global level based on the number of nodes. This change
now throttles the number of concurrent requests per node while still
allowing concurrency across multiple nodes.
Closes#31192
Adds support for `ignore_unmapped` parameter in geo distance sorting,
which is functionally equivalent to specifying an `unmapped_type` in
the field sort.
Closes#28152
This field is similar to the `feature` field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.
Relates #27552
By default span_multi query will limit term expansions = boolean max clause.
This will limit high heap usage in case of high cardinality term
expansions. This applies only if top_terms_N is not used in inner multi
query.
* Fix index prefixes to work with span_multi
Text fields that use `index_prefixes` can rewrite `prefix` queries into
`term` queries internally. This commit fix the handling of this rewriting
in the `span_multi` query.
This change also copies the index options of the text field into the
prefix field in order to be able to run positional queries. This is mandatory
for `span_multi` to work but this could also be useful to optimize `match_phrase_prefix`
queries in a follow up. Note that this change can only be done on indices created
after 6.3 since we set the index options to doc only in this version.
Fixes#31056
When `lenient=false`, attempts to create match phrase queries with custom analyzers against non-text fields will throw an IllegalArgumentException.
Also changes `*Match*QueryBuilderTests` so that it avoids this scenario
Fixes#31061
The documentation for the account circuit breaker listed the settings for it's limit and overhead to be `network.breaker.accounting.limit` and `network.breaker.accounting.overhead` when in `HieratchyCircuitBreakerService` it seems the settings are actually `indices.breaker.accounting.limit` and `indices.breaker.accounting.overhead`.
* [DOCS] Rewords _field_names documentation
Corrects the language around when we write to `_field_names` and when you might want to disable it given that n recent versions it does not carry the indexing overhead it once did.
Relates to #30862
* Update wording following review
Specifying `index_phrases: true` on a text field mapping will add a subsidiary
[field]._index_phrase field, indexing two-term shingles from the parent field.
The parent analysis chain is re-used, wrapped with a FixedShingleFilter.
At query time, if a phrase match query is executed, the mapping will redirect it
to run against the subsidiary field.
This should trade faster phrase querying for a larger index and longer indexing
times.
Relates to #27049
This change adds an option named `split_queries_on_whitespace` to the `keyword`
field type. When set to true full text queries (`match`, `multi_match`, `query_string`, ...) that target the field will split the input on whitespace to build the query terms. Defaults to `false`.
Closes#30393
This change deprecates completion queries and documents without context that target a
context enabled completion field. Querying without context degrades the search
performance considerably (even when the number of indexed contexts is low).
This commit targets master but the deprecation will take place in 6.x and the functionality
will be removed in 7 in a follow up.
Closes#29222
Currently failures to compile a script usually lead to a ScriptException, which
inherits the 500 INTERNAL_SERVER_ERROR from ElasticsearchException if it does
not contain another root cause. Instead, this should be a 400 Bad Request error.
This PR changes this more generally for script compilation errors by changing
ScriptException to return 400 (bad request) as status code.
Closes#12315
This change adds a new option to the composite aggregation named `missing_bucket`.
This option can be set by source and dictates whether documents without a value for the
source should be ignored. When set to true, documents without a value for a field emits
an explicit `null` value which is then added in the composite bucket.
The `missing` option that allows to set an explicit value (instead of `null`) is deprecated in this change and will be removed in a follow up (only in 7.x).
This commit also changes how the big arrays are allocated, instead of reserving
the provided `size` for all sources they are created with a small intial size and they grow
depending on the number of buckets created by the aggregation:
Closes#29380
Include size of snapshot in snapshot metadata
Adds difference of number of files (and file sizes) between prev and current snapshot. Total number/size reflects total number/size of files in snapshot.
Closes#18543
Treats geohashes as grid cells instead of just points when the
geohashes are used to specify the edges in the geo_bounding_box
query. For example, if a geohash is used to specify the top_left
corner, the top left corner of the geohash cell will be used as the
corner of the bounding box.
Closes#25154
The current documentation isn't very clear about how incomplete dates are
treated when specifying custom formats in a `range` query. This change adds a
note explaining how missing month or year coordinates translate to dates that
have the missings slots filled with unix time start date (1970-01-01)
Closes#30634
This commit reintroduces 31251c9 and 63a5799. These commits introduced a
memory leak and were reverted. This commit brings those commits back
and fixes the memory leak by removing unnecessary retain method calls.
This reverts commit 31251c9 introduced in #30695.
We suspect this commit is causing the OOME's reported in #30811 and we will use this PR to test this assertion.
This commit adds the ability to configure how a docvalue field should be
formatted, so that it would be possible eg. to return a date field
formatted as the number of milliseconds since Epoch.
Closes#27740
Lucene has a new `FeatureField` which gives the ability to record numeric
features as term frequencies. Its main benefit is that it allows to boost
queries with the values of these features and efficiently skip non-competitive
documents at the same time using block-max WAND and indexed impacts.
Currently the first snippet in the documentation test in script-fields.asciidoc
isn't executed, although it has the CONSOLE annotation. Adding a test setup
annotation to it seems to fix the problem.
This is related to #29500 and #28898. This commit removes the abilitiy
to disable http pipelining. After this commit, any elasticsearch node
will support pipelined requests from a client. Additionally, it extracts
some of the http pipelining work to the server module. This extracted
work is used to implement pipelining for the nio plugin.
=== Char Group Tokenizer
The `char_group` tokenizer breaks text into terms whenever it encounters
a
character which is in a defined set. It is mostly useful for cases where
a simple
custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>>
is not acceptable.
=== Configuration
The `char_group` tokenizer accepts one parameter:
`tokenize_on_chars`::
A string containing a list of characters to tokenize the string on.
Whenever a character
from this list is encountered, a new token is started. Also supports
escaped values like `\\n` and `\\f`,
and in addition `\\s` to represent whitespace, `\\d` to represent
digits and `\\w` to represent letters.
Defaults to an empty list.
=== Example output
```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```
When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the
above sentence would produce the following terms:
```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone,
for, $2 ]```
This commit fixes docs failure on language analyzers when compared to the built in analyzers.
The `elision` filters used by the rebuilt language analyzers should be case insensitive to match
the definition of the prebuilt analyzers.
Closes#30557
The getDate() and getDates() existed prior to 5.x on long fields in
scripting. In 5.x, a new Date type for ScriptDocValues was added. The
getDate() and getDates() methods were left on long fields and added to date
fields to ease the transition. This commit removes those methods for
7.0.
This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points. It is the scripted version of MovingAvg pipeline agg.
Through custom script contexts, we expose a number of convenience methods:
- MovingFunctions.max()
- MovingFunctions.min()
- MovingFunctions.sum()
- MovingFunctions.unweightedAvg()
- MovingFunctions.linearWeightedAvg()
- MovingFunctions.ewma()
- MovingFunctions.holt()
- MovingFunctions.holtWinters()
- MovingFunctions.stdDev()
The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
The geo_bounding_box query might produce false positives alongside
the right and upper edges and false negatives alongside left and
bottom edges. This commit documents the behavior and defines the
maximum error.
Closes#29196
This commit changes the default out-of-the-box configuration for the
number of shards from five to one. We think this will help address a
common problem of oversharding. For users with time-based indices that
need a different default, this can be managed with index templates. For
users with non-time-based indices that find they need to re-shard with
the split API in place they no longer need to resort only to
reindexing.
Since this has the impact of changing the default number of shards used
in REST tests, we want to ensure that we still have coverage for issues
that could arise from multiple shards. As such, we randomize (rarely)
the default number of shards in REST tests to two. This is managed via a
global index template. However, some tests check the templates that are
in the cluster state during the test. Since this template is randomly
there, we need a way for tests to skip adding the template used to set
the number of shards to two. For this we add the default_shards feature
skip. To avoid having to write our docs in a complicated way because
sometimes they might be behind one shard, and sometimes they might be
behind two shards we apply the default_shards feature skip to all docs
tests. That is, these tests will always run with the default number of
shards (one).
We want copying settings to be the default behavior. This commit
deprecates not copying settings, and disallows explicitly not copying
settings. This gives users a transition path to the future default
behavior.
We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built *an* analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in anlayzer. Unsuprisingly, some of the examples aren't quite
right.
This adds a mechanism that tests that the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
`_analyze` API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both APIs return the same results.
Each of these calls to `_anlayze` takes about 20ms on my laptop which
seems fine.
We had been using `task_id:1` or `taskId:1` because it is parses as a
valid task identifier but the `:1` part is confusing. This replaces
those examples with `task_id` which matches the response from the list
tasks API.
Closes#28314
This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
This commit removes the http.enabled setting. While all real nodes (started with bin/elasticsearch) will always have an http binding, there are many tests that rely on the quickness of not actually needing to bind to 2 ports. For this case, the MockHttpTransport.TestPlugin provides a dummy http transport implementation which is used by default in ESIntegTestCase.
closes#12792
Systemd overrides should happen through /etc/systemd/system, not
directly editing the service file. This commit removes marking the
service file as configuration for rpm and deb packages.
This adds a new `_ignored` meta field which indexes and stores fields that have
been ignored at index time because of the `ignore_malformed` option. It makes
malformed documents easier to identify by using `exists` or `term(s)` queries
on the `_ignored` field.
Closes#29494