In the Google Groups forum there appears to be some confusion as to what mlt
does. This documentation update should hopefully help demystifying this
feature, and provide some understanding as to how to use its parameters.
Closes#6092
- Randomized integration tests for the benchmark API.
- Negative tests for cases where the cluster cannot run benchmarks.
- Return 404 on missing benchmark name.
- Allow to specify 'types' as an array in the JSON syntax when describing a benchmark competition.
- Don't record slowest for single-request competitions.
Closes#6003, #5906, #5903, #5904
Adds a table with the exhaustive list of all available headers with a brief description (mostly from `org.elasticsearch.rest.action.cat.RestNodesAction`) so that people do not need to go searching for them in the code like I did, or search through `nodes?help`.
Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.
This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.
On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= `shard_min_doc_count` which can be set to something like
`min_doc_count`/number of shards because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.
closes#5998closes#6041
Update `geo-shape-type.asciidoc` to include all `GeoShapeType`s supported by the `org.elasticsearch.common.geo.builders.ShapeBuilder`.
Changes include:
1. A tabular mapping of GeoJSON types to Elasticsearch types
2. Listing all types, with brief examples, for all support Elasticsearch types
3. Putting non-standard types to the bottom (really just moving Envelope to the bottom)
4. Linking to all GeoJSON types.
5. Adding whitespace around tightly nested arrays (particularly `multipolygon`) for readability
The possibility of filtering for index templates in the cluster state API
had been introduced before there was a dedicated index templates API. This
commit removes this support from the cluster state API, as it was not really
clean, requiring you to specify the metadata and the index templates.
Closes#4954
A boost terms factor of 1.0 is not the same as no boosting of terms.
The desired behavior is to deactivate boosting by default. If the user
specifies any value other than 0, then boosting is activated.
Closes#6021
Updating to this version allows to configure a special JNA directory,
in case the /tmp directory is mounted with the noexec option, as JNA
extracts some data and tries to execute parts of it.
Also updated documentation to clarify mlockall and memory settings as well
as pointing to the new jna.tmpdir system property.
Closes#5493
Decay functions currently only use the first value in a field that contains
multiple values to compute the distance to the origin. Instead, it should
consider all distances if more values are in the field and then use
one of min/max/sum/avg which is defined by the user.
Relates to #3960closes#5940
Separate version check logic for reads and writes for all version types, which allows different behavior in these cases.
Change `VersionType.EXTERNAL` & `VersionType.EXTERNAL_GTE` to behave the same as `VersionType.INTERNAL` for read operations.
The previous behavior was fit for writes but is useless in reads.
This commit also makes the usage of `EXTERNAL` & `EXTERNAL_GTE` in the update api raise a validation error as it make cause data to
be lost.
Closes#5663 , Closes#5661, Closes#5929
The current setting of 20MB/sec seems to be too conservative given
the capabilities of modern hardware / network throughput.
A 50MB default should provide better out of the box performance.
Change the default numeric precision_step to 16 for 64-bit types,
8 for 32-bit and 16-bit types. Disable precision_step for the 8-bit
byte type.
Closes#5905
The current setting of 20MB/sec seems to be too conservative given
the capabilities of modern hardware. Even on cloud infrastructure this
seems to be too lowish. A 50MB default should provide better out of the box
performance
Currently we use 5k operations as a flush threshold. Indexing 5k documents
per second is rather common which would cause the index to be committed on
the lucene level each time the flush logic runs which is 5 seconds by default.
We should rather use a size based threshold similar to the lucene index writer
that doesn't cause such agressive commits which can slow down indexing significantly
especially since they cause the underlying devices to fsync their data.
Load tests showed that SerialMS has problems to keep up with
the merges under high load. We should switch back to CMS
until we have a better story to balance merge
threads / efforts across shards on a single node.
Closes#5817
Add an API endpoint at /_bench for submitting, listing, and aborting
search benchmarks. This API can be used for timing search requests,
subject to various user-defined settings.
Benchmark results provide summary and detailed statistics on such
values as min, max, and mean time. Values are reported per-node so that
it is easy to spot outliers. Slow requests are also reported.
Long running benchmarks can be viewed with a GET request, or aborted
with a POST request.
Benchmark results are optionally stored in an index for subsequent
analysis.
Closes#5407
The default precision was way too exact and could lead people to
think that geo context suggestions are not working. This patch now
requires you to set the precision in the mapping, as elasticsearch itself
can never tell exactly, what the required precision for the users
suggestions are.
Closes#5621
The `field_value_factor` function uses the value of a field in the
document to influence the score.
A query that looks like:
{
"query": {
"function_score": {
"query": {"match": { "body": "foo" }},
"functions": [
{
"field_value_factor": {
"field": "popularity",
"factor": 1.1,
"modifier": "square"
}
}
],
"score_mode": "max",
"boost_mode": "sum"
}
}
}
Would have the score modified by:
square(1.1 * doc['popularity'].value)
Closes#5519
allow to configure on the index level which blocks can optionally be applied using tribe.blocks.indices prefix settings.
allow to control what will be done when a conflict is detected on index names coming from several clusters using the tribe.on_conflict setting. Defaults remains "any", but now support also "drop" and "prefer_[tribeName]".
closes#5501
Adds a new API endpoint at /_recovery as well as to the Java API. The
recovery API allows one to see the recovery status of all shards in the
cluster. It will report on percent complete, recovery type, and which
files are copied.
Closes#4637
By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when requesting empty buckets (min_doc_count : 0), this causes a confusion, specifically, when the data is also filtered.
To understand why, let's look at an example:
Lets say the you're filtering your request to get all docs from the last month, and in the date_histogram aggs you'd like to slice the data per day. You also specify min_doc_count:0 so that you'd still get empty buckets for those days to which no document belongs. By default, if the first document that fall in this last month also happen to fall on the first day of the **second week** of the month, the date_histogram will **not** return empty buckets for all those days prior to that second week. The reason for that is that by default the histogram aggregations only start building buckets when they encounter documents (hence, missing on all the days of the first week in our example).
With extended_bounds, you now can "force" the histogram aggregations to start building buckets on a specific min values and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if the min_doc_count is greater than 0).
Note that (as the name suggest) extended_bounds is **not** filtering buckets. Meaning, if the min bounds is higher than the values extracted from the documents, the documents will still dictate what the min bucket will be (and the same goes to the extended_bounds.max and the max bucket). For filtering buckets, one should nest the histogram agg under a range filter agg with the appropriate min/max.
Closes#5224
Today, we use ConcurrentMergeScheduler, and this can be painful since it is concurrent on a shard level, with a max of 3 threads doing concurrent merges. If there are several shards being indexed, then there will be a minor explosion of threads trying to do merges, all being throttled by our merge throttling.
Moving to serial merge scheduler will still maintain concurrency of merges across shards, as we have the merge thread pool that schedules those merges. It will just be a serial one on a specific shard.
Also, on serial merge scheduler, we now have a limit of how many merges it will do at one go, so it will let other shards get their fair chance of merging. We use the pending merges on IW to check if merges are needed or not for it.
Note, that if a merge is happening, it will not block due to a sync on the maybeMerge call at indexing (flush) time, since we wrap our merge scheduler with the EnabledMergeScheduler, where maybeMerge is not activated during indexing, only with explicit calls to IW#maybeMerge (see Merges).
closes#5447
If we want to have a full picture of versions running in a cluster, we need to add a `_cat/plugins` endpoint.
Response could look like:
```sh
% curl es2:9200/_cat/plugins?v
node component version type url desc
es1 mapper-attachments 1.7.0 j Adds the attachment type allowing to parse difference attachment formats
es1 lang-javascript 1.4.0 j JavaScript plugin allowing to add javascript scripting support
es1 analysis-smartcn 1.9.0 j Smart Chinese analysis support
es1 marvel 1.1.0 j/s http://localhost:9200/_plugins/marvel Elasticsearch Management & Monitoring
es1 kopf 0.5.3 s http://localhost:9200/_plugins/kopf kopf - simple web administration tool for ElasticSearch
es2 mapper-attachments 2.0.0.RC1 j Adds the attachment type allowing to parse difference attachment formats
es2 lang-javascript 2.0.0.RC1 j JavaScript plugin allowing to add javascript scripting support
es2 analysis-smartcn 2.0.0.RC1 j Smart Chinese analysis support
```
Closes#4824.
Significance is related to the changes in document frequency observed between everyday use in the corpus and
frequency observed in the result set. The asciidocs include extensive details on the applications of this feature.
Closes#5146
This aggregation computes unique term counts using the hyperloglog++ algorithm
which uses linear counting to estimate low cardinalities and hyperloglog on
higher cardinalities.
Since this algorithm works on hashes, it is useful for high-cardinality fields
to store the hash of values directly in the index, which is the purpose of
the new `murmur3` field type. This is less necessary on low-cardinality
string fields because the aggregator is smart enough to only compute the hash
once per unique value per segment thanks to ordinals, or on numeric fields
since hashing them is very fast.
Close#5426
================
This commit extends the `CompletionSuggester` by context
informations. In example such a context informations can
be a simple string representing a category reducing the
suggestions in order to this category.
Three base implementations of these context informations
have been setup in this commit.
- a Category Context
- a Geo Context
All the mapping for these context informations are
specified within a context field in the completion
field that should use this kind of information.
Introduced two levels of randomization for the number of shards (between 1 and 10) when running tests:
1) through the existing random index template, which now sets a random number of shards that is shared across all the indices created in the same test method unless overwritten
2) through `createIndex` and `prepareCreate` methods, similar to what happens using the `indexSettings` method, which changes for every `createIndex` or `prepareCreate` unless overwritten (overwrites index template for what concerns the number of shards)
Added the following facilities to deal with the random number of shards:
- `getNumShards` to retrieve the number of shards of a given existing index, useful when doing comparisons based on the number of shards and we can avoid specifying a static number. The method returns an object containing the number of primaries, number of replicas and the total number of shards for the existing index
- added `assertFailures` that checks that a shard failure happened during a search request, either partial failure or total (all shards failed). Checks also the error code and the error message related to the failure. This is needed as without knowing the number of shards upfront, when simulating errors we can run into either partial (search returns partial results and failures) or total failures (search returns an error)
- added common methods similar to `indexSettings`, to be used in combination with `createIndex` and `prepareCreate` method and explicitly control the second level of randomization: `numberOfShards`, `minimumNumberOfShards` and `maximumNumberOfShards`. Added also `numberOfReplicas` despite the number of replicas is not randomized (default not specified but can be overwritten by tests)
Tests that specified the number of shards have been reviewed and the results follow:
- removed number_of_shards in node settings, ignored anyway as it would be overwritten by both mechanisms above
- remove specific number of shards when not needed
- removed manual shards randomization where present, replaced with ordinary one that's now available
- adapted tests that didn't need a specific number of shards to the new random behaviour
- fixed a couple of test bugs (e.g. 3 levels parent child test could only work on a single shard as the routing key used for grand-children wasn't correct)
- also done some cleanup, shared code through shard size facets and aggs tests and used common methods like `assertAcked`, `ensureGreen`, `refresh`, `flush` and `refreshAndFlush` where possible
- made sure that `indexSettings()` is always used as a basis when using `prepareCreate` to inject specific settings
- converted indexRandom(false, ...) + refresh to indexRandom(true, ...)
Supports sorting on sub-aggs down the current hierarchy. This is supported as long as the aggregation in the specified order path are of a single-bucket type, where the last aggregation in the path points to either a single-bucket aggregation or a metrics one. If it's a single-bucket aggregation, the sort will be applied on the document count in the bucket (i.e. doc_count), and if it is a metrics type, the sort will be applied on the pointed out metric (in case of a single-metric aggregations, such as avg, the sort will be applied on the single metric value)
NOTE: this commit adds a constraint on what should be considered a valid aggregation name. Aggregations names must be alpha-numeric and may contain '-' and '_'.
Closes#5253
Lucene 4.7 supports a setter for the `filler_token` that is
inserted if there are gaps in the token stream. This change exposes
this setting.
Closes#4307
In #4052 we added support for highlighting multi term queries using the postings highlighter. That worked only for top-level queries though, and not for multi term queries that are nested for instance within a bool query, or filtered query, or a constant score query.
The way we make this work is by walking the query structure and temporarily overriding the query rewrite method with a method that allows for multi terms extraction.
Closes#5102
Fixes#5128
Remove java 7 specific Locale functions, add "coming[1.1.0]" to documentation
add LocaleUtils utility class for dealing with Locale functions
Adds support for storing mustache based query templates that can later be filled
with query parameter values at execution time. Templates may be both quoted,
non-quoted and referencing templates stored in config/scripts/*.mustache by file
name.
See docs/reference/query-dsl/queries/template-query.asciidoc for templating
examples.
Implementation detail: mustache itself is being shaded as it depends directly on
guava - so having it marked optional but included in the final distribution
raises chances of version conflicts downstream.
Fixes#4879
It is now possible to specify aliases during index creation:
curl -XPUT 'http://localhost:9200/test' -d '
{
"aliases" : {
"alias1" : {},
"alias2" : {
"filter" : { "term" : {"field":"value"}}
}
}
}'
Closes#4920
In order to be consistent (and because in 1.0 we switched from
parameter driven information to specifzing the metrics as part of the URI)
this patch moves from 'plugin' to 'plugins' in the Nodes Info API.
`cross_fields` attemps to treat fields with the same analysis
configuration as a single field and uses maximum score promotion or
combination of the scores based depending on the `use_dis_max` setting.
By default scores are combined. `cross_fields` can also search across
fields of hetrogenous types for instance if numbers can be part of
the query it makes sense to search also on numeric fields if an analyzer
is provided in the reqeust.
Relates to #2959
* Mostly minor things like typos and grammar stuff
* Some clarifications
* The note on the deprecation was ambiguous. I've removed the problematic part so that it now definitely says it's deprecated
Currently, boosting on `copy_to` is misleading and does not work as originally specified in #4520. Instead of boosting just the terms from the origin field, it boosts the whole destination field. If two fields copy_to a third field, one with a boost of 2 and another with a boost of 3, all the terms in the third field end up with a boost of 6. This was not the intention.
The alternative: to store the boost in a payload for every term, results in poor performance and inflexibility. Instead, users should either (1) query the common field AND the field that requires boosting, or (2) the multi_match query will soon be able to perform term-centric cross-field matching that will allow per-field boosting at query time (coming in 1.1).
By default active, rejected and queue thread statistics are included for the index, bulk and search thread pool.
Other thread statistics of other thread pools can be included via the `h` query string parameter.
Closes#4907
Detects if rescores arrive as an array instead of a plain object. If so
then parse each element of the array as a separate rescore to be executed
one after another. It looks like this:
"rescore" : [ {
"window_size" : 100,
"query" : {
"rescore_query" : {
"match" : {
"field1" : {
"query" : "the quick brown",
"type" : "phrase",
"slop" : 2
}
}
},
"query_weight" : 0.7,
"rescore_query_weight" : 1.2
}
}, {
"window_size" : 10,
"query" : {
"score_mode": "multiply",
"rescore_query" : {
"function_score" : {
"script_score": {
"script": "log10(doc['numeric'].value + 2)"
}
}
}
}
} ]
Rescores as a single object are still supported.
Closes#4748
Removed unused misc.asciidoc file
Added plugins directory to directory layout
Fixed transport.tcp.connect_timeout value to match the code found in NetworkService.TcpSettings
Clarified that phrase query does not preserve order of terms
Clarified merge page
Added instructions on how to build documentation to docs/README
Terms aggregations return up to `size` terms, so up to now, the way to get all
matching terms back was to set `size` to an arbitrary high number that would be
larger than the number of unique terms.
Terms aggregators already made sure to not allocate memory based on the `size`
parameter so this commit mostly consists in making `0` an alias for the
maximum integer value in the TermsParser.
Close#4837
Adds a new FetchSubPhase, FieldDataFieldsFetchSubPhase, which loads the
field data cache for a field and returns an array of values for the
field.
Also removes `doc['<field>']` and `_source.<field>` workaround no longer
needed in field name resolving.
Closes#4492
* Make it clearer that `aggs` is an allowed synomym
for the `aggregations` key
* Fix broken example in for datehistogram, `1.5M` is
not an allowed interval
* Make use of colon before examples consistent
* Fix typos
- Removed "ok": true from response examples
- Added "created" flag to index response examples
- Replaced exists flag with found in delete response examples
* Made GET mappings consistent, supporting
* /{index}/_mappings/{type}
* /{index}/_mapping/{type}
* /_mapping/{type}
* Added "mappings" in the JSON response to align it with other responses
* Made GET warmers consistent, support /{index}/_warmers/{type} and /_warmer, /_warner/{name}
as well as wildcards and _all notation
* Made GET aliases consistent, support /{index}/_aliases/{name} and /_alias, /_aliases/{name}
as well as wildcards and _all notation
* Made GET settings consistent, added /{index}/_setting/{name}, /_settings/{name}
as well as supportings wildcards in settings name
* Returning empty JSON instead of a 404, if a specific warmer/
setting/alias/type is missing
* Added a ton of spec tests for all of the above
* Added a couple of more integration tests for several features
Relates #4071
See issue #4071
PUT options for _mapping:
Single type can now be added with
`[PUT|POST] {index|_all|*|regex|blank}/[_mapping|_mappings]/type`
and
`[PUT|POST] {index|_all|*|regex|blank}/type/[_mapping|_mappings]`
PUT options for _warmer:
PUT with a single warmer can now be done with
`[PUT|POST] {index|_all|*|prefix*|blank}/{type|_all|*|prefix*|blank}/[_warmer|_warmers]/warmer_name`
PUT options for _alias:
Single alias can now be PUT with
`[PUT|POST] {index|_all|*|prefix*|blank}/[_alias|_aliases]/alias`
DELETE options _mapping:
Several mappings can be deleted at once by defining several indices and types with
`[DELETE] /{index}/{type}`
`[DELETE] /{index}/{type}/_mapping`
`[DELETE] /{index}/_mapping/{type}`
where
`index= * | _all | glob pattern | name1, name2, …`
`type= * | _all | glob pattern | name1, name2, …`
Alternatively, the keyword `_mapings` can be used.
DELETE options for _warmer:
Several warmers can be deleted at once by defining several indices and names with
`[DELETE] /{index}/_warmer/{type}`
where
`index= * | _all | glob pattern | name1, name2, …`
`type= * | _all | glob pattern | name1, name2, …`
Alternatively, the keyword `_warmers` can be used.
DELETE options for _alias:
Several aliases can be deleted at once by defining several indices and names with
`[DELETE] /{index}/_alias/{type}`
where
`index= * | _all | glob pattern | name1, name2, …`
`type= * | _all | glob pattern | name1, name2, …`
Alternatively, the keyword `_aliases` can be used.
Fixes#4701. Changes behavior of the snapshot operation. The operation now fails if not all primary shards are available at the beginning of the snapshot operation. The restore operation no longer tries to restore indices with shards that failed or were missing during snapshot operation.
Currently it is possible to index a document as:
```
POST /myindex/mytype/1
{ "foo"...}
```
or as:
```
POST /myindex/mytype/1
{
"mytype": {
"foo"...
}
}
```
This makes indexing non-deterministic and fields can be misinterpreted
as type names.
This changes makes Elasticsearch accept only the first form by default,
ie without the type wrapper. This can be changed by setting
`index.mapping.allow_type_wrapper` to `true`` when creating the index.
Closes#4484
When set to false a new strict mode of parsing is employed which
a) does not permit numbers to be passed as JSON strings in quotes
b) rejects numbers with fractions that are passed to integer, short or long fields.
Closes#4117
Java Builder apis drop old “len” methods in favour of new “length”
Rest APIs support both old “len: and new “length” forms using new ParseField class to a) provide compiler-checked consistency between Builder and Parser classes and
b) a common means of handling deprecated syntax in the DSL.
Documentation and rest specs only document the new “*length” forms
Closes#4083
`standard_html_strip` and `pattern` analyzer support stopwords which are
set to the default `english` stopwords by default. Those analyzers
should not use stopwords by default since they are language neutral
Closes#4699
`min_doc_count` is the minimum number of hits that a term or histogram key
should match in order to appear in the response.
`min_doc_count=0` replaces `compute_empty_buckets` for histograms and will
behave exactly like facets' `all_terms=true` for terms aggregations.
Close#4662
When upgrading to ES 1.0 the existing mappings with a multi-field type automatically get replaced to a core field with the new `fields` option.
If a `multi_field` type-ed field doesn't have a main / default field, a default field will be chosen for the multi fields syntax. The new main field type
will be equal to the first `multi_field` fields' field or type string if no fields have been configured for the `multi_field` field and in both cases
the default index will not be indexed (`index=no` is set on the default field).
If a `multi_field` typed field has a default field, that field will replace the `multi_field` typed field.
Closes to #4521
============
The default unit for measuring distances is *MILES* in most cases. This commit moves ES
over to the *International System of Units* and make it work on a default which relates
to *METERS* . Also the current structures of the `GeoBoundingBox Filter` changed in
order to define the *Bounding* by setting abitrary corners.
Distances
---------
Since the default unit for measuring distances has changed to a default unit
`DistanceUnit.DEFAULT` relating to *meters*, the **REST API** has changed at the
following places:
* `ScriptDocValues.factorDistance()` returns *meters* instead of *miles*
* `ScriptDocValues.factorDistanceWithDefault()` returns *meters* instead of *miles*
* `ScriptDocValues.arcDistance()` returns *meters* instead of *miles*
one might use `ScriptDocValues.arcDistanceInMiles()`
* `ScriptDocValues.arcDistanceWithDefault()` returns *meters* instead of *miles*
* `ScriptDocValues.distance()` returns *meters* instead of *miles*
one might use `ScriptDocValues.distanceInMiles()`
* `ScriptDocValues.distanceWithDefault()` returns *meters* instead of *miles*
one might use `ScriptDocValues.distanceInMilesWithDefault()`
* `GeoDistanceFilter` default unit changes from *kilometers* to *meters*
* `GeoDistanceRangeFilter` default unit changes from *miles* to *meters*
* `GeoDistanceFacet` default unit changes from *miles* to *meters*
Geo Bounding Box Filter
-----------------------
The naming of the GeoBoundingBoxFilter properties allows to set arbitrary corners
(see #4084) namely `top_right`, `top_left`, `bottom_right` and `bottom_left`. This
change also includes the fields `topRight` and `bottomLeft` Also it is be possible to
set the single values by using just `top`, `bottom`, `left` and `right` parameters.
Closes#4515, #4084
Also changes the response format of that section to:
```
"open_file_descriptors": {
"min": 200,
"max": 346,
"avg": 273
}
```
Closes#4681
Note: this is an aggregate of 3 commits in the 0.90 branch
A lot of different API's currently use different names for the
same logical parameter. Since lucene moved away from the notion
of a `similarity` and now uses an `fuzziness` we should generalize
this and encapsulate the generation, parsing and creation of these
settings across all queries.
This commit adds a new `Fuzziness` class that handles the renaming
and generalization in a backwards compatible manner.
This commit also added a ParseField class to better support deprecated
Query DSL parameters
The ParseField class allows specifying parameger that have been deprecated.
Those parameters can be more easily tracked and removed in future version.
This also allows to run queries in `strict` mode per index to throw
exceptions if a query is executed with deprected keys.
Closes#4082
`allocation.disable_new_allocation`, `allocation.disable_allocation`, `allocation.disable_replica_allocation`,
in favour for the enable allocation decider which has a single option `allocation.enable` wich can be set to the following values:
`none`, `new_primaries`, `primaries` and `all` (default).
Closes#4488
The new internal get index settings api is more efficient when it comes to sending the index settings from the master to the client via the
Also the get index settings support now all the indices options.
Closes#4620
The FVH was throwing away some boosts on queries stopping a number of
ways to boost phrase matches to the top of the list of fragments from
working.
The plain highlighter also doesn't work for this but that is because it
doesn't support the concept of the same term having a different score at
different positions.
Also update documentation claiming that FHV is nicer for weighing terms
found by query combinations.
Closes#4351
Important: This breaks backwards compatibility with 0.90
* Removed endpoints: /_cluster/nodes, /_cluster/nodes/nodeId1,nodeId2
* Disallow usage of parameters, but make required metrics part of URI
* Changed NodesInfoRequest to return everything by default
* Fixed NPE in NodesInfoResponse
Closes#4055
Instead of specifying what kind of data should be filtered, this commit
streamlines the API to actually specify, what kind of data should be displayed.
This makes its behaviour similar to the other requests, like NodeIndicesStats.
A small feature has been added as well: If you specify an index to select on, not
only the metadata, but also the routing tables are filtered by index in order
to prevent too big cluster states to be returned.
Also the CAT apis have been changed to only return the wanted data in order to keep
network traffic as small as needed.
Tests for the cluster state API filtering have been added as well.
Note: This change breaks backwards compatibility with 0.90!
Closes#4065
Added a long-based representation of GeoHashes to GeoHashUtils for fast evaluation in aggregations.
The new BucketUtils provides a common heuristic for determining the number of results to obtain from each shard in "top N" type requests.
* Clean up s/ElasticSearch/Elasticsearch on docs/*
* Clean up s/ElasticSearch/Elasticsearch on src/* bin/* & pom.xml
* Clean up s/ElasticSearch/Elasticsearch on NOTICE.txt and README.textile
Closes#4634
Norms can be eagerly loaded on a per-field basis by setting norms.loading to
`eager` instead of the default `lazy`:
```
"my_string_field" : {
"type": "string",
"norms": {
"loading": "eager"
}
}
```
In case this behavior should be applied to all fields, it is possible to change
the default value by setting `index.norms.loading` to `eager`.
Close#4079
First, this breaks backwards compatibility!
* Removed /_cluster/nodes/stats endpoint
* Excpect the stats types not as parameters, but as part of the URL
* Returning all indices stats by default, returning all nodes stats by default
* Supporting groups & types in nodes stats now as well
* Updated documentation & tests accordingly
* Allow level parameter for "shards" and "indices" (cluster does not make sense here)
Closes#4057
Note: This breaks backward compatibility
* Removed clear/all parameters, now all stats are returned by default
* Made the metrics part of the URL
* Removed a lot of handlers
* Added shards/indices/cluster level paremeter to change response serialization
* Returning translog statistics in IndicesStats
* Added TranslogStats class
* Added IndexShard.translogStats() method to get the stats from concrete implementation
* Updated documentation
Closes#4054
The reason to not start packages on installation is to allow to configure
them before starting up (setting heap, cluster.name etc)
Also the documentation was updated in order to show, which statements need
to be executed.
In addition, these statements are also printed out when the package is
installed, depending on whether chkconfig, system or update-rc.d is used.
Closes#3722
When testing plugin manager with real downloads, it could happen that the test run forever. Fortunately, test suite will be interrupted after 20 minutes, but it could be useful not to fail the whole test suite but only warn in that case.
By default, plugin manager still wait indefinitely but it can be modified using new `--timeout` option:
```sh
bin/plugin --install elasticsearch/kibana --timeout 30s
bin/plugin --install elasticsearch/kibana --timeout 1h
```
Closes#4603.
Closes#4600.
This adds the field data circuit breaker, which is used to estimate
the amount of memory required to load field data before loading it. It
then raises a CircuitBreakingException if the limit is exceeded.
It is configured with two parameters:
`indices.fielddata.cache.breaker.limit` - the maximum number of bytes
of field data to be loaded before circuit breaking. Defaults to
`indices.fielddata.cache.size` if set, unbounded otherwise.
`indices.fielddata.cache.breaker.overhead` - a contast for all field
data estimations to be multiplied with before aggregation. Defaults to
1.03.
Both settings can be configured dynamically using the cluster update
settings API.
Currently there are two get aliases apis that both have the same functionality, but have a different response structure. The reason for having 2 apis is historic.
The GET _alias api was added in 0.90.x and is more efficient since it only sends the needed alias data from the cluster state between the master node and the node that received the request. In the GET _aliases api the complete cluster state is send to the node that received the request and then the right information is filtered out and send back to the client.
The GET _aliases api should be removed in favour for the alias api
Closes to #4539
* `ignore_unavailable` - Controls whether to ignore if any specified indices are unavailable, this includes indices that don't exist or closed indices. Either `true` or `false` can be specified.
* `allow_no_indices` - Controls whether to fail if a wildcard indices expressions results into no concrete indices. Either `true` or `false` can be specified. For example if the wildcard expression `foo*` is specified and no indices are available that start with `foo` then depending on this setting the request will fail. This setting is also applicable when `_all`, `*` or no index has been specified.
* `expand_wildcards` - Controls to what kind of concrete indices wildcard indices expression expand to. If `open` is specified then the wildcard expression if expanded to only open indices and if `closed` is specified then the wildcard expression if expanded only to closed indices. Also both values (`open,closed`) can be specified to expand to all indices.
Closes to #4436
This commits add doc values support to geo point using the exact same approach
as for numeric data: geo points for a given document are stored uncompressed
and sequentially in a single binary doc values field.
Close#4207
This commit allows to trade precision for memory when storing geo points.
This new field data impl accepts a `precision` parameter that controls the
maximum expected error for storing coordinates. This option can be updated on
a live index with the PUT mapping API.
Default precision is 1cm, which requires 8 bytes per geo-point (50% memory
saving compared to using 2 doubles).
Close#4386
Instead of using the '-f' parameter to start elasticsearch in the
foreground, this is now the default modus.
In order to start elasticsearch in the background, the '-d' parameter
can be used.
Closes#4392
The `text` query was replaced by the `match` query and has been
deprecated for quite a while.
The `field` query should be replaced by a `query_string` query with
the `default_field` specified.
Fixes#4033
This commit changes field data configuration updates so that they are
immediately taken into account for loading new segments. The way it works
is that field data configuration is now cached separately from the field
data cache, meaning that it is now possible to clear the field data
configuration from IndexFieldDataService while the cache will stay around. On
the next time that Elasticsearch will reload field data configuration, it will
check if there is already a cache entry, and reuse it if it exists.
To disable field data loading, all that is required is to change the field
data format to "none" (supported by all field data types) using the update
mapping API. Elasticsearch will then refuse to load field data on any new
segment, but field data which has been loaded on the previous segments will
remain available. So you need to clear the field data cache in order to
reclaim memory (otherwise memory will be reclaimed slower, as segments get
merged).
Close#4430Close#4431
When the ValuesSource has ordinals, terms ordinals are used as a cache key to
bucket ordinals. This can make terms aggregations on String terms significantly
faster.
Close#4350
The percolator uses this option to deal with the fact that the MemoryIndex doesn't support stored fields,
this is possible b/c the _source of the document being percolated is always present.
Closes#4348
This adds support for Lucene's SimpleQueryParser by adding a new type
of query called the `simple_query_string`. The `simple_query_string`
query is designed to be able to parse human-entered queries without
throwing any exceptions.
Resolves#4159
In order to be sure that memory mapped lucene directories are working
one can configure the kernel about how many memory mapped areas
a process may have. This setting ensure for the debian and redhat initscripts
as well as the systemd startup, that this setting is set high enough.
Closes#4397
This contribution is based on the feedback given in issue #4254 and
issue #4255, and should clear things up, when suggestions are being
removed and not displayed anymore after deletion of data.
The Fast Vector Highlighter can combine matches on multiple fields to
highlight a single field using `matched_fields`. This is most
intuitive for multifields that analyze the same string in different
ways. Example:
{
"query": {
"query_string": {
"query": "content.plain:running scissors",
"fields": ["content"]
}
},
"highlight": {
"order": "score",
"fields": {
"content": {
"matched_fields": ["content", "content.plain"],
"type" : "fvh"
}
}
}
}
Closes#3750
* Minor alignments (like setter to ctor)
* FuzzySuggester has a unicode aware flag, which is not exposed in the fuzzy completion request parameters
* Made XAnalyzingSuggester flags (PAYLOAD_SEP, END_BYTE, SEP_LABEL) to be written into the postings format, so we can retain backwards compatibility
* The above change also implies, that these flags can be set per instantiated XAnalyzingSuggester
* CompletionPostingsFormatTest now uses a randomProvider for writing data to check for bwc
Add FieldDataTermsFilter that compares terms out of
the fielddata cache. When filtering on a large
set of terms this filter can be considerably faster
than using a standard lucene terms filter.
Add the "fielddata" execution mode to the
terms filter parser to enable the use of
the new FieldDataTermsFilter.
Add supporting tests and documentation.
Closes#4209
The 'default' / 'standard' analyzer can be a trappy default sicne it filters
english stopwords by default. Yet a default should not be dedicated to a certain language
since elasticsearch is used in many different scenarios where a standard analysis chain
with specialization to english full-text might be rather counter productive.
This commit changes the 'standard' analyzer to use an empty stopword list for indices
that are created from 1.0.0.Beta1 version onwards but will maintain backwards compatibiliy
for older indices.
Closes#3775
Use .percolator as the internal (hidden) type name for percolators within the index. Seems nicer name to represent "hidden" types within an index.
closes#4090
This allows the RegexpQueryBuilder to be used in span queries
Added tests for all span multi term queries.
Also updated the documentation and removed mentioning of numeric range
queries for span queries (they have to be terms).
Closes#3392
This new API allows to get the mapping for a specific set of fields rather than get the whole index mapping and traverse it.
The fields to be retrieved can be specified by their full path, index name and field name and will be resolved in this order.
In case multiple field match, the first one will be returned.
Since we are now generating the output (rather then fall back to the stored mapping), you can specify `include_defaults`=true on the request to have default values returned.
Closes#3941
The setting causes the upper bound for a range query/filter to be rounded up,
therefore the name `round_ceil` seems to make more sense.
Also this commit removes the redundant fourth parameter to DateMathParser.parse(..)
which was never used.
was: parse(String text, long now, boolean roundUp, boolean upperInclusive)
is now: parse(String text, long now, boolean roundCeil)
closes#3914
Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require to reanalyze the text to be highlighted: the larger the documents the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite the same if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.
Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same
The lucene postings highlighter api is quite different compared to the existing highlighters api, the main difference being that it allows to highlight multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise trying not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.
Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.
Closes#3704
You can configure the highlighting api to return an excerpt of a field
even if there wasn't a match on the field.
The FVH makes excerpts from the beginning of the string to the first
boundary character after the requested length or the boundary_max_scan,
whichever comes first. The Plain highlighter makes excerpts from the
beginning of the string to the end of the last token before the requested
length.
Closes#1171
The description of the timeout parameter was worded misleadingly; it implied that the API would wait until the cluster reached the desired level and then stayed at that level for the timeout. I've tweaked the sentence to remove the risk of confusion.
The suggest stop filter is an improved version of the stop filter, which
takes stopwords only into account if the last char of a query is a
whitespace. This allows you to keep stopwords, but to allow suggesting for
"a".
Example: Index document content "a word". You are now able to suggest for
"a" and get back results in the completion suggester, if the suggest stop
filter is used on the query side, but will not get back any results for
"a " as this is identified as a stopword.
The implementation allows to set the `remove_trailing` parameter for a
custom stop filter and thus use the suggest stop filter instead of the
standard stop filter.
- "boost" should be "boost_factor"
- "mult" should be "multiply"
Also, store combine function names in ImmutableMap instead of iterating
over all possible names each time.
closes#3872 for master
Now that we properly fixed the ability to set the queue size on the index / bulk thread pool, we should actually set them to a somehow reasonable value to protect from users potentially overflowing our system.
I suggest defaults to be 50 for bulk, and 200 for indexing.
Also, set the thread pool for get, which we should set (in a similar value to a "read" queue size we have today).
closes#3888
This commit allows for using Lucene doc values as a backend for field data,
moving the cost of building field data from the refresh operation to indexing.
In addition, Lucene doc values can be stored on disk (partially, or even
entirely), so that memory management is done at the operating system level
(file-system cache) instead of the JVM, avoiding long pauses during major
collections due to large heaps.
So far doc values are supported on numeric types and non-analyzed strings
(index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values
which is the only type to support multi-valued fields. Since the field data API
set is a bit wider than the doc values API set, some operations are not
supported:
- field data filtering: this will fail if doc values are enabled,
- field data cache clearing, even for memory-based doc values formats,
- getting the memory usage for a specific field,
- knowing whether a field is actually multi-valued.
This commit also allows for configuring doc-values formats on a per-field basis
similarly to postings formats. In particular the doc values format of the
_version field can be configured through its own field mapper (it used to be
handled in UidFieldMapper previously).
Closes#3806
* Merged segments are now warmed-up at the end of the merge operation instead
of _refresh, so that _refresh doesn't pay the price for the warm-up of merged
segments, which is often higher than flushed segments because of their size.
* Even when no _warmer is registered, some basic warm-up of the segments is
performed: norms, doc values (_version). This should help a bit people who
forget to register warmers.
* Eager loading support for the parent id cache and field data: when one
can't predict what terms will be present in the index, it is tempting to use
a match_all query in a warmer, but in that case, query execution might not be
much faster than field data loading so having a warmer that only loads field
data without running a query can be useful.
Closes#3819
Stats can be retrieved on a per-feature / per-component basis including the fields
they apply to. This commit add support for a 'completion' flag to include statistics
for the complition feature as well as 'completion_fields' to only
include certain fields into the returned statistics.
To disambiguate between 'fielddata' and 'completion' fields this commit
uses 'fields' as the default inclusion filter for stats fields only used
if not dedicated '[completion|fielddata]_fields' paramter is provided.
Relates to #3522
Add a dedicated suggest thread pool for the suggest API. With the new completion suggest type, which is purely CPU bounded, it makes more sense to have a dedicated thread pool for suggest compared to having it share the search thread pool and "competing" against other search operations.
closes#3698
Refresh flag in optimize is problematic, since the shards refresh is allowed to execute on is different compared to the optimize shards. In order to do optimize and then refresh, they should be executed as separate APIs when needed.
closes#3690
Refresh flag in flush is problematic, since the shards refresh is allowed to execute on is different compared to the flush shards. In order to do flush and then refresh, they should be executed as separate APIs when needed.
closes#3689
/_template shows: No handler found for uri [/_template] and method [GET]
It would make sense to list the templates as they are listed in the /_cluster/state call.
Closes#2532.
The clear scroll api allows clear all resources associated with a `scroll_id` by deleting the `scroll_id` and its associated SearchContext.
Closes#3657
Now with have proper exit codes for elasticsearch plugin manager (see #3463), we can add a silent mode to plugin manager.
```sh
bin/plugin --install karmi/elasticsearch-paramedic --silent
```
Closes#3628.
When installing a plugin, we use:
```sh
bin/plugin --install groupid/artifactid/version
```
But when removing the plugin, we only support:
```sh
bin/plugin --remove dirname
```
where `dirname` is the directory name of the plugin under `/plugins` dir.
Closes#3421.
This commit adds two main pieces, the first is a ClusterInfoService
that provides a service running on the master nodes that fetches the
total/free bytes for each data node in the cluster as well as the
sizes of all shards in the cluster. This information is gathered by
default every 30 seconds, and can be changed dynamically by setting
the `cluster.info.update.interval` setting. This ClusterInfoService
can hopefully be used in the future to weight nodes for allocation
based on their disk usage, if desired.
The second main piece is the DiskThresholdDecider, which can disallow
a shard from being allocated to a node, or from remaining on the node
depending on configuration parameters. There are three main
configuration parameters for the DiskThresholdDecider:
`cluster.routing.allocation.disk.threshold_enabled` controls whether
the decider is enabled. It defaults to false (disabled). Note that the
decider is also disabled for clusters with only a single data node.
`cluster.routing.allocation.disk.watermark.low` controls the low
watermark for disk usage. It defaults to 0.70, meaning ES will not
allocate new shards to nodes once they have more than 70% disk
used. It can also be set to an absolute byte value (like 500mb) to
prevent ES from allocating shards if less than the configured amount
of space is available.
`cluster.routing.allocation.disk.watermark.high` controls the high
watermark. It defaults to 0.85, meaning ES will attempt to relocate
shards to another node if the node disk usage rises above 85%. It can
also be set to an absolute byte value (similar to the low watermark)
to relocate shards once less than the configured amount of space is
available on the node.
Closes#3480
Restrict the size of the input length to a reasonable size otherwise very
long strings can cause StackOverflowExceptions deep down in lucene land.
Yet, this is simply a saftly limit set to `50` UTF-16 codepoints by default.
This limit is only present at index time and not at query time. If prefix
completions > 50 UTF-16 codepoints are expected / desired this limit should be raised.
Critical string sizes are beyone the 1k UTF-16 Codepoints limit.
Closes#3596