Switches most search behavior extensions from push (`onModule(SearchModule)`)
to pull (`implements SearchPlugin`). This effort in general gives plugin
authors a much cleaner view of how to extend Elasticsearch and starts to
set up portions of Elasticsearch as "the plugin API". This commit in
particular does that for search-time behavior like customized suggesters,
highlighters, score functions, and significance heuristics.
It also switches most such customization to being done at search module
construction time which is much, much easier to reason about from a testing
perspective. It also helps significantly in the process of de-guice-ing
Elasticsearch's startup.
There are at least two major search time extensions that aren't covered in
this commit that will simply have to wait for the next commit on the topic
because this one has already grown large: custom aggregations and custom
queries. These will likely live in the same SearchPlugin interface as well.
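As a rough sketch of the pull model (hedged: the exact `SearchPlugin` method names and types may differ from what this commit introduces), a plugin now exposes its search extensions through getters instead of mutating the module:
```java
import java.util.Map;
import static java.util.Collections.singletonMap;

// Hedged sketch: getHighlighters and the Highlighter type are
// assumptions based on the description above, not a verified API.
public class MyPlugin extends Plugin implements SearchPlugin {
    @Override
    public Map<String, Highlighter> getHighlighters() {
        // SearchModule pulls this map at construction time instead of
        // the plugin pushing registrations via onModule(SearchModule).
        return singletonMap("my-highlighter", new MyHighlighter());
    }
}
```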
If there are percolator queries containing `range` queries with ranges based on the current time, then this can lead to incorrect results if the `percolate` query gets cached. These ranges change each time the `percolate` query is executed, so if the query gets cached the results will be based on the range as it was at the time the `percolate` query was cached.
The ExtractQueryTermsService has been renamed `QueryAnalyzer` and now only deals with analyzing the query (extracting terms and deciding if the entire query is a verified match). The `PercolatorFieldMapper` is responsible for adding the right fields based on the analysis the `QueryAnalyzer` has performed, because this is highly dependent on the field mappings. Also the `PercolatorFieldMapper` is responsible for creating the percolate query.
This exposes a method to start an action and return a task from
`NodeClient`. This allows reindex to use the injected `Client` rather
than requiring injected `TransportAction`s.
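A minimal sketch of the new usage (hedged: assuming an `executeLocally` method that starts the action and returns its `Task`; the exact signature is an assumption):
```java
// Hedged sketch: method and listener shapes are assumptions based on
// the description above.
Task task = nodeClient.executeLocally(ReindexAction.INSTANCE, request,
    new TaskListener<BulkIndexByScrollResponse>() {
        @Override
        public void onResponse(Task task, BulkIndexByScrollResponse response) {
            // the action finished; handle the response
        }

        @Override
        public void onFailure(Task task, Throwable e) {
            // the action failed
        }
    });
// The returned Task lets reindex track the running request without
// injecting TransportActions.
```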
After #13834, many tests that used Groovy scripts (for good or bad reasons) were moved into the lang-groovy module, and issue #13837 was created to track these messy tests so they could be cleaned up.
This commit moves more tests back into core, removes the dependency on Groovy, changes the scripts to use the mocked script engine, and changes the tests to integration tests.
This commit moves back some messy tests that had been placed in the lang-groovy module in https://github.com/elastic/elasticsearch/pull/13834. It removes the dependency on the Groovy plugin and changes the tests back to integration tests (IT suffix).
It also changes the current MockScriptEngine and MockScriptPlugin to make them easier to use.
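For illustration, the reworked mock roughly lets a test map a script source to a plain Java function (a sketch; the real `MockScriptPlugin` hook may differ):
```java
import java.util.Map;
import java.util.function.Function;
import static java.util.Collections.singletonMap;

// Hedged sketch: pluginScripts() is an assumed hook for supplying
// mocked script implementations keyed by script source.
public class MyTestScriptPlugin extends MockScriptPlugin {
    @Override
    protected Map<String, Function<Map<String, Object>, Object>> pluginScripts() {
        // Returns 42 whenever a script with this exact source is run.
        return singletonMap("doc['foo'].value", vars -> 42);
    }
}
```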
This adds a remote option to reindex that looks like
```
curl -XPOST 'localhost:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "target",
    "query": {
      "match": {
        "foo": "bar"
      }
    }
  },
  "dest": {
    "index": "target"
  }
}'
```
This reindex has all of the features of local reindex:
* Using queries to filter what is copied
* Retry on rejection
* Throttle/rethrottle
The big advantage of this version is that it goes over the HTTP API
which can be made backwards compatible.
Some things are different:
The query field is sent directly to the other node rather than parsed
on the coordinating node. This should allow it to support constructs
that are invalid on the coordinating node but are valid on the target
node. Mostly, that means old syntax.
This commit renames writeThrowable to writeException. The situation here
stems from the fact that the StreamOutput method for serializing
Exceptions needs to accept Throwables too as Throwables can be the cause
of serialized Exceptions. Yet, we do not serialize Throwables in the
Error sub-hierarchy in a way that they can be deserialized into their
initial type. This leads to an asymmetry between the StreamOutput method
for serializing Exceptions and the StreamInput method for reading
Exceptions. Namely, the former will accept Throwables but the latter
will only return Exceptions. A goal with the stream methods has always
been symmetry in the method names so that serialization/deserialization
routines appear symmetrical in code. It is this asymmetry on the
input/output types for Exceptions on StreamOutput/StreamInput that
clashes with the desired symmetry of naming. Despite this, we should
favor symmetry in the naming of the methods. This commit renames
StreamOutput#writeThrowable to StreamOutput#writeException which leaves
us with Exception StreamInput#readException and void
StreamOutput#writeException(Throwable).
Today throughout the codebase, catch throwable is used with reckless
abandon. This is dangerous because the throwable could be a fatal
virtual machine error resulting from an internal error in the JVM, or an
out of memory error or a stack overflow error that leaves the virtual
machine in an unstable and unpredictable state. This commit removes
catch throwable from the codebase and removes the temptation to use it
by modifying listener APIs to receive instances of Exception instead of
the top-level Throwable.
Relates #19231
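The listener change looks roughly like this (a sketch of the signature shift described above):
```java
// Before this change listeners accepted any Throwable, which invited
// catch (Throwable t) blocks. Now they receive Exception instead:
public interface ActionListener<Response> {
    void onResponse(Response response);

    void onFailure(Exception e); // was: void onFailure(Throwable e)
}
```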
Rename `fields` to `stored_fields` and add `docvalue_fields`
`stored_fields` parameter will no longer try to retrieve fields from the _source but will only return stored fields.
`fields` will throw an exception if the user uses it.
Add `docvalue_fields` as an adjunct to `fielddata_fields`, which is deprecated. `docvalue_fields` will try to load the value from docvalues and fall back to the fielddata cache if docvalues are not enabled on that field.
Closes #18943
BytesReference should be a really simple interface, yet it has a gazillion
ways to achieve the same thing. Methods like `#hasArray`, `#toBytesArray`, `#copyBytesArray`,
`#toBytesRef`, and `#bytes` are all really duplicates. This change simplifies the interface
dramatically and makes implementations of it much simpler. All array access has been removed
and is streamlined through a single `#toBytesRef` method. Utility methods to materialize a
compact byte array have been added too for convenience.
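In practice callers now go through the single access path, roughly like this (a sketch; the compacting utility's name is an assumption):
```java
import org.apache.lucene.util.BytesRef;

// Hedged sketch of the simplified surface described above.
static byte[] materialize(BytesReference bytes) {
    BytesRef ref = bytes.toBytesRef();    // the single array access path
    if (ref.offset == 0 && ref.length == ref.bytes.length) {
        return ref.bytes;                 // already a compact array
    }
    return BytesReference.toBytes(bytes); // assumed utility: copy into a compact array
}
```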
We'll migrate to NamedWriteable so we can share code with the rest
of the system. So we can work on this in multiple pull requests without
breaking Elasticsearch in between the commits this change supports
*both* old style `InternalAggregations.stream` serialization and
`NamedWriteable` style serialization. As such it creates about a
half dozen `// NORELEASE` comments that will have to be removed
once the migration is complete.
This also introduces a boolean `transportClient` flag to `SearchModule`
which is used to skip inappropriate registrations for the
transport client while still registering the things it needs. In
this case that means that the `InternalAggregation` subclasses are
registered with the `NamedWriteableRegistry` but the `AggregationBuilder`
subclasses are not.
Finally, this moves aggregation registration from guice configuration
time to `SearchModule` construction time. This will make it simpler to
work with in the future as we further clean up Elasticsearch's
extension points.
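Registration during `SearchModule` construction then looks roughly like this (a sketch; names are illustrative):
```java
// Hedged sketch: MyInternalAgg and MyAggregationBuilder are placeholders.
// The NamedWriteable reader is registered for both node and transport
// client; the builder registration is skipped on the transport client.
namedWriteableRegistry.register(InternalAggregation.class, "my_agg",
        MyInternalAgg::new);
if (transportClient == false) {
    registerAggregation("my_agg", MyAggregationBuilder::new); // hypothetical helper
}
```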
We have long worked to capture different partitioning scenarios in our testing infra. This PR adds a new variant, inspired by the Jepsen blogs, which had been overlooked so far: a partition where one node can still see and be seen by all other nodes. It also updates the resiliency page to better reflect all the work that was done in this area.
Update-By-Query and Delete-By-Query use internal versioning to update/delete documents. But documents can have a version number equal to zero if external versioning was used, which makes the UBQ/DBQ request fail because zero is not a valid version number and these requests only support internal versioning for now. Sequence numbers might help to solve this issue in the future.
The factory for ingest processors is generic, but that is only for the
return type of the create method. However, the actual consumer of the
factories only cares about Processor, so generics are not needed.
This change removes the generic type from the factory. It also removes
AbstractProcessorFactory, which only existed in order to pull the optional
tag from config. This functionality is moved to the caller of the
factories in ConfigurationUtil, and the create method now takes the tag.
This allows the covariant return type of the implementation to still
work, so tests do not need casts.
Previously all rest handlers would take Client in their injected ctor.
However, it was only to hold the client around for runtime. Instead,
this can be done just once in the HttpService which handles rest
requests, and passed along through the handleRequest method. It also
should always be a NodeClient, and other types of Clients (eg a
TransportClient) would not work anyway (and some handlers can be
simplified in follow ups like reindex by taking NodeClient).
`RestHandler`s are highly tied to actions so registering them in the
same place makes sense.
Removes the need for plugins to check if they are in transport client
mode before registering a RestHandler - `getRestHandlers` isn't called
at all in transport client mode.
This caused guice to throw a massive fit about the circular dependency
between NodeClient and the allocation deciders. I broke the circular
dependency by registering the actions map with the node client after
instantiation.
This pull request adds two util functions to the Mustache templating engine:
- {{#toJson}}my_map{{/toJson}} to render a Map parameter as a JSON string
- {{#join}}my_iterable{{/join}} to render any iterable (including arrays) as a comma separated list of values like `1, 2, 3`. It's also possible to change the default delimiter (comma) to something else.
closes #18970
These are useful methods from Groovy that give you control over
the replacements used:
```
'the quick brown fox'.replaceAll(/[aeiou]/,
m -> m.group().toUpperCase(Locale.ROOT))
```
Stored scripts are pulled from the cluster state, and the current api
requires passing the ClusterState on each call to compile. However, this
means every user of the ScriptService needs to depend on the
ClusterService. Instead, this change makes the ScriptService a
ClusterStateListener. It also simplifies tests a lot, as they no longer
need to create fake cluster states (except when testing stored scripts).
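A minimal sketch of the new shape (hedged: field and method bodies are illustrative):
```java
// Hedged sketch: ScriptService observes cluster state changes itself
// instead of every caller threading ClusterState into compile().
public class ScriptService implements ClusterStateListener {
    private volatile ClusterState clusterState;

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Keep the latest state so stored scripts can be resolved
        // when compile() is called, without a ClusterState parameter.
        this.clusterState = event.state();
    }
}
```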
Instead of implementing onModule(ActionModule) to register actions,
this has plugins implement ActionPlugin to declare actions. This is
yet another step in cleaning up the plugin infrastructure.
While I was in there I switched AutoCreateIndex and DestructiveOperations
to be eagerly constructed which makes them easier to use when
de-guice-ing the code base.
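A sketch of the declarative style (hedged: the `ActionHandler` shape is approximate):
```java
import java.util.List;
import static java.util.Collections.singletonList;

// Hedged sketch: MyAction and TransportMyAction are placeholders.
public class MyPlugin extends Plugin implements ActionPlugin {
    @Override
    public List<ActionHandler<? extends ActionRequest, ? extends ActionResponse>> getActions() {
        // Declared once; ActionModule pulls this list instead of the
        // plugin pushing registrations via onModule(ActionModule).
        return singletonList(new ActionHandler<>(MyAction.INSTANCE, TransportMyAction.class));
    }
}
```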
Previous to this change the unresolved extended bounds were passed into the histogram aggregator, which meant extendedbounds.min and extendedbounds.max were passed through as null. This had two effects on the histogram aggregator:
1. If the histogram aggregator was unmapped across all shards, the reduce phase would not add buckets for the extended bounds and the response would contain zero buckets
2. If the histogram aggregator was not unmapped in some shards, the reduce phase might sometimes choose to reduce based on the unmapped shard response and therefore the extended bounds would be ignored.
This change resolves the extended bounds in the unmapped case and solves the above two issues.
Closes #19009
This commit fixes several NPEs caused by implicitly performing a get request for a document that exists with its _source disabled and then trying to access the source. Instead of causing an NPE, the following queries will throw an exception with a "source disabled" message (similar behavior to when the document does not exist):
- GeoShape query for pre-indexed shape (throws IllegalArgumentException)
- Percolate query for an existing document (throws IllegalArgumentException)
A Terms query with a lookup will ignore the document if the source does not exist (same as if the document does not exist).
GET and HEAD requests for the document _source will return a 404 if the source is disabled (even if the document exists).
Instead of plugins calling `registerTokenizer` to extend the analyzer,
they now have to implement `AnalysisPlugin` and override
`getTokenizer`. This lines up extending analysis with extending
scripts. This allows `AnalysisModule` to construct the `AnalysisRegistry`
immediately as part of its constructor which makes testing analysis
much simpler.
This also moves the default analysis configuration into `AnalysisModule`
which is how search is setup.
Like `ScriptModule`, `AnalysisModule` no longer extends `AbstractModule`.
Instead it is only responsible for building `AnalysisRegistry`. We still
bind `AnalysisRegistry` but we only do so in `Node`. This means it
is available at module construction time so we slowly remove the need to
bind it in guice.
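Overriding the analysis extension point then looks roughly like this (hedged: the getter and provider types are approximations of what the description above implies):
```java
import java.util.Map;
import static java.util.Collections.singletonMap;

// Hedged sketch: MyTokenizerFactory is a placeholder and the exact
// getter name/return type may differ.
public class MyAnalysisPlugin extends Plugin implements AnalysisPlugin {
    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        // AnalysisModule pulls this map while building the AnalysisRegistry.
        return singletonMap("my_tokenizer", MyTokenizerFactory::new);
    }
}
```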
If we don't care about scoring then for certain candidate matches we can be sure that they will always match.
Verifying these queries with the MemoryIndex can therefore be skipped.
This commit moves template support out of the Search API to its own dedicated Search Template API in the lang-mustache module. It provides a new SearchTemplateAction that can be used to render templates before the request gets delegated to the usual Search API. The current REST endpoints are identical, but the Render Search Template endpoint now uses the same Search Template API with a new "simulate" option. When this option is enabled, the Search Template API only renders the template and returns immediately, without executing the search.
Closes #17906
This change adds a MapperPlugin interface which allows pull style
retrieval of mappers and metadata mappers added by plugins. For now, I
have kept the MapperRegistry, but this should be removed in the future
as it is just a silly container for 2 maps which could themselves be
passed around.
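A sketch of the pull style (hedged: method names approximate):
```java
import java.util.Map;
import static java.util.Collections.singletonMap;

// Hedged sketch: MyMapper is a placeholder type.
public class MyMapperPlugin extends Plugin implements MapperPlugin {
    @Override
    public Map<String, Mapper.TypeParser> getMappers() {
        // Pulled by the (for now) MapperRegistry instead of pushed in.
        return singletonMap("my_type", new MyMapper.TypeParser());
    }
}
```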
Makes ScriptModule just a plain class that manages building the
ScriptSettings and ScriptService from plugins. When we *need*
to bind ScriptService with guice we bind it in a lambda.
- fix it so that processors with the `ignore_failure` option do not
record their exception in the response
- add more tests; an empty `on_failure` now throws an
exception
This makes this sequence:
```
curl -XDELETE localhost:9200/source,dest?pretty
for i in $( seq 1 100 ); do
curl -XPOST localhost:9200/source/test -d'{"test": "test"}'; echo
done
curl localhost:9200/_refresh?pretty
curl -XPOST 'localhost:9200/_reindex?pretty&wait_for_completion=false' -d'{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest"
  }
}'
curl 'localhost:9200/_tasks/Jsyd6d9wSRW-O-NiiKbPcQ:237?wait_for_completion&pretty'
```
return the task *AND* the response to the user.
This also renames "result" to "response" in the persisted task info
to line it up with how we name the objects in Elasticsearch.
Previously we used token level lookbehind in the parser. That worked,
but only if the parser didn't have any ambiguity *at all*. Since the
parser has ambiguity it didn't work everywhere. In particular it failed
when parsing blocks in lambdas like `a -> {int b = a + 2; b * b}`.
This moves the hack from the parser into the lexer. There we can use
token lookbehind (same trick) to *insert* semicolons into the token
stream. This works much better for antlr because antlr's prediction
code can work with real tokens.
Also, the lexer is simpler than the parser, so if there is a place
to introduce a hack, that is a better place.
This change removes some unnecessary dependencies from ClusterService
and cleans up ClusterName creation. ClusterService is now not created
by guice anymore.
Painless: Add support for //m
Painless: Add support for //s
Painless: Add support for //i
Painless: Add support for //u
Painless: Add support for //U
Painless: Add support for //l
This means "literal" and is exposed for completeness sake with
the java api.
Painless: Add support for //c
c enables Java's CANON_EQ (canonical equivalence) flag which makes
unicode characters that are canonically equal match. Java's javadoc
gives "a\u030A" being equal to "\u00E5". That is that the "a" code
point followed by the "combining ring above" code point is equal to
the "a with combining ring above" code point.
Update docs and add multi-flag test
Whitelist most of the Pattern class.
Today we have a push model for registering basically anything. All our extension points
are defined on modules which we pass in to plugins. This is harder to maintain and adds
unnecessary dependencies on the modules themselves. This change moves towards a pull model
where the plugin offers a getter kind of method to get the extensions. This will also
help in the future if we need to pass dependencies to the extension points which can
easily be defined on the method as arguments if a pull model is used.
Adds support for the find operator (=~) and the match operator (==~)
to painless's regexes. Also whitelists most of the Matcher class and
documents regex support in painless.
The find operator (=~) returns a boolean that is the result of building
a matcher on the LHS with the Pattern on the RHS and calling `find` on
it. Use it like this:
```
if (ctx._source.last =~ /b/)
```
The match operator (==~) returns boolean like find but instead of calling
`find` on the Matcher it calls `matches`.
```
if (ctx._source.last ==~ /[^aeiou].*[aeiou]/)
```
Finally, if you want the actual matcher you do:
```
Matcher m = /[aeiou]/.matcher(ctx._source.last)
```
Registering a script engine or native scripts still uses Guice today
and is much more complicated than needed. This change moves to a pull
based model where script plugins have to implement a dedicated interface
`ScriptPlugin` and define simple getters returning instances rather than
classes.
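A sketch of the getter style (hedged: the exact `ScriptPlugin` methods may differ):
```java
// Hedged sketch: MyScriptEngineService is a placeholder implementation.
public class MyScriptPlugin extends Plugin implements ScriptPlugin {
    @Override
    public ScriptEngineService getScriptEngineService(Settings settings) {
        // An instance, not a class for Guice to construct.
        return new MyScriptEngineService(settings);
    }
}
```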
In 2.0 we added plugin descriptors which require defining a name and
description for the plugin. However, we still have name() and
description() which must be overridden from the Plugin class. This still
exists for classpath plugins. But classpath plugins are mainly for
tests, and even then, referring to classpath plugins with their class is
a better idea. This change removes name() and description(), replacing
the name for classpath plugins with the full class name.
This interface used to have dedicated methods to prevent calling execute
methods. These methods are unnecessary as the checks can simply be
done inside the execute methods themselves. This simplifies the interface
as well as its usage.
This adds a get task API that supports GET /_tasks/${taskId} and
removes that responsibility from the list tasks API. The get task
API supports wait_for_completion just as the list tasks API does
but doesn't support any of the list task API's filters. In exchange,
it supports falling back to the .results index when the task isn't
running any more. Like any good GET API it 404s when it doesn't
find the task.
Then we change reindex, update-by-query, and delete-by-query to
persist the task result when wait_for_completion=false. This leads
to the neat behavior that, once you start a reindex with
wait_for_completion=false, you can fetch the result of the task by
using the get task API and see the result when it has finished.
Also rename the .results index to .tasks.
Adds `/regex/` as a regex constructor. A couple of fun points:
1. This makes generic the idea of arbitrary stuff adding a constant.
Both SFunction and LRegex create a statically initialized constant.
Both go through Locals to do this because LRegex isn't directly
iterable from SScript.
2. Differentiating `/` as-in-division from `/` as-in-start-of-regex
is hard. See:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
The javascript folks have a way, way tougher time of it than we do
because they have semicolon insertion. We have the much simpler
delimiter rules. Even with our simpler life we still have to add
a hack to get lexing `/regex/` to work properly. I chose to add
token-level lookbehind because it seems to be a pretty contained hack.
I considered and rejected lexer modes, a lexer member variable,
having the parser set variables on the lexer (this is a fairly common
solution for js, I believe), and moving regex parsing to the parser
level.
3. I've only added a very small subset of java.util.regex to the
whitelist because it is the subset I needed to test LRegex sanely.
More deserves to be added, and maybe more regex syntax like `=~` and
`==~`. Those can probably be added without too much pain.
Add `}` as a statement delimiter, but only in places where it is
otherwise a valid part of the syntax, specifically the end of a block.
We do this by matching but not consuming it. Antlr 4 doesn't have
syntax for this so we have to kind of hack it together by actually
matching the `}` and then seeking backwards in the token stream to
"unmatch" it. This looks reasonably efficient. Not perfect, but way
better than the alternatives.
I tried and rejected a few options:
1. Actually consuming the `}` and piping a boolean all through the
grammar from the last statement in a block to the delimiter. This
ended up being a rather large change and made the grammar way more
complicated.
2. Adding a semantic predicate to delimiter that just does the
lookahead. This doesn't work out well because it doesn't work (I
never figured out why) and because it generates an *amazing*
`adaptivePredict` which makes a super huge DFA. It looks super
inefficient.
Closes #18821
Writeable is better for immutable objects like TimeValue.
Switch to writeZLong which takes up less space than the original
writeLong in the majority of cases. Since we expect negative
TimeValues we shouldn't use
writeVLong.
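A sketch of what the serialization looks like (hedged: the field name is illustrative):
```java
// Hedged sketch: writeZLong zig-zag encodes, so small negative durations
// (e.g. the sentinel TimeValue of -1) stay compact, unlike writeVLong
// which is unsuitable for negative values.
@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeZLong(duration);
}
```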