================
It is sometimes desirable to multiply the original score of a document by a function that decays with the distance of one of the document's numeric field values from a user-given reference.
These functions can be computed for several numeric fields, combined as a sum or a product, and multiplied onto the score of the original query.
This commit adds new score functions, similar to boost factor and custom script scoring, that can be used together with the <code>function_score</code> keyword in a query.
To use distance scoring, the user has to define
1. a reference and
2. a scale
for each field the function should be applied to. The reference is needed to define a distance for the document, and the scale defines the rate of decay.
Example use case
----------------
Suppose you are searching for a hotel in a certain town. Your budget is limited. Also, you would like the hotel to be close to the town center, so the farther the hotel is from the desired location the less likely you are to check in.
You would like the query results that match your criteria (for example, "hotel, Berlin, non-smoker") to be scored with respect to both the distance to the town center and the price.
Intuitively, you would like to define the town center as the origin and maybe you are willing to walk 2km to the town center from the hotel.
In this case your *reference* for the location field is the town center and the *scale* is ~2km.
If your budget is low, you would probably prefer something cheap over something expensive.
For the price field, the *reference* would be 0 Euros and the *scale* depends on how much you are willing to pay, for example 20 Euros.
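To make the roles of *reference* and *scale* concrete, here is a minimal Python sketch of a Gaussian and a linear decay. The text above names only the reference and the scale, so the exact formulas are an assumption: the sketch assumes the function value is 1 at the reference and drops to a hypothetical `decay` value (0.5 here) at a distance of one scale. It is an illustration, not the implementation in this commit.

```python
import math


def gauss_decay(value, reference, scale, decay=0.5):
    """Gaussian decay: 1.0 at the reference, `decay` at a distance of `scale`.

    `decay` is an assumed parameter, not something the commit message names.
    """
    distance = abs(value - reference)
    # Choose sigma^2 so that the curve passes through (scale, decay).
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))


def linear_decay(value, reference, scale, decay=0.5):
    """Linear decay: 1.0 at the reference, `decay` at `scale`, 0.0 past the cutoff."""
    distance = abs(value - reference)
    cutoff = scale / (1.0 - decay)
    return max(0.0, (cutoff - distance) / cutoff)


# Price field from the hotel example: reference 0 Euros, scale 20 Euros.
for price in (0, 20, 40, 80):
    print(price, round(gauss_decay(price, 0, 20), 3), round(linear_decay(price, 0, 20), 3))
```

Under these assumptions a hotel that costs exactly one scale (20 Euros) more than the reference keeps half of its score, and the Gaussian curve never reaches zero while the linear one does.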
Usage
----------------
The distance score functions can be applied in two ways:
In the simplest case, only one numeric field is to be evaluated. To do so, call <code>function_score</code> with the appropriate function. In the above example, this might be:
curl 'localhost:9200/hotels/_search/' -d '{
  "query": {
    "function_score": {
      "gauss": {
        "location": {
          "reference": [
            52.516272,
            13.377722
          ],
          "scale": "2km"
        }
      },
      "query": {
        "bool": {
          "must": {
            "match": {
              "city": "Berlin"
            }
          }
        }
      }
    }
  }
}'
which would then search for hotels in Berlin and weight them depending on how far they are from the Brandenburg Gate.
If you have more than one numeric field, you can combine them by defining a series of functions and filters, for example:
curl 'localhost:9200/hotels/_search/' -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "match_all": {}
          },
          "gauss": {
            "location": {
              "reference": "11,12",
              "scale": "2km"
            }
          }
        },
        {
          "filter": {
            "match_all": {}
          },
          "linear": {
            "price": {
              "reference": "0",
              "scale": "20"
            }
          }
        }
      ],
      "query": {
        "bool": {
          "must": {
            "match": {
              "city": "Berlin"
            }
          }
        }
      },
      "score_mode": "multiply"
    }
  }
}'
This would effectively compute the decay functions for "location" and "price" and multiply them onto the score. See <code>function_score</code> for the different options for combining functions.
Supported fields
----------------
Only single-valued numeric fields, including time and geo locations, are supported.
What if a field is missing?
----------------
If the numeric field is missing in the document, that field will not be taken into account at all for this document: the function value for this field is set to 1. Suppose you have two hotels, both of which are in Berlin and cost the same. If one of the documents does not have a "location", this document would get a higher score than the document having the "location" field set.
To avoid this, you could, for example, use the exists or the missing filter and add a custom boost factor to the functions.
…
"functions": [
  {
    "filter": {
      "match_all": {}
    },
    "gauss": {
      "location": {
        "reference": "11, 12",
        "scale": "2km"
      }
    }
  },
  {
    "filter": {
      "match_all": {}
    },
    "linear": {
      "price": {
        "reference": "0",
        "scale": "20"
      }
    }
  },
  {
    "boost_factor": 0.001,
    "filter": {
      "bool": {
        "must_not": {
          "missing": {
            "existence": true,
            "field": "coordinates",
            "null_value": true
          }
        }
      }
    }
  }
],
...
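The effect of a missing field on the combined score can be sketched in a few lines of Python. The helper `combined_score` and the decay values 0.8 and 0.5 are hypothetical and chosen only for illustration; the sketch assumes the functions are combined by multiplication and that a missing field contributes a factor of 1, as described above.

```python
def combined_score(query_score, decay_values):
    """Multiply the query score by each field's decay value.

    `decay_values` maps a field name to its decay value in [0, 1], or to
    None when the document has no value for that field.
    """
    score = query_score
    for value in decay_values.values():
        # A missing field contributes a neutral factor of 1.
        score *= 1.0 if value is None else value
    return score


# Two hotels with the same price; only the first one has a location.
with_location = combined_score(1.0, {"location": 0.8, "price": 0.5})
without_location = combined_score(1.0, {"location": None, "price": 0.5})
print(with_location, without_location)
```

Because every decay value is at most 1, the document without a location can never score lower than an otherwise identical document that has one, which is why an extra penalty for missing fields may be needed.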
Closes #3423
The following APIs are affected by this change and now support the readable_format flag (default false when not specified):
- indices segments
- indices stats
- indices status
- cluster nodes stats
- cluster nodes info
Closes #3432
Also:
Bulk update retried one time less than requested
Documentation for retries on conflict says it defaults to 1 (but the default is 0)
TransportShardReplicationOperationAction methods now catch Throwables instead of Exceptions
Added a little extra check to UpdateTests.concurrentUpdateWithRetryOnConflict
Closes #3447 & #3448
Some rare tests need to busy-wait a short time until a given
condition occurs, for instance until a threadpool has scaled down the
number of threads. This commit adds a util that waits up to a given time
until a condition is met. In contrast to Thread.sleep, this method
increases the wait time iteratively by doubling it, so that fast tests
do not always have to wait a full sleep interval.
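The doubling wait described above can be sketched as follows. This is a Python illustration of the idea, not the utility added by this commit; the name `await_busy` and its parameters are hypothetical.

```python
import time


def await_busy(condition, max_wait=10.0, initial_sleep=0.001):
    """Poll `condition` until it returns True or `max_wait` seconds have passed.

    The sleep interval doubles after every failed check, so a condition that
    becomes true quickly costs only a few milliseconds of waiting.
    """
    deadline = time.monotonic() + max_wait
    sleep = initial_sleep
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(sleep, remaining))
        sleep *= 2


# Usage: wait for a condition that becomes true after roughly 50 ms.
start = time.monotonic()
ready = await_busy(lambda: time.monotonic() - start > 0.05, max_wait=1.0)
print(ready)
```

Compared with a fixed sleep, the first checks happen after 1, 2, 4, ... milliseconds, so a test whose condition is met almost immediately returns almost immediately.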
This commit also adds a suite timeout to fail a test if the test
times out. The test infrastructure will provide thread stack traces
if the timeout kicks in. The default timeout is set to 1h.
The current implementation does not truncate the pidfile; it only writes the new PID over the beginning of the existing content.
So if the new PID is 4 digits long, but the file already contains a 5-digit number, the file will still contain 5 digits after the write.
Note: If the pidfile still exists, this usually means that either an instance using this pidfile is already running or the process did not shut down cleanly.
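The failure mode can be reproduced with a small Python sketch. The file name `es.pid` and the helper names are hypothetical; the sketch only contrasts writing at the start of an existing file with truncating it first, and is not the code changed by this commit.

```python
import os
import tempfile


def write_pid_in_place(path, pid):
    """Write at offset 0 without truncating: older content can survive."""
    with open(path, "r+") as f:
        f.write(str(pid))


def write_pid_truncating(path, pid):
    """Open with mode "w", which truncates the file before writing."""
    with open(path, "w") as f:
        f.write(str(pid))


def read_pidfile(path):
    with open(path) as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), "es.pid")

write_pid_truncating(path, 56789)   # stale 5-digit PID left by an earlier run
write_pid_in_place(path, 1234)      # new 4-digit PID written without truncation
stale = read_pidfile(path)

write_pid_truncating(path, 1234)    # truncating first leaves only the new PID
clean = read_pidfile(path)

print(repr(stale), repr(clean))
```

After the in-place write the file still holds five digits, the last one being left over from the old PID, so anything that reads the pidfile would see a process ID that was never written.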
Closes #3425
* Added HEAD support for index templates to find out if they exist
* Returning a 404 instead of a 200 if a GET hits a non-existing index template
Closes #3434
Update the HighlightBuilder.Field API so that it allows the same options
as SearchContextHighlight.Field. In other words, whatever can be set up
through the DSL for highlighting at the field level is also
possible via the Java API.
Closes #3435
Even though we use the keyword analyzer for the bool type, we should also mark it as not tokenized in the Lucene field type; there is no reason to take it through the analysis phase to begin with.
Added a new percolate API that only returns the number of percolate queries that have matched the document being percolated. The actual query ids are not included. The percolate total count is put in the total field, which is the only result returned from the dedicated count APIs.
The total field is also included in the responses of the already existing percolate and percolate-existing-document APIs and is equal to the number of matches.
Closes #3430
Until now, the bin/plugin command did not allow one to set JVM parameters
for startup. Usually these parameters are not needed (no need to configure
heap sizes for such a short-running process), but one could not set the
configuration path either. That one is important for plugins in order to find
out where the plugin directory is.
This is especially problematic when elasticsearch is installed as a
debian/rpm package, because the configuration file is not placed in the
same directory structure as the plugin shell script.
This pull request allows calling bin/plugin like this:
bin/plugin -Des.default.config=/etc/elasticsearch/elasticsearch.yml -install mobz/elasticsearch-head
As a last small improvement, the PluginManager now outputs the directory
the plugin was installed to, in order to avoid confusion.
Closes #3304
make sure relocating shards add their corresponding initializing shard routing when searching across initializing shards
also, make shardFailures lazy again
closes #3427
At the final stage of a relocation, during the final flip of the states, a search request might hit a node that would then execute it on a shard that has already relocated.
For this, we need to execute broadcast and search operations against initializing shards as well, but only as a last resort. The operation will be rejected if not applicable (i.e. IndexShard#searcher() checked for read allowed).
Note, this requires careful thought about which failures we send back. If we try an initializing shard and it fails, its failure should not override an actual failure of an active shard.
Also, removed an atomic integer used in broadcast request and use a similar shard index trick we now have in our search execution.
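The failure-handling rule above can be illustrated with a small Python sketch. Everything in it is hypothetical (the function name, the `(name, state)` tuples, the `run` callable); it only shows the ordering and the rule that an initializing copy's failure must not replace the failure of an active copy, and does not mirror the actual shard iterator code.

```python
def execute_on_shard_copies(copies, run):
    """Try active copies first and initializing copies only as a last resort.

    `copies` is a list of (name, state) tuples and `run` is a callable that
    returns a result or raises. Both are stand-ins for illustration.
    """
    # Stable sort: active copies keep their order and come before the rest.
    ordered = sorted(copies, key=lambda copy: copy[1] != "active")
    reported_failure = None
    for name, state in ordered:
        try:
            return run(name)
        except Exception as exc:
            # A failure on an initializing copy is only kept when there is
            # no failure from an active copy to report instead.
            if state == "active" or reported_failure is None:
                reported_failure = exc
    if reported_failure is None:
        raise RuntimeError("no shard copies to execute on")
    raise reported_failure


def always_fail(name):
    raise RuntimeError(name + " failed")


try:
    execute_on_shard_copies(
        [("copy-b", "initializing"), ("copy-a", "active")], always_fail
    )
except RuntimeError as exc:
    reported = str(exc)

print(reported)
```

When both copies fail, the caller sees the failure of the active copy, even though the initializing copy was tried last.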
closes #3427
UpdateNumberOfReplicasTests#simpleUpdateNumberOfReplicasTests is very
fragile because it executes searches based on outdated knowledge of
the cluster state and calls shards that have been relocated away in
the meantime. A fix is on the way.
* Changed the response to include the alias as part of each match.
* Added `percolate_format=ids` query string option to just serialize the ids in the rest response.
* Added support for multiple indices in the percolate api.
Closes #3420
This commit introduces near-realtime suggestions. For more information about
their usage, refer to GitHub issue #3376.
From the implementation point of view, a custom AnalyzingSuggester is used
in combination with a custom postings format (which is not exposed to the
user anywhere).
Closes #3376
We have an optimization where we try to delay the reroute after we have processed the shard started events, to try and combine a few into the same event. With the queueing of shard started events in place, we don't need to do it, and we can reroute right away, which will actually reduce the number of cluster state events we send.
This will also have a nice side effect of not missing on "waitForRelocatingShards(0)" on cluster health checks, since relocations will happen right away.
closes #3417
- checking routing table taken from same (up-to-date) cluster state
- added @Slow annotation
- forced cluster reroute when needed
- changed order of assertions so that if it fails again it's easier to understand why