FVH deploys some recursive logic to extract terms from documents
that need to highlighted. For documents that have terms with super
large term frequency like a document that repeats a terms very
very often this can produce some very large stacks when extracting
the terms. Taken to an extreme this causes stack overflow errors
when this grow beyond a term frequency >= 6000.
The ultimate solution is a iterative implementation of the extract
logic but until then we should protect users from these massive
term extractions which might be not very useful in the first place.
Closes#3486
Added the FuzzySuggester in order to support completion queries
The following options have been added for the fuxxy suggester
* edit_distance: Maximum edit distance
* transpositions: Sets if transpositions should be counted as one or two changes
* min_prefix_len: Minimum length of the input before fuzzy suggestions are returned
* non_prefix_len: Minimum length of the input, which is not checked for fuzzy alternatives
Closes#3465
By making use of the lsb provided functions, one does not depend on the start-stop daemon version to test if elasticsearch is running.
This ensures, that the init script works on debian wheezy, squeeze, current ubuntu and LTS versions.
Closes#3452
This commit adds support for failing fast when running a test
case with `-Dtests.iters=N` and uses some goodness from LuceneTestCase
in a new base `AbstractRandomizedTest`. This class checks among other
things if a tests doesn't call `super.setup` / `super.tearDown` when it
should do and checks if a large static resources are not cleaned up
after the tests ie. a running node.
Retrieving termvectors for a document that does not have the requested field
caused a null pointer exception. Same for documents if the field has no term vectors,
for example, because the field only contains "?".
Now, an empty response is returned.
Closes#3471
MultiOrdinals.MultiDocs returned 'null' ordinals which caused
a NPE if the field was single valued and would allow a significantly
smaller in memory representation than single packed int ordinals.
Closes#3470
We currently return with status code 0 when an IOException occurs.
The plugin manager should in any case return a nonzero status if
the operation was not successful. Now the PluginManager uses the
following reponse codes based on 'sysexists.sh':
* '0' on success
* '64' command line usage error
* '70' internal software error
* '74' input/output
Closes#3463
Lucene 4.4 shipped with a fundamental change in how the decision
on when to write compound files is made. During segment flush the
compound files are written by default which solely relies on a flag
in the IndexWriterConfig. The merge policy has been factored out to
only make decisions on merges and not on IW flushes. The default now
is always writing CFS on flush to reduce resource usage like open files
etc. if segments are flushed regularly. While providing a senseable
default certain users / usecases might need to change this setting if
re-packing flushed segments into CFS is not desired.
Closes#3461
The ClusterService might not see the latest cluster state and therefore
might not contain the local node id. Discovery will always see the local
node id since it's set on startup.
Today due the optimizations in the boolean query builder we adjust
a pure negative query with a 'match_all'. This is not the desired
behavior in the MLT API if all the fields in a document are unsupported.
If that happens today we return all documents but the one MLT is
executed on.
Closes#3453
In the _parent field the type and id of the parent are stored as type#id, because of this a term filter on the _parent field with the parent id is always resolved to a terms filter with a type / id combination for each type in the mapping.
This can be improved by automatically use the most optimized filter (either term or terms) based on the number of parent types in the mapping.
Also added support to use the parent type in the term filter for the _parent field. Like this:
```json
{
"term" : {
"_parent" : "parent_type#1"
}
}
```
This will then always automatically use the term filter.
Closes#3454
The `size` option in the percolate api will limit the number of matches being returned:
```bash
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
"size" : 10,
"doc" : {...}
}'
```
In the above request no more than 10 matches will be returned. The `count` field will still return the total number of matches the document matched with.
The `size` option is not applicable for the count percolate api.
Closes#3440
This change requires different request processing on the binary protocol
level since it has been we provide compatibilty across minor version.
Yet, the suggest feature is still experimental but we try best effort
to make upgrades as seamless as possible.
This commit adds general highlighting support to the suggest feature.
The only implementation that implements this functionality at this
point is the phrase suggester.
The API supports a 'pre_tag' and a 'post_tag' that are used
to wrap suggested parts of the given user input changed by the
suggester.
Closes#3442
ScoreFunction scoring might result in under or overflow, for example if a user
decides to use the timestamp as a boost in the script scorer. Therefore, check
if cast causes a huge precision loss. Note that this does not always detect
casting issues. For example in
ScriptFunction.score()
the function
SearchScript.runAsDouble()
is called. AbstractFloatSearchScript implements it as follows:
@Override public double runAsDouble() { return runAsFloat(); }
In this case the cast happens before the assertion and therfore precision
lossor over/underflows cannot be detected by the assertion.
================
It might sometimes be desirable to have a tool available that allows to multiply the original score for a document with a function that decays depending on the distance of a numeric field value of the document from a user given reference.
These functions could be computed for several numeric fields and eventually be combined as a sum or a product and multiplied on the score of the original query.
This commit adds new score functions similar to boost factor and custom script scoring, that can be used togeter with the <code>function_score</code> keyword in a query.
To use distance scoring, the user has to define
1. a reference and
2. a scale
for each field the function should be applied on. A reference is needed to define a distance for the document and a scale to define the rate of decay.
Example use case
----------------
Suppose you are searching for a hotel in a certain town. Your budget is limited. Also, you would like the hotel to be close to the town center, so the farther the hotel is from the desired location the less likely you are to check in.
You would like the query results that match your criterion (for example, "hotel, Berlin, non-smoker") to be scored with respect to distance to the town center and also the price.
Intuitively, you would like to define the town center as the origin and maybe you are willing to walk 2km to the town center from the hotel.
In this case your *reference* for the location field is the town center and the *scale* is ~2km.
If your budget is low, you would probably prefer something cheap above something expensive.
For the price field, the *reference* would be 0 Euros and the *scale* depends on how much you are willing to pay, for example 20 Euros.
Usage
----------------
The distance score functions can be applied in two ways:
In the most simple case, only one numeric field is to be evaluated. To do so, call <code>function_score</code>, with the appropriate function. In the above example, this might be:
curl 'localhost:9200/hotels/_search/' -d '{
"query": {
"function_score": {
"gauss": {
"location": {
"reference": [
52.516272,
13.377722
],
"scale": "2km"
}
},
"query": {
"bool": {
"must": {
"city": "Berlin"
}
}
}
}
}
}'
which would then search for hotels in berlin with a balcony and weight them depending on how far they are from the Brandenburg Gate.
If you have more that one numeric field, you can combine them by defining a series of functions and filters, like, for example, this:
curl 'localhost:9200/hotels/_search/' -d '{
"query": {
"function_score": {
"functions": [
{
"filter": {
"match_all": {}
},
"gauss": {
"location": {
"reference": "11,12",
"scale": "2km"
}
}
},
{
"filter": {
"match_all": {}
},
"linear": {
"price": {
"reference": "0",
"scale": "20"
}
}
}
],
"query": {
"bool": {
"must": {
"city": "Berlin"
}
}
},
"score_mode": "multiply"
}
}
}'
This would effectively compute the decay function for "location" and "price" and multiply them onto the score. See <code> function_score</code> for the different options for combining functions.
Supported fields
----------------
Only single valued numeric fields, including time and geo locations, are be supported.
What is a field is missing?
----------------
Is the numeric field is missing in the document, that field will not be taken into account at all for this document. The function value for this field is set to 1 for this document. Suppose you have two hotels both of which are in Berlin and cost the same. If one of the documents does not have a "location", this document would get a higher score than the document having the "location" field set.
To avoid this, you could, for example, use the exists or the missing filter and add a custom boost factor to the functions.
…
"functions": [
{
"filter": {
"match_all": {}
},
"gauss": {
"location": {
"reference": "11, 12",
"scale": "2km"
}
}
},
{
"filter": {
"match_all": {}
},
"linear": {
"price": {
"reference": "0",
"scale": "20"
}
}
},
{
"boost_factor": 0.001,
"filter": {
"bool": {
"must_not": {
"missing": {
"existence": true,
"field": "coordinates",
"null_value": true
}
}
}
}
}
],
...
Closes#3423
The following are the API affected by this change and support now the readable_format flag (default false when not specified):
- indices segments
- indices stats
- indices status
- cluster nodes stats
- cluster nodes info
Closes#3432