5227 Commits

Author SHA1 Message Date
Shay Banon 5c7d7fb399 Failure to execute search request with empty top level filter
closes #3477
2013-08-10 10:21:30 +02:00
Simon Willnauer be103c188b Disable UpdateMappingTests#updateDefaultMappingSettings
Test has been too flaky over nightly builds. Disabling it
with AwaitFix.
2013-08-10 07:57:58 +02:00
Boaz Leskes 4debf44cd9 Separated index creation from mapping creation pending bug fix concerning concurrent not-acked mapping requests 2013-08-09 21:39:47 +02:00
Boaz Leskes 5f4dc5433e when changing the _default_ mapping, do not apply the old _default_ mapping to the new one, and do not validate the new version with a merge but parse it as a new type.
Closes #3474, Closes #3476
2013-08-09 20:15:51 +02:00
Britta Weber f64065c9d2 termvectors: fix null pointer exception if field has no term vectors
Retrieving termvectors for a document that does not have the requested field
caused a null pointer exception. Same for documents if the field has no term vectors,
for example, because the field only contains "?".
Now, an empty response is returned.

Closes #3471
2013-08-09 15:06:09 +02:00
Simon Willnauer ec770373ab Added random sort test for dense and sparse fields.
This test triggers a MultiDoc / MultiOrds in-memory representation
even if the field is not multivalued

Relates to #3470
2013-08-09 14:15:26 +02:00
Simon Willnauer 417c193cc3 Return ordinals from MultiOrdinals.MultiDocs
MultiOrdinals.MultiDocs returned 'null' ordinals which caused
a NPE if the field was single valued and would allow a significantly
smaller in memory representation than single packed int ordinals.

Closes #3470
2013-08-09 08:03:08 +02:00
Simon Willnauer 2ed87b5312 Use nonzero status code to signal abnormal termination
We currently return with status code 0 when an IOException occurs.
The plugin manager should in any case return a nonzero status if
the operation was not successful. Now the PluginManager uses the
following response codes based on 'sysexits.h':
 * '0' on success
 * '64' command line usage error
 * '70' internal software error
 * '74' input/output
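
The exit-code convention listed above can be sketched as follows. This is an illustrative Python sketch, not the PluginManager's actual Java code: the function `run_plugin_command`, the `action` callable and the mapping from exception types to codes are assumptions made for the example.

```python
# Exit codes named in the commit message.
EXIT_OK = 0
EXIT_USAGE = 64
EXIT_SOFTWARE = 70
EXIT_IO = 74


def run_plugin_command(args, action):
    """Run a hypothetical plugin action and map the outcome to an exit code."""
    if not args:
        return EXIT_USAGE      # nothing to do: command line usage error
    try:
        action(args)
    except OSError:
        return EXIT_IO         # e.g. a failed download or file system error
    except Exception:
        return EXIT_SOFTWARE   # any other unexpected failure
    return EXIT_OK
```

A caller would pass the returned value to the process exit call, so that shell scripts can tell an I/O failure from a usage error.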

Closes #3463
2013-08-08 17:48:56 +02:00
Martijn van Groningen f8f8cac0ed ttl can be lower than 0 (purge interval) 2013-08-08 17:43:11 +02:00
Martijn van Groningen c568fb6344 In case ttl has passed, then just check the delete count 2013-08-08 17:42:12 +02:00
Simon Willnauer 5b8ce393db Create mapping ahead of time and don't rely on index request in test 2013-08-08 17:28:59 +02:00
Simon Willnauer 4e2b9ff2ad Expose 'index.compound_on_flush' via engine settings
Lucene 4.4 shipped with a fundamental change in how the decision
on when to write compound files is made. During segment flush the
compound files are written by default which solely relies on a flag
in the IndexWriterConfig. The merge policy has been factored out to
only make decisions on merges and not on IW flushes. The default now
is always writing CFS on flush to reduce resource usage like open files
etc. if segments are flushed regularly. While providing a sensible
default certain users / usecases might need to change this setting if
re-packing flushed segments into CFS is not desired.

Closes #3461
2013-08-08 13:36:05 +02:00
Simon Willnauer 04b23a8fab Catch RejectedExecutionException on node shutdown 2013-08-08 13:10:13 +02:00
Simon Willnauer ef365098e7 Use DiscoveryModule instead of ClusterService to obtain local node id
The ClusterService might not see the latest cluster state and therefore
might not contain the local node id. Discovery will always see the local
node id since it's set on startup.
2013-08-08 12:39:49 +02:00
Martijn van Groningen d450d3b016 Simplified checks 2013-08-08 11:33:06 +02:00
Shay Banon c7d5881686 make sure we add the _uid as the first field in a doc
this will improve early termination loading times, but requires potential improvements in Lucene in terms of decompression
2013-08-07 23:28:07 +02:00
Simon Willnauer 6c91ff83f2 Assert on index delete in tests to ensure all indices are wiped even on disk 2013-08-07 17:56:55 +02:00
Simon Willnauer bcda6dfe54 Remove random empty string from test since it triggers a different exception 2013-08-07 14:17:27 +02:00
Shay Banon 80fa91d873 improve effort into figuring out the shard associated with a search failure 2013-08-07 14:16:22 +02:00
Martijn van Groningen d26b165af3 Added improvements for terms filter on _parent field similar to what has been done for the term filter.
Relates to #3454
2013-08-07 14:05:40 +02:00
Simon Willnauer f2dc4f810c Added tests for malformed mappings with no root object
This commit also makes the error message more consistent with
other exception messages in the DocumentMapperParser.
2013-08-07 14:01:32 +02:00
Manuel Bernhardt 27518b5e41 Improved error message when the mapping document is malformed 2013-08-07 13:41:49 +02:00
Simon Willnauer 7f0115ba9a Return nothing instead of everything in MLT if no field is supported.
Today, due to the optimizations in the boolean query builder, we adjust
a pure negative query with a 'match_all'. This is not the desired
behavior in the MLT API if all the fields in a document are unsupported.
If that happens today we return all documents but the one MLT is
executed on.

Closes #3453
2013-08-07 13:25:09 +02:00
Martijn van Groningen 73c038fb48 Improved filtering by _parent field
In the _parent field the type and id of the parent are stored as type#id, because of this a term filter on the _parent field with the parent id is always resolved to a terms filter with a type / id combination for each type in the mapping.

This can be improved by automatically using the most optimized filter (either term or terms) based on the number of parent types in the mapping.

Also added support to use the parent type in the term filter for the _parent field. Like this:
```json
{
   "term" : {
        "_parent" : "parent_type#1"
    }
}
```
This will then always automatically use the term filter.
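
The selection between term and terms described above could look roughly like this. It is a Python sketch of the idea only: the helper `parent_filter` and its arguments are hypothetical, and it assumes that a value containing `#` is already in `type#id` form.

```python
def parent_filter(parent_value, parent_types):
    """Build a filter on the _parent field, whose terms are stored as type#id."""
    if "#" in parent_value:
        # The caller already supplied "type#id": a single term is enough.
        return {"term": {"_parent": parent_value}}
    uids = [f"{t}#{parent_value}" for t in parent_types]
    if len(uids) == 1:
        # Only one parent type in the mapping: the term filter suffices.
        return {"term": {"_parent": uids[0]}}
    # Several parent types: one type#id candidate per type.
    return {"terms": {"_parent": uids}}
```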

Closes #3454
2013-08-07 13:20:21 +02:00
Martijn van Groningen 5e0b1621b4 added Lucene upgrade reminder 2013-08-07 10:46:25 +02:00
Martijn van Groningen 12c7eeb262 Added `size` option to percolate api
The `size` option in the percolate api will limit the number of matches being returned:
 ```bash
 curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
    "size" : 10,
    "doc" : {...}
 }'
 ```
 In the above request no more than 10 matches will be returned. The `count` field will still return the total number of matches the document matched with.

 The `size` option is not applicable for the count percolate api.
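
The relation between `size` and `count` can be illustrated with a small in-memory model. This Python sketch is not the percolator implementation: registered queries are modelled as plain predicates, and the response shape with `count` and `matches` keys is an assumption.

```python
def percolate(doc, queries, size=None):
    """Return matching query ids, truncated to size, plus the total match count."""
    matches = [qid for qid, predicate in queries.items() if predicate(doc)]
    total = len(matches)
    if size is not None:
        matches = matches[:size]   # limit what is returned, not what is counted
    return {"count": total, "matches": matches}
```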

 Closes #3440
2013-08-07 10:27:20 +02:00
Simon Willnauer 662bb80d6b Add binary protocol backwards compatibility for suggest highlights
This change requires different request processing on the binary protocol
level since we provide compatibility across minor versions.
Yet, the suggest feature is still experimental but we try best effort
to make upgrades as seamless as possible.
2013-08-07 10:19:11 +02:00
Luca Cavanna 3574d9de49 added explicit creation of parent type in create index 2013-08-06 23:10:33 +02:00
Nik Everett 72d6d822ae Add highlighting support for suggester.
This commit adds general highlighting support to the suggest feature.
The only implementation that implements this functionality at this
point is the phrase suggester.
The API supports a 'pre_tag' and a 'post_tag' that are used
to wrap suggested parts of the given user input changed by the
suggester.
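
As a rough illustration of what wrapping the changed parts means, the following Python sketch compares the input and the suggestion token by token. The function `highlight`, the whitespace tokenization and the default tag values are assumptions for the example and do not reflect how the phrase suggester works internally.

```python
def highlight(original, suggestion, pre_tag="<em>", post_tag="</em>"):
    """Wrap every suggested token that differs from the user's input token."""
    original_tokens = original.split()
    out = []
    for i, token in enumerate(suggestion.split()):
        changed = i >= len(original_tokens) or original_tokens[i] != token
        out.append(f"{pre_tag}{token}{post_tag}" if changed else token)
    return " ".join(out)
```

For the input `nobel prize` and the suggestion `noble prize`, only the corrected first token is wrapped.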

Closes #3442
2013-08-06 20:57:39 +02:00
Britta Weber a938bd57a9 add assertion for cast double->float
ScoreFunction scoring might result in under or overflow, for example if a user
decides to use the timestamp as a boost in the script scorer. Therefore, check
if cast causes a huge precision loss. Note that this does not always detect
casting issues. For example in
ScriptFunction.score()
the function
SearchScript.runAsDouble()
is called. AbstractFloatSearchScript implements it as follows:
@Override
    public double runAsDouble() {
        return runAsFloat();
    }
In this case the cast happens before the assertion and therefore precision
loss or over/underflows cannot be detected by the assertion.
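
The kind of check described here, and its blind spot, can be sketched in Python by emulating a 32-bit float with the `struct` module. The helper names `to_float32` and `cast_is_safe` and the tolerance value are assumptions; the actual assertion in the Java code may differ.

```python
import math
import struct


def to_float32(value):
    """Round a Python float (a double) to the nearest 32-bit float."""
    try:
        return struct.unpack("f", struct.pack("f", value))[0]
    except OverflowError:
        # struct rejects out-of-range values; a Java cast would yield infinity.
        return math.copysign(math.inf, value)


def cast_is_safe(value, rel_tol=1e-5):
    """Return False if narrowing to float overflows, underflows or loses much precision."""
    as_float = to_float32(value)
    if math.isinf(as_float) and not math.isinf(value):
        return False   # overflow
    if as_float == 0.0 and value != 0.0:
        return False   # underflow
    return math.isclose(as_float, value, rel_tol=rel_tol)
```

If the value was already narrowed before the check runs, as in the `runAsDouble()` case, `cast_is_safe(to_float32(1e40))` returns `True` even though the original value overflowed.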
2013-08-06 18:39:36 +02:00
Britta Weber e707308f1f Distance scoring
================

It might sometimes be desirable to have a tool that multiplies the original score of a document by a function that decays with the distance of a numeric field value of the document from a user-given reference.

These functions could be computed for several numeric fields and eventually be combined as a sum or a product and multiplied on the score of the original query.

This commit adds new score functions similar to boost factor and custom script scoring, that can be used together with the <code>function_score</code> keyword in a query.

To use distance scoring, the user has to define

 1. a reference and
 2. a scale

for each field the function should be applied on. A reference is needed to define a distance for the document and a scale to define the rate of decay.
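
To make the roles of reference and scale concrete, here is a small Python sketch of two decay shapes. The commit message does not spell out the formulas, so both functions are assumptions: the Gaussian uses the scale directly as its standard deviation, and the linear variant reaches zero at a distance of one scale.

```python
import math


def gauss_decay(value, reference, scale):
    """Gaussian decay: 1.0 at the reference, falling off with distance."""
    distance = abs(value - reference)
    # Assumption: scale is used directly as the standard deviation.
    return math.exp(-(distance ** 2) / (2.0 * scale ** 2))


def linear_decay(value, reference, scale):
    """Linear decay: 1.0 at the reference, 0.0 once distance reaches scale."""
    distance = abs(value - reference)
    return max(0.0, 1.0 - distance / scale)
```

In both cases a larger scale makes the score fall off more slowly as the field value moves away from the reference.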

Example use case
----------------

Suppose you are searching for a hotel in a certain town. Your budget is limited. Also, you would like the hotel to be close to the town center, so the farther the hotel is from the desired location the less likely you are to check in.
You would like the query results that match your criterion (for example, "hotel, Berlin, non-smoker") to be scored with respect to distance to the town center and also the price.

Intuitively, you would like to define the town center as the origin and maybe you are willing to walk 2km to the town center from the hotel.
In this case your *reference* for the location field is the town center and the *scale* is ~2km.

If your budget is low, you would probably prefer something cheap above something expensive.
For the price field, the *reference* would be 0 Euros and the *scale* depends on how much you are willing to pay, for example 20 Euros.

Usage
----------------

The distance score functions can be applied in two ways:

In the most simple case, only one numeric field is to be evaluated. To do so, call <code>function_score</code>, with the appropriate function. In the above example, this might be:

    curl 'localhost:9200/hotels/_search/' -d '{
    "query": {
        "function_score": {
            "gauss": {
                "location": {
                    "reference": [
                        52.516272,
                        13.377722
                    ],
                    "scale": "2km"
                }
            },
            "query": {
                "bool": {
                    "must": {
                        "city": "Berlin"
                    }
                }
            }
        }
    }
    }'

which would then search for hotels in Berlin and weight them depending on how far they are from the Brandenburg Gate.

If you have more than one numeric field, you can combine them by defining a series of functions and filters, like, for example, this:

    curl 'localhost:9200/hotels/_search/' -d '{
    "query": {
        "function_score": {
            "functions": [
                {
                    "filter": {
                        "match_all": {}
                    },
                    "gauss": {
                        "location": {
                            "reference": "11,12",
                            "scale": "2km"
                        }
                    }
                },
                {
                    "filter": {
                        "match_all": {}
                    },
                    "linear": {
                        "price": {
                            "reference": "0",
                            "scale": "20"
                        }
                    }
                }
            ],
            "query": {
                "bool": {
                    "must": {
                        "city": "Berlin"
                    }
                }
            },
            "score_mode": "multiply"
        }
    }
    }'

This would effectively compute the decay function for "location" and "price" and multiply them onto the score. See <code>function_score</code> for the different options for combining functions.

Supported fields
----------------
Only single valued numeric fields, including time and geo locations, are supported.

What if a field is missing?
----------------

If the numeric field is missing in the document, that field will not be taken into account at all for this document. The function value for this field is set to 1 for this document. Suppose you have two hotels both of which are in Berlin and cost the same. If one of the documents does not have a "location", this document would get a higher score than the document having the "location" field set.
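
The effect can be reproduced with a few lines of Python. This is a simplified sketch under stated assumptions: the decay is a hypothetical Gaussian, the location is reduced to a one-dimensional `distance_km` value rather than a geo point, and the functions are combined by multiplication.

```python
import math


def decay(value, reference, scale):
    """Hypothetical Gaussian decay; a missing value contributes a neutral 1.0."""
    if value is None:
        return 1.0
    return math.exp(-((value - reference) ** 2) / (2.0 * scale ** 2))


def combined_score(query_score, doc, functions):
    """Multiply the query score by the decay value of every configured field."""
    score = query_score
    for field, (reference, scale) in functions.items():
        score *= decay(doc.get(field), reference, scale)
    return score


functions = {"distance_km": (0.0, 2.0), "price": (0.0, 20.0)}
with_location = combined_score(1.0, {"distance_km": 1.0, "price": 50}, functions)
without_location = combined_score(1.0, {"price": 50}, functions)
```

Because the missing field contributes a factor of 1 while any positive distance contributes less than 1, `without_location` ends up larger than `with_location`.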

To avoid this, you could, for example, use the exists or the missing filter and add a custom boost factor to the functions.

      …
     "functions": [
        {
            "filter": {
                "match_all": {}
            },
            "gauss": {
                "location": {
                    "reference": "11, 12",
                    "scale": "2km"
                }
            }
        },
        {
            "filter": {
                "match_all": {}
            },
            "linear": {
                "price": {
                    "reference": "0",
                    "scale": "20"
                }
            }
        },
        {
            "boost_factor": 0.001,
            "filter": {
                "bool": {
                    "must_not": {
                        "missing": {
                            "existence": true,
                            "field": "coordinates",
                            "null_value": true
                        }
                    }
                }
            }
        }
    ],
    ...

Closes #3423
2013-08-06 18:37:55 +02:00
Britta Weber 720b550a94 Unify custom scores
===================

The custom boost factor, custom script boost and the filters function query all do the same thing: They take a query and for each found document compute a new score based on the query score and some script, some custom boost factor or a combination of these two. However, the json format for these three functionalities is very different. This makes it hard to add new functions.

This commit introduces one keyword <code>function_score</code> for all three functions.

The new format can be used to either compute a new score with one function:

	"function_score": {
        "(query|filter)": {},
        "boost": "boost for the whole query",
        "function": {}
    }

or allow to combine the newly computed scores

    "function_score": {
        "(query|filter)": {},
        "boost": "boost for the whole query",
        "functions": [
            {
                "filter": {},
                "function": {}
            },
            {
                "function": {}
            }
        ],
        "score_mode": "(mult|max|...)"
    }

<code>function</code> here can be either

	"script_score": {
    	"lang": "lang",
    	"params": {
        	"param1": "value1",
        	"param2": "value2"
   		 },
    	"script": "some script"
	}

or

	"boost_factor" : number

New custom functions can be added via the function score module.

Changes
---------

The custom boost factor query

	"custom_boost_factor" : {
    	"query" : {
        	....
    	},
    	"boost_factor" : 5.2
	}

becomes

	"function_score" : {
    	"query" : {
        	....
    	},
    	"boost_factor" : 5.2
	}

The custom script score

	"custom_score" : {
    	"query" : {
        	....
	    },
    	"params" : {
        	"param1" : 2,
 	       	"param2" : 3.1
    	},
	    "script" : "_score * doc['my_numeric_field'].value / pow(param1, param2)"
	}

becomes

	"function_score" : {
    	"query" : {
        	....
	    },
	    "script_score" : {

    		"params" : {
        		"param1" : 2,
 	       		"param2" : 3.1
    		},
	    	"script" : "_score * doc['my_numeric_field'].value / pow(param1, param2)"
	    }
	}

and the custom filters score query

    "custom_filters_score" : {
        "query" : {
            "match_all" : {}
       	 },
        "filters" : [
            {
                "filter" : { "range" : { "age" : {"from" : 0, "to" : 10} } },
                "boost" : "3"
            },
            {
                "filter" : { "range" : { "age" : {"from" : 10, "to" : 20} } },
                "script" : "_score * doc['my_numeric_field'].value / pow(param1, param2)"
            }
        ],
        "score_mode" : "first",
        "params" : {
        	"param1" : 2,
 	       	"param2" : 3.1
    	}
    }

becomes:

    "function_score" : {
        "query" : {
            "match_all" : {}
       	},
        "functions" : [
            {
                "filter" : { "range" : { "age" : {"from" : 0, "to" : 10} } },
                "boost" : "3"
            },
            {
                "filter" : { "range" : { "age" : {"from" : 10, "to" : 20} } },
                "script_score" : {
                	"script" : "_score * doc['my_numeric_field'].value / pow(param1, param2)",
                	"params" : {
        				"param1" : 2,
 	       				"param2" : 3.1
    				}

            	}
            }
        ],
        "score_mode" : "first"
    }
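
The first two rewrites above are mechanical enough to sketch as a small converter. The Python function `to_function_score` below is a hypothetical helper operating on query dictionaries, not part of Elasticsearch, and it deliberately leaves out the `custom_filters_score` case.

```python
def to_function_score(query):
    """Rewrite an old custom scoring query dict into function_score form."""
    name, body = next(iter(query.items()))
    body = dict(body)  # shallow copy so the input is left untouched
    if name == "custom_boost_factor":
        # Same body, new keyword.
        return {"function_score": body}
    if name == "custom_score":
        # Script, params and lang move into a nested script_score object.
        script_score = {"script": body.pop("script")}
        for key in ("params", "lang"):
            if key in body:
                script_score[key] = body.pop(key)
        body["script_score"] = script_score
        return {"function_score": body}
    raise ValueError(f"unsupported query type: {name}")
```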

Partially closes issue #3423
2013-08-06 18:37:34 +02:00
Luca Cavanna e1c739fe6f Improved test, printed out potential shard failures 2013-08-06 16:24:29 +02:00
Alexander Reelsen 0db2db612b RPM Init script bugfix, which might prevent startup
Removed dangerous set calls, which might not restore the previous state but set something invalid, causing the script to stop when proceeding
2013-08-06 16:19:53 +02:00
Luca Cavanna a3071540d7 Added support for readable_format parameter when printing out time and size values
The following APIs are affected by this change and now support the readable_format flag (default false when not specified):
- indices segments
- indices stats
- indices status
- cluster nodes stats
- cluster nodes info

Closes #3432
2013-08-06 16:08:47 +02:00
Shay Banon ebb4bcd45e add 0.90.4 2013-08-06 15:28:02 +02:00
Alexander Reelsen 68b77c1ae3 Included only runtime dependencies when copying
This makes sure, that no test dependencies are placed in the distribution
2013-08-06 15:13:25 +02:00
Martijn van Groningen fec196b8d8 Better check for verifying that the _percolator type is removed 2013-08-06 14:20:36 +02:00
Boaz Leskes 43e374f793 Maxing out retries on conflict in bulk update causes null pointer exceptions
Also:
Bulk update performed one retry less than requested
Documentation for retries on conflict says it defaults to 1 (but the default is 0)
TransportShardReplicationOperationAction methods now catch Throwables instead of exceptions
Added a little extra check to UpdateTests.concurrentUpdateWithRetryOnConflict
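
The intended retry semantics (one initial attempt plus the requested number of retries, defaulting to 0) can be sketched as below. This is a Python illustration only: `VersionConflict`, `update_with_retries` and the `apply_update` callable are hypothetical names, not the Java implementation.

```python
class VersionConflict(Exception):
    """Raised by the hypothetical update callable when a version clash occurs."""


def update_with_retries(apply_update, retry_on_conflict=0):
    """Try once, then retry up to retry_on_conflict more times on conflict."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return apply_update(), attempts
        except VersionConflict:
            # attempts - 1 retries have been used so far.
            if attempts > retry_on_conflict:
                raise
```

With `retry_on_conflict=2` the update is attempted at most three times; comparing with `>=` instead of `>` would give the off-by-one behaviour of one retry less than requested.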

Closes #3447 & #3448
2013-08-06 13:06:06 +02:00
Luca Cavanna 636c35d0d4 Added missing metadata fields to upserted documents (parent, routing, ttl, timestamp, version and versionType)
Closes #3444
2013-08-06 12:00:44 +02:00
Simon Willnauer 88a0e4628a Catch RejectedExecutionException in outer ping request 2013-08-05 23:33:38 +02:00
Martijn van Groningen a237eead55 If the _percolator has been removed then also remove percolator queries. 2013-08-05 18:43:11 +02:00
Simon Willnauer 1983a3676a Use domain specific assertions for shard failures across tests 2013-08-05 17:50:24 +02:00
Simon Willnauer df747836d8 Use busy sleeps in NoMasterNodeTests
The busy sleep is less prone to slow tests / machines while still
failing if the actual condition isn't met.
2013-08-05 16:50:45 +02:00
Simon Willnauer d949f67241 Add better assertion reporting if nodes are not present in the ClusterState 2013-08-05 15:40:54 +02:00
Martijn van Groningen e55dab94ea the ttl purger might have already deleted the documents. 2013-08-05 14:22:47 +02:00
Shay Banon d7922b8554 Streamline Search / Broadcast (count, suggest, refresh, ...) APIs header
closes #3441
2013-08-05 12:55:38 +02:00
Simon Willnauer 539ffb9ef5 Fix occasionally hanging test moving away from timeouts.
Fixes EsExecutorTests to use latches and a busy wait util from
ElasticsearchTestCase. This commit also adds some minor randomization
to the test.
2013-08-05 11:43:48 +02:00
Simon Willnauer 094c10d62d Added busy waiting util and add suite timeout.
Some rare tests require to busy-wait a short time until a given
condition occurs for instance until a threadpool scaled down the
number of threads. This commit adds a util that waits a given time
until a condition is met. In contrast to Thread.sleep, this method
increases the wait time iteratively by doubling it, to prevent fast
tests from always waiting a full sleep interval.
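
A minimal Python sketch of such a busy-wait helper is shown below. The name `await_busy`, its parameters and the initial sleep of one millisecond are assumptions; the util in ElasticsearchTestCase is written in Java and may differ in detail.

```python
import time


def await_busy(condition, max_wait_seconds=10.0, initial_sleep=0.001):
    """Poll condition with doubling sleeps; return True once it holds."""
    deadline = time.monotonic() + max_wait_seconds
    sleep = initial_sleep
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(sleep, remaining))
        sleep *= 2  # back off: fast conditions return early, slow ones poll less often
```

A test would then assert on the result, for example `assert await_busy(lambda: pool_size() == 1)`, where `pool_size` stands for whatever state the test observes.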

This commit also adds a suite timeout to fail a test if the test
times out. The test infrastructure will provide thread stack traces
if the timeout kicks in. The default timeout is set to 1h.
2013-08-05 11:43:47 +02:00
Alexander Reelsen 9c7a87f118 Overwriting pidfile on startup
The current implementation does not overwrite the pidfile, but only prepends the new PID to it.
So if the process ID is 4 digits long, but the file is already there with a 5 digit number, the file will contain 5 digits after the write.
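
The difference between writing in place and truncating first can be reproduced with a short Python sketch. The function names and the temporary file are assumptions for the demonstration; the real fix concerns how Elasticsearch writes its pidfile on startup.

```python
import os
import tempfile


def write_pid_in_place(path, pid):
    """Buggy variant: writes at offset 0 without truncating the old content."""
    with open(path, "r+") as f:
        f.write(str(pid))


def write_pid_overwrite(path, pid):
    """Fixed variant: mode "w" truncates the file before writing."""
    with open(path, "w") as f:
        f.write(str(pid))


# Demonstration with a stale five-digit PID left behind by an earlier run.
pidfile = os.path.join(tempfile.mkdtemp(), "example.pid")
write_pid_overwrite(pidfile, 12345)
write_pid_in_place(pidfile, 6789)
with open(pidfile) as f:
    stale = f.read()   # the trailing digit of the old PID survives
write_pid_overwrite(pidfile, 6789)
with open(pidfile) as f:
    fresh = f.read()   # only the new PID remains
```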

Note: If the pidfile still exists, this usually means that either an instance using this pidfile is already running or the process has not finished correctly.

Closes #3425
2013-08-05 11:28:37 +02:00