OpenSearch

mirror of https://github.com/honeymoose/OpenSearch.git synced 2025-02-17 18:35:25 +00:00

Author	SHA1	Message	Date
Simon Willnauer	876b5a3dcd	prefer totalTermFrequency over docFreq in PhraseSuggester	2013-03-05 10:46:25 +01:00
Simon Willnauer	315744be55	Set shardSize according to the total size if not explicitly specified. Closes #2729	2013-03-05 09:22:23 +01:00
Shay Banon	3e264f6b95	cleanup deletion of content in shards: we are very conservative about when we delete data, so remove the options for deleting data that are not used	2013-03-04 20:41:19 -08:00
Shay Banon	1ed07c1794	add a list of files that exist in the index to the failure	2013-03-04 18:15:06 -08:00
Shay Banon	d609571897	add close method to field data	2013-03-04 16:42:29 -08:00
Shay Banon	cfd8bddde4	Remove JMX connector creation flags, and JMX attributes closes #2728	2013-03-04 16:12:18 -08:00
Shay Banon	774622abfb	Change field data stats header from `field_data` to `fielddata`. fixes #2727	2013-03-04 23:50:33 +01:00
Shay Banon	d2dc672f43	allow to specify a list of settings to get a value for	2013-03-04 23:41:43 +01:00
Drew Raines	a8d52b58b6	Remove obsolete test.	2013-03-04 15:22:40 -06:00
Andrii Gakhov	dc28151ad7	fixed interchanged values in field_data stats fixes #2724	2013-03-04 11:19:33 +01:00
Shay Banon	a1b2434339	revert change on listing plugins on /_plugin; we should provide it as part of nodes info. relates to #2664	2013-03-03 21:52:44 +01:00
Shay Banon	a7da27c714	Field Data: Add `node` level cache type closes #2722	2013-03-03 19:55:06 +01:00
Shay Banon	e01879a698	add evictions stats to field data	2013-03-03 18:41:17 +01:00
Simon Willnauer	e9ba98913b	simplify searchShard selection when routing is present	2013-03-03 14:32:19 +01:00
Benjamin Devèze	09f20e3d4c	Fix bug when searching concrete and routing aliased indices Closes #2683	2013-03-03 14:31:57 +01:00
uboness	881cb7900c	Change geo_shapes support: * Exposed the spatial strategy to be configurable as part of the geo_shape mappings * Exposed the spatial strategy to be customizable at query time (will be used to generate the geo_shape filter/query) * Removed XTermQueryPrefixTreeStrategy and reverted to use the lucene TermQueryPrefixTreeStrategy instead * Made the RecursivePrefixTreeStrategy the default strategy to be used * Removed support for all spatial operations except "intersects" * Updated both the GeoShapeQueryBuilder and GeoShapeFilterBuilder with all the changes (removed the option of specifying the operation type, as only intersects is supported, and added the option of setting the filter/query spatial strategy) Closes #2720	2013-03-02 17:13:58 +01:00
Simon Willnauer	b9513511e0	Check for null query on Percolator query loading and omit the query if it can't be parsed. Closes #2547	2013-03-02 16:55:39 +01:00
Shay Banon	0be5a7888f	fix local flag in cluster health	2013-03-02 16:00:10 +01:00
Shay Banon	5dd18acd0e	proper reason for cluster state task	2013-03-02 15:48:01 +01:00
Shay Banon	50d121315b	add ability for cluster health to wait for current events to be processed; helps with tests that run on slow machines	2013-03-02 14:25:45 +01:00
tristanbuckner	9273d76cdf	Make BoolFilterBuilder output proper json	2013-03-02 01:07:50 +01:00
Shay Banon	ea097afd91	add proper testing for bool filter	2013-03-02 01:07:05 +01:00
Shay Banon	361d6bf89a	spin a bit to wait for condition in test, so slow machines will still run it correctly	2013-03-01 23:36:13 +01:00
Shay Banon	fe8b3725bb	lazy set the indices on the search request now that it's validated	2013-03-01 22:45:59 +01:00
Shay Banon	6687ecb038	Query DSL: Filtered query to make query optional (defaults to match_all) closes #2718	2013-03-01 22:40:22 +01:00
Matt Weber	dfd92265b7	Correct order of routing and parent params for Get. The order in which routing and parent parameters are set is important. The routing parameter must be set first or it will overwrite the parent routing value.	2013-03-01 22:24:14 +01:00
Shay Banon	2eea99255d	Analyze API returns in YAML format if analyzed string begins with --- fixes #2624	2013-03-01 22:17:09 +01:00
Shay Banon	9b68e98ea2	more strict check before trying to parse and detect a string as a date fixes #2694	2013-03-01 22:15:32 +01:00
Jeremy Jongsma	d16efbe47f	Throw correct ClassNotFoundException to debug classloader issues	2013-03-01 21:56:59 +01:00
Simon Willnauer	aaa3c48b3c	Throw IAE if indices is null or contains a null value. Closes #2656	2013-03-01 21:26:23 +01:00
Simon Willnauer	fced68c22d	ensure that suggestions are only added on reduce if they are present in the shard response	2013-03-01 21:09:10 +01:00
Martijn van Groningen	d99b532f0f	Supporting sort modes `avg` and `sum` when sorting inside nested objects. Prior to this commit, either sort mode `min` or `max` (depending on sort order) was used when sort modes `avg` and `sum` were picked. Closes #2701	2013-03-01 19:53:20 +01:00
Simon Willnauer	39f362326e	Short circuit response if no indices exist and make sure listener is notified. Closes #2692	2013-03-01 15:15:56 +01:00
Simon Willnauer	3c1f291801	Fail in metadata parsing if the id path is not a value but rather an array or an object. Closes #2275	2013-03-01 13:00:29 +01:00
Simon Willnauer	b03f3fcd6c	throw IAE if fieldname is null - Closes #2711	2013-03-01 12:10:07 +01:00
Simon Willnauer	9c3898900d	always use the max score across the shards in suggest response	2013-03-01 12:09:29 +01:00
Shay Banon	30075bb6f9	add info in test for actual search failures	2013-03-01 00:00:09 +01:00
Shay Banon	849a3677cd	improve timing in test to wait for state with graceful timeouts (yet, validate early and exit when relevant)	2013-02-28 23:44:52 +01:00
Simon Willnauer	c90c5cbf85	fix bug in StupidBackoffScorer where previous word and current word were flipped, creating a non-existing bigram	2013-02-28 21:23:41 +01:00
Simon Willnauer	b4b3e350a6	Expose _explain via POST Closes #2710	2013-02-28 18:19:08 +01:00
Simon Willnauer	d4ec03ed76	# Phrase Suggester

The `term` suggester provides a very convenient API to access word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while suggest selection is left to the API consumer. Yet often already ranked / selected suggestions are required in order to present them to the end user. Inside ElasticSearch we have the ability to access far more statistics and information quickly to make a better decision about which token alternative to pick, or whether to pick an alternative at all.

This `phrase` suggester adds some logic on top of the `term` suggester to select entire corrected phrases instead of individual tokens, weighted based on n-gram language models. In practice it will be able to make better decisions about which tokens to pick based on co-occurrence and frequencies. The current implementation is kept quite general and leaves room for future improvements.

# API Example

The `phrase` request is defined alongside the query part in the JSON request:

```json
curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'
```

The response contains suggestions sorted with the most likely spell correction first. In this case we got the expected correction `xorr the god jewel` first, while the second correction is less conservative: only one of the errors is corrected. Note that the request is executed with `max_errors` set to `0.5`, so 50% of the terms can contain misspellings (see parameter descriptions below).

```json
{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}
```

# Phrase suggest API

## Basic parameters

* `field` - the name of the field used to do n-gram lookups for the language model. The suggester uses this field to gather statistics to score corrections.
* `gram_size` - sets the maximum size of the n-grams (shingles) in the `field`. If the field doesn't contain n-grams (shingles) this should be omitted or set to `1`.
* `real_word_error_likelihood` - the likelihood of a term being misspelled even if the term exists in the dictionary. The default is `0.95`, corresponding to 5% of the real words being misspelled.
* `confidence` - the confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of `1.0` will only return suggestions that score higher than the input phrase. If set to `0.0` the top N candidates are returned. The default is `1.0`.
* `max_errors` - the maximum percentage of the terms that can be considered misspellings in order to form a correction. This parameter accepts a float value in the range `[0..1)` as a fraction of the actual query terms, or a number `>=1` as an absolute number of query terms. The default is `1.0`, which means that only corrections with at most 1 misspelled term are returned.
* `separator` - the separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.
* `size` - the number of candidates that are generated for each individual query term. Low numbers like `3` or `5` typically produce good results. Raising this can bring up terms with higher edit distances. The default is `5`.
* `analyzer` - sets the analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field passed via `field`.
* `shard_size` - sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase only the top N suggestions are returned based on the `size` option. Defaults to `5`.
* `text` - sets the text / query to provide suggestions for.

## Smoothing Models

The `phrase` suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (that appear at least once in the index).

* `laplace` - the default model, which uses an additive smoothing model where a constant (typically `1.0` or smaller) is added to all counts to balance weights. The default `alpha` is `0.5`.
* `stupid_backoff` - a simple backoff model that backs off to lower order n-gram models if the higher order count is `0` and discounts the lower order n-gram model by a constant factor. The default `discount` is `0.4`.
* `linear_interpolation` - a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear interpolation doesn't have any default values. All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`) must be supplied.

## Candidate Generators

The `phrase` suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a `term` suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.

Currently only one type of candidate generator is supported, the `direct_generator`. The phrase suggest API accepts a list of generators under the key `direct_generator`; each of the generators in the list is called per term in the original text.

## Direct Generators

The direct generators support the following parameters:

* `field` - the field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
* `analyzer` - the analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* `size` - the maximum number of corrections to be returned per suggest text token.
* `suggest_mode` - the suggest mode controls which suggestions are included, or for which suggest text terms suggestions should be made. Three possible values can be specified:
    * `missing` - only suggest terms in the suggest text that aren't in the index. This is the default.
    * `popular` - only suggest suggestions that occur in more docs than the original suggest text term.
    * `always` - suggest any matching suggestions based on terms in the suggest text.
* `max_edits` - the maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
* `min_prefix` - the minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms.
* `min_query_length` - the minimum length a suggest text term must have in order to be included. Defaults to 4.
* `max_inspections` - a factor that is multiplied with the `shard_size` in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
* `threshold_frequency` - the minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.
* `max_query_frequency` - the maximum threshold in number of documents a suggest text token can exist in, in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked. High frequency terms are usually spelled correctly; on top of this, excluding them also improves the spellcheck performance. The shard level document frequencies are used for this option.
* `pre_filter` - a filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. (optional)
* `post_filter` - a filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. (optional)

The following example shows a `phrase` suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a `reverse` filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The `pre_filter` and `post_filter` options accept ordinary analyzer names.

```json
curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_len" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'
```

`pre_filter` and `post_filter` can also be used to inject synonyms after candidates are generated. For instance, for the query `captain usq` we might generate a candidate `usa` for the term `usq`, which is a synonym for `america`. This allows us to present `captain america` to the user if this phrase scores high enough.

Closes #2709	2013-02-28 16:17:59 +01:00
Shay Banon	2bc624806d	not bytes...	2013-02-28 16:02:38 +01:00
Shay Banon	7400c30eba	fail a shard if a merge failure occurs	2013-02-27 23:44:55 +01:00
Shay Banon	e908c723f1	don't log merge failures twice	2013-02-27 20:23:40 +01:00
Simon Willnauer	7be8f431d5	move id tests into SimpleQueryTests	2013-02-27 19:03:42 +01:00
Simon Willnauer	8ab602ec81	Fix AIOOB exception in UID type/id tuple creation. Closes #2695	2013-02-27 18:58:27 +01:00
Shay Banon	3b2d403292	malformed elasticsearch.yml causes unresponsive hang fixes #2693	2013-02-27 18:58:08 +01:00
Drew Raines	cb7a569f4b	Include preference in _count serialization and builder. [#2698]	2013-02-27 08:15:02 -06:00
Martijn van Groningen	ffbdc0a4c3	Updated postings format jdocs	2013-02-27 10:46:55 +01:00
Drew Raines	b53a8aff6a	Allow _count to take preference parameter. [#2698]	2013-02-26 16:24:52 -06:00


1385 Commits