1904 Commits

Author SHA1 Message Date
Marcus Granström
b7cb479a72 Added doc_as_upsert option to update api.
This option can reduce the amount of data being sent to Elasticsearch.
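
As a rough illustration of the saving (a sketch only; the document fields are made up), a request body using `doc_as_upsert` carries the partial document once, whereas a body with a separate `upsert` document carries the same content twice:

```python
import json

# Hypothetical partial document, used only for illustration.
doc = {"name": "new_name", "counter": 1}

# Without doc_as_upsert: the same content is sent as "doc" and as "upsert".
body_with_upsert = json.dumps({"doc": doc, "upsert": doc})

# With doc_as_upsert: the partial document doubles as the upsert document.
body_doc_as_upsert = json.dumps({"doc": doc, "doc_as_upsert": True})

# Compare the number of characters that would go over the wire.
print(len(body_with_upsert), len(body_doc_as_upsert))
```
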
Closes #3195
2013-06-17 10:23:37 +02:00
Clinton Gormley
27a8083b7d Expose timeout for nodes_info requests in the REST interface
Closes #3191
2013-06-15 19:01:09 +02:00
Adrien Grand
a30d58aae2 Compress PagedBytesAtomicFieldData's termOrdToBytesOffset.
Using MonotonicAppendingLongBuffer instead of a GrowableWriter should help
save several bits per value, especially when the bytes to store have similar
lengths.

Closes #3186
2013-06-15 09:31:23 +02:00
Simon Willnauer
25f19f8b87 Wait for relocations in utility methods 2013-06-14 21:59:43 +02:00
Simon Willnauer
a4fc11b3d1 Wait for Yellow state after indexing 2013-06-14 12:14:43 +02:00
Clinton Gormley
f537b8ccee Change default operator to "or" for "low_freq_operator" and "high_freq_operator" parameters for "common" queries
Closes #3178
2013-06-14 11:08:56 +02:00
Martijn van Groningen
8d59ed3ab0 Use SinglePackedOrdinals over SingleArrayOrdinals to reduce the memory ordinals take for single valued fields in field data.
Closes #3185
2013-06-14 10:16:49 +02:00
Simon Willnauer
b995abfa80 Call DISI#cost() ahead of time to prevent NPE
NotDocIdSet resets the internal DocIdSetIterator to null causing NPE
if cost is called.

Closes #3177
2013-06-14 09:49:30 +02:00
Clinton Gormley
c3332db7d0 Fixed an error message on the terms filter 2013-06-13 19:40:47 +02:00
Simon Willnauer
4e4529f3dc Check if Alias Creation was acknowledged in tests.
If there is a failure during alias creation, the tests don't fail with the
correct exception. This commit simplifies debugging by asserting on the ack
flag.
2013-06-13 15:52:33 +02:00
Simon Willnauer
a654c3d103 Set a hard limit on the number of tokens we run suggestion on
PhraseSuggester can be very slow and CPU intensive if a lot of terms
are suggested. To prevent cluster instability and long-running requests,
this commit adds a hard limit (10 tokens by default): if the query is parsed
into more tokens than that, we just return no correction.
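
The guard can be pictured as follows. This is a minimal sketch, not the PhraseSuggester code: `suggest`, `correct` and `token_limit` are hypothetical names, and whitespace splitting stands in for real query analysis.

```python
DEFAULT_TOKEN_LIMIT = 10  # the default named in this commit message


def suggest(query, correct, token_limit=DEFAULT_TOKEN_LIMIT):
    """Return corrections for `query`, or nothing if it has too many tokens.

    `correct` is a hypothetical callable that does the expensive work.
    """
    tokens = query.split()  # stand-in for the real analysis chain
    if len(tokens) > token_limit:
        return []  # over the hard limit: skip the expensive step entirely
    return correct(tokens)
```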

Closes #3164
2013-06-13 15:12:38 +02:00
Alexander Reelsen
9d3e34b9f9 Allow date format to support groups of built-in patterns
Until now 'named dates' like dateOptionalTime could not be used as part of a
group of date formats. This patch allows grouping them arbitrarily, like this:

* yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||dateOptionalTime
* dateOptionalTime||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd
* yyyy/MM/dd HH:mm:ss||dateOptionalTime||yyyy/MM/dd
* date_time||date_time_no_millis
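
How such a group behaves can be sketched like this, assuming each pattern of the `||`-separated group is tried in turn until one matches. The table that translates the Joda-style patterns and the named format into Python `strptime` formats is an assumption made for this sketch and covers only a few cases:

```python
from datetime import datetime

# Hypothetical translation of a few Joda-style patterns and named formats
# into Python strptime formats; this table is an assumption for the sketch.
STRPTIME_FORMATS = {
    "yyyy/MM/dd HH:mm:ss": ["%Y/%m/%d %H:%M:%S"],
    "yyyy/MM/dd": ["%Y/%m/%d"],
    "dateOptionalTime": ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"],
}


def parse_date(value, format_group):
    """Try every pattern of a '||'-separated group until one matches."""
    for pattern in format_group.split("||"):
        for fmt in STRPTIME_FORMATS[pattern]:
            try:
                return datetime.strptime(value, fmt)
            except ValueError:
                continue  # this pattern does not fit, try the next one
    raise ValueError("no pattern in %r matches %r" % (format_group, value))


group = "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||dateOptionalTime"
print(parse_date("2013/06/13 15:03:55", group))
print(parse_date("2013/06/13", group))
print(parse_date("2013-06-13", group))
```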

Closes #2132
2013-06-13 15:03:55 +02:00
Martijn van Groningen
015d820e53 Made not found logic easier.
Relates #3172
2013-06-13 13:21:36 +02:00
Simon Willnauer
7e2d8f1358 add more verbose assertions to tests 2013-06-13 11:58:28 +02:00
Adrien Grand
c20d44a1ff Forbid usage of Character.codePoint(At|Before) and Collections.sort.
Character.codePointAt and codePointBefore have two versions: one which only
accepts an offset, and one which accepts an offset and a limit. The former can
be dangerous when working with buffers of characters because if the offset
is the last char of the buffer, a char outside the buffer might be used to
compute the code point, so one should always use the version which accepts a
limit.

Collections.sort is wasteful on random-access lists: it dumps data into an
array, sorts the array and then writes the elements back to the list. However, the
sorting can easily be performed in-place by using Lucene's
CollectionUtil.(merge|quick|tim)Sort.
2013-06-13 10:14:35 +02:00
Martijn van Groningen
6d8a85c6af Made get mapping rest response consistent.
Closes #3172
2013-06-13 10:11:06 +02:00
Martijn van Groningen
96af4ee44f Use XConstantScoreQuery instead of ConstantScoreQuery.
Relates to #3167
2013-06-13 10:00:54 +02:00
Boaz Leskes
aa851225e5 Added created flag to index related request classes.
The flag is set to true when a document is new, false when replacing an existing object.

Other minor changes:
Fixed an issue with dynamic gc deletes settings update
Added an assertThrows to ElasticsearchAssertion

Closes #3084 , Closes #3154
2013-06-13 09:10:32 +02:00
Martijn van Groningen
a2de34eead Added filter support to custom_score query.
Closes #3167
2013-06-12 22:41:49 +02:00
Martijn van Groningen
dc0d81b8aa Improves the way the get mapping and get warmer get their data from the master's cluster state copy.
Both apis now also support a `local` parameter that fetches the mapping / warmer from the cluster state of the node that received the request. The `type` option in the get mapping api now also supports wildcards. The warmer api now also supports the `type` option.

Closes #3171
2013-06-12 21:03:47 +02:00
Simon Willnauer
8e33e0e69d Use CFS in any case if index.compound_format is set to true
Lucene's MergePolicies support a noCFSRatio. This commit introduces
support for this ratio via `index.compound_format`. This setting
can parse a boolean value or a value in the interval [0..1] that
is equivalent to the noCFSRatio. The settings `1`, `1.0` and `true`
are equivalent, as are `0`, `0.0` and `false`.
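
The mapping from setting value to ratio described above can be sketched as follows. This is an illustration under the rules stated in this message, not the actual settings parser; `parse_compound_format` is a hypothetical name.

```python
def parse_compound_format(value):
    """Map an index.compound_format value to a noCFSRatio in [0, 1]."""
    text = str(value).strip().lower()
    if text == "true":
        return 1.0
    if text == "false":
        return 0.0
    ratio = float(text)  # raises ValueError for anything non-numeric
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1], got %r" % (value,))
    return ratio
```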

Closes #3166
2013-06-12 20:45:18 +02:00
Simon Willnauer
cb0cf3167c stabilize more tests 2013-06-12 13:25:26 +02:00
Shay Banon
c449fbdd68 missing/exists filters should also work for objects
closes #3141
2013-06-12 04:42:23 +02:00
Simon Willnauer
66cd74d2df Always create index with mapping in test to ensure shards are available 2013-06-11 19:08:33 +02:00
Shay Banon
dac2c559d4 remove the index level class support
fix the test that relies on it, just index the data for each test case
2013-06-11 16:35:13 +02:00
Shay Banon
78fb12bcaa fix the type of the mapping 2013-06-11 14:49:34 +02:00
Shay Banon
3a0f9c6ea3 fix shared cluster to delete templates as well per test run 2013-06-11 14:43:18 +02:00
Shay Banon
1d63ff64c7 simplify parsing code 2013-06-11 13:19:54 +02:00
Shay Banon
41e4ee22e6 Thread pool: rename capacity to queue_size
fixes #3161
2013-06-11 13:07:07 +02:00
Simon Willnauer
7afffbe13b Cleanup String to UTF-8 conversion
Currently we have many different places that convert String to UTF-8
bytes and back. We shouldn't maintain more code than necessary to
do this conversion and rather use Lucene's support for it.
2013-06-10 21:56:24 +02:00
Alexander Reelsen
9323e677bd Cleaning up some tests by using assertHitCount assertion 2013-06-10 16:57:09 +02:00
Simon Willnauer
21945e5060 Ensure all shards return comparable scores for rescore tests 2013-06-10 16:50:10 +02:00
Simon Willnauer
314a3343f9 Add more verbose matchers / asserts to tests 2013-06-10 16:06:04 +02:00
Florian Schilling
f64f7c0c08 Fixed the GeoPointFieldMapper to parse geohashes correctly.
Closes #3073
2013-06-10 12:13:43 +02:00
Simon Willnauer
b9feaa9999 Simplify TestCluster
TestCluster now doesn't use any reference counting anymore and
testcluster names are based on creation time to prevent conflicts if
builds hang.
2013-06-10 12:07:11 +02:00
Britta Weber
11d08ac436 term vector request
================================

Returns information and statistics on terms in the fields of a particular document as stored in the index.

        curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url

	curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or by adding the requested fields in the request body (see example below).

Term information
-------------------------

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be omitted without further warning.
See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.

Term statistics
-------------------------

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
-------------------------

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)
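
To make these definitions concrete, here is a small sketch that recomputes the term and field statistics in memory for the "text" field of the two documents indexed in the Example section of this message. It assumes whitespace tokenization plus lowercasing, as in the example analyzer, and a single shard; the function names are hypothetical.

```python
from collections import Counter

# "text" values of the two example documents from this message.
docs = {
    "1": "twitter test test test ",
    "2": "Another twitter test ...",
}

# Per-document term counts (whitespace tokenizer + lowercase filter).
term_counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}


def term_statistics(term):
    """Document frequency and total term frequency of one term."""
    doc_freq = sum(1 for counts in term_counts.values() if term in counts)
    ttf = sum(counts[term] for counts in term_counts.values())
    return {"doc_freq": doc_freq, "ttf": ttf}


def field_statistics():
    """Document count, sum of document frequencies, sum of total term frequencies."""
    return {
        "doc_count": sum(1 for counts in term_counts.values() if counts),
        "sum_doc_freq": sum(len(counts) for counts in term_counts.values()),
        "sum_ttf": sum(sum(counts.values()) for counts in term_counts.values()),
    }


print(term_statistics("test"))
print(field_statistics())
```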

Behavior
-------------------------

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.

Example
-------------------------

First, we create an index that stores term vectors, payloads etc. :

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                        "type": "string",
                        "term_vector": "with_positions_offsets_payloads",
                        "store" : "yes",
                        "index_analyzer" : "fulltext_analyzer"
                    },
                    "fullname": {
                        "type": "string",
                        "term_vector": "with_positions_offsets_payloads",
                        "index_analyzer" : "fulltext_analyzer"
                    }
                }
            }
        },
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 0
            },
            "analysis": {
                "analyzer": {
                    "fulltext_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "type_as_payload"
                        ]
                    }
                }
            }
        }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "

    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."

    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
        "fields" : ["text"],
        "offsets" : true,
        "payloads" : true,
        "positions" : true,
        "term_statistics" : true,
        "field_statistics" : true
    }'

Equivalently, all parameters can be passed as URI parameters:

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'

Response:

  {
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1",
    "_version" : 1,
    "exists" : true,
    "term_vectors" : {
      "text" : {
        "field_statistics" : {
          "sum_doc_freq" : 6,
          "doc_count" : 2,
          "sum_ttf" : 8
        },
        "terms" : {
          "test" : {
            "doc_freq" : 2,
            "ttf" : 4,
            "term_freq" : 3,
            "pos" : [ 1, 2, 3 ],
            "start" : [ 8, 13, 18 ],
            "end" : [ 12, 17, 22 ],
            "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
          },
          "twitter" : {
            "doc_freq" : 2,
            "ttf" : 2,
            "term_freq" : 1,
            "pos" : [ 0 ],
            "start" : [ 0 ],
            "end" : [ 7 ],
            "payload" : [ "d29yZA==" ]
          }
        }
      }
    }
  }
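
The term information in this response can be cross-checked by hand. The sketch below recomputes positions and character offsets for the text of document "1", assuming plain whitespace tokenization and lowercasing, and decodes one of the base64 payloads from the response:

```python
import base64
import re

text = "twitter test test test "  # "text" value of document "1"

# Collect position and character offsets for every whitespace-separated token.
terms = {}
for position, match in enumerate(re.finditer(r"\S+", text)):
    entry = terms.setdefault(match.group().lower(), {"pos": [], "start": [], "end": []})
    entry["pos"].append(position)
    entry["start"].append(match.start())
    entry["end"].append(match.end())

# Payloads are returned as base64 encoded bytes.
payload = base64.b64decode("d29yZA==").decode("utf-8")

print(terms["test"])
print(payload)
```

The decoded payload is the token type, which is what the `type_as_payload` filter in the example analyzer stores.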

Further changes:
-------------------------

XContentBuilder
new method
public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value)
to put an integer array.

IndicesAnalysisService
make token filter for saving payloads available in elasticsearch

AbstractFieldMapper/TypeParser
make term vector options string available and also fix the parsing of this string:
with_positions_payloads is actually allowed as can be seen in TermVectorsConsumerPerFields.

Closes #3114
2013-06-10 11:09:11 +02:00
Simon Willnauer
945b89fd80 Don't test the test - who tests the test for the test? ;) 2013-06-07 20:40:50 +02:00
Simon Willnauer
b222e83d2b Stabilize more tests 2013-06-07 20:33:17 +02:00
Britta Weber
ac75b1bcae Fix addMapping() in AbstractSharedClusterTest for more than one field 2013-06-07 19:05:13 +02:00
Alexander Reelsen
a5f9173e14 Making deb installable by being lintian compatible
According to #2515 the ubuntu software center does not allow installing
debian packages which are not lintian compatible.

I worked on the package and made it lintian compatible by doing

* Ignoring errors about arch dependent binaries as we will not split
  this package. The arch dependent libraries are used correctly.
* Added a copyright file pointing to the apache license in debian

Closes #2515
Closes #2320
2013-06-07 13:53:14 +02:00
Simon Willnauer
962e3d58f7 Added shortcuts for several common commands
added simple way to add more complex mappings as well as shortcuts for flush
and status etc. all checking if requests return failed shards
2013-06-07 12:30:30 +02:00
Martijn van Groningen
8016d32a0e Fixed minor issue in ASCT#indexExists(...) 2013-06-06 21:42:42 +02:00
Martijn van Groningen
e218ead19e ChildrenQuery and ParentQuery now take into account documents that have been marked.
Closes #3144
2013-06-06 17:13:49 +02:00
Simon Willnauer
3b01f812d6 Stabilize more tests
Wait for relocation before checking statistics or running refresh / optimize.
2013-06-06 17:03:36 +02:00
Simon Willnauer
1c513bc262 Fallback to extract terms if MultiPhraseQuery is large
Currently if MPQ is very large highlighting can take down a node
or cause high CPU / RAM consumption. If the query grows > 16 terms
we just extract the terms and do term by term highlighting.

Closes  #3142 #3128
2013-06-06 11:22:49 +02:00
Simon Willnauer
f995c9c130 Correct offsets in FVH also if stored field is used for highlighting
The SimpleFragmentsBuilder did not correct offsets if the
analysis chain in use could produce broken offsets that could lead to
StringArrayIndexOutOfBounds Exceptions

Closes #3140
2013-06-06 10:23:09 +02:00
Simon Willnauer
00c13532a9 report details if shard response has failed shards 2013-06-06 00:54:34 +02:00
Martijn van Groningen
7936417270 Added a benchmark for parent/child queries while indexing at the same time. 2013-06-05 22:27:18 +02:00
Martijn van Groningen
82ff1c6802 Fixed has_parent query and filter returning no results with multi level child docs. 2013-06-05 22:12:26 +02:00
Simon Willnauer
56dfa96851 More test cleanups 2013-06-05 15:45:03 +02:00