22 Commits

Author SHA1 Message Date
Adrien Grand
5821fa042c Cardinality aggregation.
This aggregation computes unique term counts using the hyperloglog++ algorithm
which uses linear counting to estimate low cardinalities and hyperloglog on
higher cardinalities.

Since this algorithm works on hashes, it is useful for high-cardinality fields
to store the hash of values directly in the index, which is the purpose of
the new `murmur3` field type. This is less necessary on low-cardinality
string fields because the aggregator is smart enough to only compute the hash
once per unique value per segment thanks to ordinals, or on numeric fields
since hashing them is very fast.
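
The idea of switching between linear counting and hyperloglog can be sketched roughly as follows. This is a minimal illustration of plain HyperLogLog with a linear-counting fallback, not the hyperloglog++ implementation added by this commit (it has no bias correction and no sparse encoding). The function names, the `precision` parameter, and the use of `blake2b` as a stand-in for murmur3 are all assumptions made for the example.

```python
import hashlib
import math


def hash64(value: str) -> int:
    """Stand-in 64-bit hash. The commit relies on murmur3; blake2b is an assumption."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_cardinality(values, precision: int = 12) -> float:
    """Estimate the number of distinct values (hypothetical helper)."""
    m = 1 << precision                 # number of registers
    tail_bits = 64 - precision         # bits left after picking a register
    registers = [0] * m

    for value in values:
        h = hash64(value)
        index = h >> tail_bits                     # top bits select the register
        rest = h & ((1 << tail_bits) - 1)          # remaining bits
        rank = tail_bits - rest.bit_length() + 1   # position of the leftmost 1-bit
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)   # HyperLogLog constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Low cardinality: linear counting on the share of empty registers.
        return m * math.log(m / zeros)
    return raw
```

Calling `estimate_cardinality(["a", "b", "a"])` gives an estimate close to 2. Duplicates do not change the result, because each value always hashes to the same register and rank.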

Close #5426
2014-03-13 19:19:56 +01:00
Simon Willnauer
da707b6f32 Remove omit_term_freq_and_positions for new indices
`omit_term_freq_and_positions` was deprecated in `0.20` and
is not documented anymore. We should reject indices that are
created with this option in the future.

Closes #4722
2014-01-17 14:46:48 +01:00
Lee Hinman
2341825358 Make type wrapping optional for PUT Mapping API request
Put mapping now supports either of these formats:

POST foo/doc/_mapping
{
  "doc": {
    "_routing": {"required": true},
    "properties": {
      "body": {"type": "string"}
    }
  }
}

or

POST foo/doc/_mapping
{
  "_routing": {"required": true},
  "properties": {
    "body": {"type": "string"}
  }
}
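
One way to treat the two request formats above as equivalent is to wrap the body in the type name when it is not already wrapped. The sketch below is an illustration only. `normalize_mapping` is a hypothetical helper, and the actual check in Elasticsearch may differ.

```python
def normalize_mapping(type_name: str, body: dict) -> dict:
    """Return the mapping in its wrapped form, {type_name: {...}}.

    Hypothetical helper: a body is considered wrapped when its only
    top-level key is the type name itself.
    """
    if len(body) == 1 and type_name in body:
        return body
    return {type_name: body}


wrapped = {
    "doc": {
        "_routing": {"required": True},
        "properties": {"body": {"type": "string"}},
    }
}
unwrapped = {
    "_routing": {"required": True},
    "properties": {"body": {"type": "string"}},
}

# Both request formats normalize to the same mapping.
assert normalize_mapping("doc", wrapped) == normalize_mapping("doc", unwrapped)
```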

Closes #4483
2014-01-13 09:26:09 -07:00
Martijn van Groningen
943b62634c Replaced the multi-field type in favour of the multi fields option that can be set on any core field.
When upgrading to ES 1.0, existing mappings with a multi-field type are automatically replaced by a core field with the new `fields` option.

If a `multi_field` typed field doesn't have a main / default field, a default field will be chosen for the multi fields syntax. The new main field type
will be equal to the type of the first field of the `multi_field`, or `string` if no fields have been configured for the `multi_field` field. In both cases
the default field will not be indexed (`index=no` is set on the default field).

If a `multi_field` typed field has a default field, that field will replace the `multi_field` typed field.
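
The upgrade rules above can be sketched as a small conversion function. This is an illustration of the described behaviour, not the code from this commit. `upgrade_multi_field` is a hypothetical name, and the sketch assumes the default field is the sub-field that has the same name as the `multi_field` itself.

```python
def upgrade_multi_field(name: str, mapping: dict) -> dict:
    """Convert a `multi_field` mapping to a core field with `fields` (sketch)."""
    if mapping.get("type") != "multi_field":
        return mapping  # nothing to upgrade

    fields = dict(mapping.get("fields", {}))
    if name in fields:
        # A default field exists: it replaces the multi_field typed field.
        main = dict(fields.pop(name))
    else:
        # No default field: take the type of the first field, or `string`
        # if no fields are configured, and do not index the main field.
        first = next(iter(fields.values()), None)
        main_type = first.get("type", "string") if first else "string"
        main = {"type": main_type, "index": "no"}

    if fields:
        main["fields"] = fields
    return main


old = {
    "type": "multi_field",
    "fields": {
        "title": {"type": "string", "index": "analyzed"},
        "raw": {"type": "string", "index": "not_analyzed"},
    },
}
new = upgrade_multi_field("title", old)
```

Here `new` is a `string` field that keeps `"index": "analyzed"` and carries `raw` under its `fields` option.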

Closes to #4521
2014-01-13 09:21:53 +01:00
Simon Willnauer
10ec2e948a Fix ASL Header in source files to reflect s/ElasticSearch/Elasticsearch
This commit also removes the license to Shay Banon in favor of solely
Elasticsearch. Thanks, Shay, for this awesome product; you took it far!

Closes #4636
2014-01-07 11:22:01 +01:00
Nik Everett
7690b40ec6 Allow string fields to store token counts
To use this one you send a string to a field of type 'token_count'.  This
makes the most sense with a multi-field.
2013-12-03 09:39:32 +01:00
Shay Banon
6f90a3e39a allow to parse directly the compressed mapping 2013-11-26 09:48:33 +01:00
Shay Banon
021aa09614 External method to set rootTypeParsers in DocumentMapperParser incorrect
fixes #4113
2013-11-07 01:06:57 +01:00
Boaz Leskes
0ef2493b2c Throw an exception if a type's mapping root node is not equal to the type in question.
Also, fix all the problems it brought up in tests.
Removed OverrideTypeMappingTests as it is no longer relevant.
Better naming for the default percolator mapping and change its content to use _default_ as root node.

Closes #4038
2013-11-05 11:54:25 +01:00
Adrien Grand
4fa8f6f61f Doc values integration.
This commit allows for using Lucene doc values as a backend for field data,
moving the cost of building field data from the refresh operation to indexing.
In addition, Lucene doc values can be stored on disk (partially, or even
entirely), so that memory management is done at the operating system level
(file-system cache) instead of the JVM, avoiding long pauses during major
collections due to large heaps.

So far doc values are supported on numeric types and non-analyzed strings
(index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values
which is the only type to support multi-valued fields. Since the field data API
set is a bit wider than the doc values API set, some operations are not
supported:
 - field data filtering: this will fail if doc values are enabled,
 - field data cache clearing, even for memory-based doc values formats,
 - getting the memory usage for a specific field,
 - knowing whether a field is actually multi-valued.

This commit also allows for configuring doc-values formats on a per-field basis
similarly to postings formats. In particular the doc values format of the
_version field can be configured through its own field mapper (it used to be
handled in UidFieldMapper previously).

Closes #3806
2013-10-09 16:34:30 +02:00
Simon Willnauer
f2dc4f810c Added tests for malformed mappings with no root object
This commit also makes the error message more consistent with
other exception messages in the DocumentMapperParser.
2013-08-07 14:01:32 +02:00
Manuel Bernhardt
27518b5e41 Improved error message when the mapping document is malformed 2013-08-07 13:41:49 +02:00
Alexander Reelsen
4f4f3a2b10 Added prefix suggestions based on AnalyzingSuggester
This commit introduces near realtime suggestions. For more information about
its usage refer to github issue #3376

From the implementation point of view, a custom AnalyzingSuggester is used
in combination with a custom postingsformat (which is not exposed to the user
anywhere).

Closes #3376
2013-08-01 08:44:09 +02:00
Shay Banon
1d63ff64c7 simplify parsing code 2013-06-11 13:19:54 +02:00
Chris Male
9e2469e04f Add per-field Similarity support 2012-11-21 12:44:59 +13:00
Martijn van Groningen
fd5bd102aa lucene 4: Exposed Lucene's codec api
This feature adds the option to configure a `PostingsFormat` and assign it to a field in the mapping. This feature is intended for expert users; in almost all cases Elasticsearch's defaults will suit your needs.

## Configuring a postingsformat per field

Several postings formats are configured by default and can be used in your mapping:
* `direct` - A codec that wraps the default postings format during write time, but loads the terms and postinglists directly into memory during read time as raw arrays. This postings format is exceptionally memory intensive, but can give a substantial increase in search performance.
* `memory` - A codec that loads and stores terms and postinglists in memory using a FST. Acts like a cached postingslist.
* `bloom_default` - Maintains a bloom filter for the indexed terms, which is stored to disk and builds on top of the `default` postings format. This postings format is useful for low document frequency terms and offers a fail fast for seeks to terms that don't exist.
* `bloom_pulsing` - Similar to the `bloom_default` postings format, but builds on top of the `pulsing` postings format.
* `default` - The default postings format. The default if none is specified.

On all fields it is possible to configure a `postings_format` attribute. Example mapping:
```
{
  "person" : {
     "properties" : {
         "second_person_id" : {"type" : "string", "postings_format" : "pulsing"}
     }
  }
}
```

## Configuring a custom postingsformat
It is possible to instantiate custom postingsformats. This can be specified via the index settings.
```
{
   "codec" : {
      "postings_format" : {
         "my_format" : {
            "type" : "pulsing40",
            "freq_cut_off" : "5"
         }
      }
   }
}
```
In the above example the `freq_cut_off` is set to 5 (defaults to 1). This tells the pulsing postings format to inline the postinglist of terms with a document frequency lower than or equal to 5 in the term dictionary.

Closes #2411
2012-11-14 23:54:29 +01:00
Shay Banon
6c3847b0a9 move spatial4j and jts to be optional dependencies
allowing data and client nodes to work without them, disabling shapes if needed
2012-09-01 00:05:49 +02:00
Chris Male
bea4346f3a Added GeoShape indexing and querying support 2012-08-13 13:44:29 +02:00
Shay Banon
bb0f5cf234 improve map builder to initialize the inner map with a map to build the data from 2012-05-20 19:50:47 +02:00
Shay Banon
acbd7b686a Allow to customize quote analyzer to be used when quoting text in a query_string, closes #1931. 2012-05-10 11:51:51 +03:00
Shay Banon
6a71eab51f finalize structure, tests pass 2011-12-06 02:43:17 +02:00
Shay Banon
a8fd2d48b8 first cleanup phase, move to single src 2011-12-06 00:59:23 +02:00