We currently pull in the lucene-expression module that is referenced
by lucene-suggest. Yet, we don't make use of this dependency at all
and it pulls in a bunch of unshaded libs like `antlr` and `asm` which
are pretty common in other projects. We should exclude this
dependency since we don't use it at all and it causes problems
when Elasticsearch is used as a node client. (see #4858)
If we mark the dependency as provided it won't be included in the
distribution.
Closes#4859Closes#4858
MockPageCacheRecycler is missing in test jar which makes failing tests when using
test jar in plugins:
```
1> [2014-01-11 10:51:30,531][ERROR][test ] FAILURE : testWikipediaRiver(org.elasticsearch.river.wikipedia.WikipediaRiverTest)
1> REPRODUCE WITH : mvn test -Dtests.seed=5DAFD4FBAE587363 -Dtests.class=org.elasticsearch.river.wikipedia.WikipediaRiverTest -Dtests.method=testWikipediaRiver -Dtests.prefix=tests -Dtests.network=true -Dfile.encoding=MacRoman -Duser.timezone=Europe/Paris -Des.logger.level=INFO -Des.node.local=true -Dtests.cluster_seed=134842C2D806FFC0
1> Throwable:
1> java.lang.NoClassDefFoundError: org/elasticsearch/cache/recycler/MockPageCacheRecycler
1> org.elasticsearch.test.cache.recycler.MockPageCacheRecyclerModule.configure(MockPageCacheRecyclerModule.java:30)
1> org.elasticsearch.common.inject.AbstractModule.configure(AbstractModule.java:60)
```
Closes#4694.
* Clean up s/ElasticSearch/Elasticsearch on docs/*
* Clean up s/ElasticSearch/Elasticsearch on src/* bin/* & pom.xml
* Clean up s/ElasticSearch/Elasticsearch on NOTICE.txt and README.textile
Closes#4634
Refactor cache recycling so that it only caches large arrays (pages) that can
later be used to build more complex data-structures such as hash tables.
- QueueRecycler now takes a limit like other non-trivial recyclers.
- New PageCacheRecycler (inspired of CacheRecycler) has the ability to cache
byte[], int[], long[], double[] or Object[] arrays using a fixed amount of
memory (either globally or per-thread depending on the Recycler impl, eg.
queue is global while thread_local is per-thread).
- Paged arrays in o.e.common.util can now optionally take a PageCacheRecycler
to reuse existing pages.
- All aggregators' data-structures now use PageCacheRecycler:
- for all arrays (counts, mins, maxes, ...)
- LongHash can now take a PageCacheRecycler
- there is a new BytesRefHash (inspired from Lucene but quite different,
still; for instance it cheats on BytesRef comparisons by using Unsafe)
that also takes a PageCacheRecycler
Close#4557
Test can be run with `-Dtests.assertion.disabled=org.elasticsearch`
to run the tests without assertions to make sure assertions
don't hide any assignements etc. that introduce bugs in production.
We are mocking out some functionality to add assertions etc. or
randomize store types. We should randomly run with our defaults to make
sure we don't hide any potential problems.
The REST layer can now be tested through tests that are shared between all the elasticsearch official clients.
The tests are based on REST specification that can be found on the elasticsearch-rest-api-spec project and consist of YAML files that describe the operations to be executed and the obtained results that need to be tested.
REST tests can be executed through the ElasticsearchRestTests class, which relies on the rest-spec git submodule that contains the rest spec and tests pulled from the elasticsearch-rest-spec-api project. The rest-spec submodule gets automatically initialized and updated through maven (generate-test-resources phase).
The REST runner and the needed classes are distributed within the test artifact.
The following are the options supported by the REST tests runner:
- tests.rest[true|false|host:port]: determines whether the REST tests need to be run and if so whether to rely on an external cluster (providing host and port) or fire a test cluster (default)
- tests.rest.suite: comma separated paths of the test suites to be run (by default loaded from /rest-spec/test classpath). it is possible to run only a subset of the tests providing a sub-folder or even a single yaml file (the default /rest-spec/test prefix is optional when files are loaded from classpath) e.g. -Dtests.rest.suite=index,get,create/10_with_id
- tests.rest.spec: REST spec path (default /rest-spec/api from classpath)
- tests.iters: runs multiple iterations
- tests.seed: seed to base the random behaviours on
- tests.appendseed[true|false]: enables adding the seed to each test section's description (default false)
- tests.cluster_seed: seed used to create the test cluster (if enabled)
Closes#4469
In order to be sure that memory mapped lucene directories are working
one can configure the kernel about how many memory mapped areas
a process may have. This setting ensure for the debian and redhat initscripts
as well as the systemd startup, that this setting is set high enough.
Closes#4397
We have the situation that some tests fail since they don't handle
EsRejectedExecutionException which gets thrown when a node shuts
down. That is ok to ignore this exception and not fail.
We also suffer from OOMs that can't create native threads but don't
get threaddumps for those failures. This patch prints the thread
stacks once we catch a OOM which can' create native threads.
The maven-compiler-plugin upgrade from 2.3.2 to 3.1 (see #4279) could cause out of memory issue when building the project with Maven and JDK6 and default memory settings (no `MAVEN_OPTS`).
This issue does not appear with JDK7.
When we want to use test artifact in other projects, dependencies
are not shaded as for core artifact.
Issue opened in maven shade project: [MSHADE-158](http://jira.codehaus.org/browse/MSHADE-158)
When using it in other projects, you basically need to change your `pom.xml` file:
```xml
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>${elasticsearch.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
```
You can also define some properties:
```xml
<properties>
<tests.jvms>1</tests.jvms>
<tests.shuffle>true</tests.shuffle>
<tests.output>onerror</tests.output>
<tests.client.ratio></tests.client.ratio>
<es.logger.level>INFO</es.logger.level>
</properties>
```
Closes#4266
This commit upgrades to Lucene 4.6 and contains the following improvements:
* Remove XIndexWriter in favor of the fixed IndexWriter
* Removes patched XLuceneConstantScoreQuery
* Now uses Lucene passage formatters contributed from Elasticsearch in PostingsHighlighter
* Upgrades to Lucene46 Codec from Lucene45 Codec
* Fixes problem in CommonTermsQueryParser where close was never called.
Closes#4241
Currently when importing projects into eclipse you need to run 'mvn
eclipse:eclipse' on the command line to generate the poject files. This
means that when the pom changes you need to re-run the command on the
command line to reflect those changes in the project in eclipse. This
commit allows the developer to import the project as an existing maven
project (can be shared using git after import) and then allows the
application to be run inside eclipse using the .launch file in
/dev-tools enabling easy debugging of the application within eclipse
without requiring a maven build.
This commit adds javadocs and removed unused methods from central
classes like ElasticsearchIntegrationTest. It also changes visibility
of many methods and classes that are only needed inside the test infrastructure.
This commit causes all classes under 'org.elasticsearch.test.*'
to be included in a 'elasticsearch-${version}-test.jar' that can be
inclued by 3rd party projects or plugins via:
```
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>${elasticsearch.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
```
Currently tests only run with node clients but eventually we want to
run all tests with randomly choosen node / transport clients. To enable
this during development and on test servers as a transition phase this
commit adds the ability to allow a fraction of the clients used in
tests to be transport clients. By default this is still disabled.
To enable transport clients in tests pass '-Dtests.client.ratio=[0..1]'
where '1.0' will force transport clients and '0.0' completely disables
them. If an empty string is passed the ratio is chosen at random for
each test.
While testing an async system providing reproducible tests that
use randomized components is a hard task we should at least try to
reestablish the enviroment of a failing test as much as possible.
This commit allows to re-establish the shared 'TestCluster' by
resetting the cluster to a predefined shared state before each test.
Before this commit a tests that is executed in isolation was likely
using a entirely different node enviroment as the failing test since
the 'TestCluster' kept intermediate nodes started by other tests around.
use service id for pid name
disable filtering on *.exe (caused corruption)
rename exe names and add more options to .bat
start/stop operations are now supported (and expected to be called) by service.bat
add more variables from the env to customize default behavior prior to installing the service
add manager option
fixes regarding batch flow
specify service id in description
minor readability improvement
include .exe only in ZIP archive
rename x64 service id to make it work out of the box
add elasticsearch as a service for Windows platforms
based on Apace Commons Daemon
supports both x64 and x86
Compared to setting node.local to true, would be nicer to support node.mode with values of local or network.
Note, node.local is still supported.
closes#3713
The fact that Maven and Eclipse share the same build directories can trigger
race conditions when both are trying to build at the same time, eg. if you run
`mvn clean test` while Eclipse is up and running: Eclipse will notice that some
class files are missing and start compiling in parallel with Maven.
This commit adds support for failing fast when running a test
case with `-Dtests.iters=N` and uses some goodness from LuceneTestCase
in a new base `AbstractRandomizedTest`. This class checks among other
things if a tests doesn't call `super.setup` / `super.tearDown` when it
should do and checks if a large static resources are not cleaned up
after the tests ie. a running node.
For better integration with the Lucene Test Framework and the
availabilty of RandomizedRunner / Randommized Testing this commit
moves over from TestNG to JUnit.
This commit also moves relevant places over to RandomzedRunner for
reproduceability and removes copied classes from the Lucene Test
Framework.
The currently used maven shade plugin still keeps references to the
original classes in their constant pools around. This is never a problem
at runtime, but for dependency tools which try to use the constant pool
for determining dependencies will get confused (OSGI for example). This
patch simply bumps the version and will implicetely fix
fix http://jira.codehaus.org/browse/MSHADE-105Closes#3254Closes#3255
This enforces that settings are taken into account whichever mean is used to
import the project into Eclipse (manual import, m2e, mvn eclipse:eclipse, ...).
According to #2515 the ubuntu software center does not allow to install
debian packages which are not lintian compatible
I worked on the package and made it lintian compatible by doing
* Ignoring errors about arch dependent binaries as we will not split
this package. The arch dependent libraries are used correctly.
* Added a copyright file pointing to the apache license in debian
Closes#2515Closes#2320
In order to ensure that configuration files do not get overwritten when
upgrading an RPM, it is not sufficient to mark them as configuration. You
have to use the 'noreplace' parameter to make sure, they are never
overwritten. Added this parameter for the /etc/elasticsearch directory
as well as the /etc/sysconfig/elasticsearch file.
In addition, the post remove script now only deletes the user in case of
a package removal (and does nothing on package upgrade).
Closes#3123
This patch makes mvn eclipse:eclipse generate additional eclipse configuration
files so that Eclipse:
- uses Java 1.6 compliance level,
- truncates lines after 140 chars,
- uses 4 spaces for indentation,
- automatically adds a license header when creating a new class file,
- organizes imports the same way as Intellij Idea (which makes sense I guess
since most of the code bas has been written with Intellij, this will prevent
from having large diffs due to the fact that the order of imports has
changed).
This commit integrates the forbiddenAPI checks that checks
Java byte code against a list of "forbidden" API signatures.
The commit also contains the fixes of the current source code
that didn't pass the default API checks.
See https://code.google.com/p/forbidden-apis/ for details.
Closes#3059
This Lucene Release introduced a new API on DocIdSetIterator that requires each
implementation to return a `cost` upperbound as a function of the iterated documents.
This API allows for several optimizations during query execution especially in
Conjunction and Disjunction Queries with min_should_match set.
Closes#2990
Note: This has been disabled by default and is therefore not included in a
standard build. The main reason for this is, that you need to have a RPM
binary and the rpm development packages installed, which is not the case
on many systems.
The package contains an init.d-script as well as systemd configurations.
You can build your own RPM package simply by running 'maven rpm:rpm'
# Suggest feature
The suggest feature suggests similar looking terms based on a provided text by using a suggester. At the moment there the only supported suggester is `fuzzy`. The suggest feature is available since version `0.21.0`.
# Fuzzy suggester
The `fuzzy` suggester suggests terms based on edit distance. The provided suggest text is analyzed before terms are suggested. The suggested terms are provided per analyzed suggest text token. The `fuzzy` suggester doesn't take the query into account that is part of request.
# Suggest API
The suggest request part is defined along side the query part as top field in the json request.
```
curl -s -XPOST 'localhost:9200/_search' -d '{
"query" : {
...
},
"suggest" : {
...
}
}'
```
Several suggestions can be specified per request. Each suggestion is identified with an arbitary name. In the example below two suggestions are requested. The `my-suggest-1` suggestion uses the `body` field and `my-suggest-2` uses the `title` field. The `type` field is a required field and defines what suggester to use for a suggestion.
```
"suggest" : {
"suggestions" : {
"my-suggest-1" : {
"type" : "fuzzy",
"field" : "body",
"text" : "the amsterdma meetpu"
},
"my-suggest-2" : {
"type" : "fuzzy",
"field" : "title",
"text" : "the rottredam meetpu"
}
}
}
```
The below suggest response example includes the suggestions part for `my-suggest-1` and `my-suggest-2`. Each suggestion part contains a terms array, that contains all terms outputted by the analyzed suggest text. Each term object includes the term itself, the original start and end offset in the suggest text and if found an arbitary number of suggestions.
```
{
...
"suggest": {
"my-suggest-1": {
"terms" : [
{
"term" : "amsterdma",
"start_offset": 5,
"end_offset": 14,
"suggestions": [
...
]
}
...
]
},
"my-suggest-2" : {
"terms" : [
...
]
}
}
```
Each suggestions array contains a suggestion object that includes the suggested term, its document frequency and score compared to the suggest text term. The meaning of the score depends on the used suggester. The fuzzy suggester's score is based on the edit distance.
```
"suggestions": [
{
"term": "amsterdam",
"frequency": 77,
"score": 0.8888889
},
...
]
```
# Global suggest text
To avoid repitition of the suggest text, it is possible to define a global text. In the example below the suggest text is a global option and applies to the `my-suggest-1` and `my-suggest-2` suggestions.
```
"suggest" : {
"suggestions" : {
"text" : "the amsterdma meetpu",
"my-suggest-1" : {
"type" : "fuzzy",
"field" : "title"
},
"my-suggest-2" : {
"type" : "fuzzy",
"field" : "body"
}
}
}
```
The suggest text can be specied as global option or as suggestion specific option. The suggest text specified on suggestion level override the suggest text on the global level.
# Other suggest example.
In the below example we request suggestions for the following suggest text: `devloping distibutd saerch engies` on the `title` field with a maximum of 3 suggestions per term inside the suggest text. Note that in this example we use the `count` search type. This isn't required, but a nice optimalization. The suggestions are gather in the `query` phase and in the case that we only care about suggestions (so no hits) we don't need to execute the `fetch` phase.
```
curl -s -XPOST 'localhost:9200/_search?search_type=count' -d '{
"suggest" : {
"suggestions" : {
"my-title-suggestions" : {
"suggester" : "fuzzy",
"field" : "title",
"text" : "devloping distibutd saerch engies",
"size" : 3
}
}
}
}'
```
The above request could yield the response as stated in the code example below. As you can see if we take the first suggested term of each suggest text term we get `developing distributed search engines` as result.
```
{
...
"suggest": {
"my-title-suggestions": {
"terms": [
{
"term": "devloping",
"start_offset": 0,
"end_offset": 9,
"suggestions": [
{
"term": "developing",
"frequency": 77,
"score": 0.8888889
},
{
"term": "deloping",
"frequency": 1,
"score": 0.875
},
{
"term": "deploying",
"frequency": 2,
"score": 0.7777778
}
]
},
{
"term": "distibutd",
"start_offset": 10,
"end_offset": 19,
"suggestions": [
{
"term": "distributed",
"frequency": 217,
"score": 0.7777778
},
{
"term": "disributed",
"frequency": 1,
"score": 0.7777778
},
{
"term": "distribute",
"frequency": 1,
"score": 0.7777778
}
]
},
{
"term": "saerch",
"start_offset": 20,
"end_offset": 26,
"suggestions": [
{
"term": "search",
"frequency": 1038,
"score": 0.8333333
},
{
"term": "smerch",
"frequency": 3,
"score": 0.8333333
},
{
"term": "serch",
"frequency": 2,
"score": 0.8
}
]
},
{
"term": "engies",
"start_offset": 27,
"end_offset": 33,
"suggestions": [
{
"term": "engines",
"frequency": 568,
"score": 0.8333333
},
{
"term": "engles",
"frequency": 3,
"score": 0.8333333
},
{
"term": "eggies",
"frequency": 1,
"score": 0.8333333
}
]
}
]
}
}
...
}
```
# Common suggest options:
* `suggester` - The suggester implementation type. The only supported value is 'fuzzy'. This is a required option.
* `text` - The suggest text. The suggest text is a required option that needs to be set globally or per suggestion.
# Common fuzzy suggest options
* `field` - The field to fetch the candidate suggestions from. This is an required option that either needs to be set globally or per suggestion.
* `analyzer` - The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* `size` - The maximum corrections to be returned per suggest text token.
* `sort` - Defines how suggestions should be sorted per suggest text term. Two possible value:
** `score` - Sort by sore first, then document frequency and then the term itself.
** `frequency` - Sort by document frequency first, then simlarity score and then the term itself.
* `suggest_mode` - The suggest mode controls what suggestions are included or controls for what suggest text terms, suggestions should be suggested. Three possible values can be specified:
** `missing` - Only suggest terms in the suggest text that aren't in the index. This is the default.
** `popular` - Only suggest suggestions that occur in more docs then the original suggest text term.
** `always` - Suggest any matching suggestions based on terms in the suggest text.
# Other fuzzy suggest options:
* `lowercase_terms` - Lower cases the suggest text terms after text analyzation.
* `max_edits` - The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value result in an bad request error being thrown. Defaults to 2.
* `min_prefix` - The number of minimal prefix characters that must match in order be a candidate suggestions. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur in the beginning of terms.
* `min_query_length` - The minimum length a suggest text term must have in order to be included. Defaults to 4.
* `shard_size` - Sets the maximum number of suggestions to be retrieved from each individual shard. During the reduce phase only the top N suggestions are returned based on the `size` option. Defaults to the `size` option. Setting this to a value higher than the `size` can be useful in order to get a more accurate document frequency for spelling corrections at the cost of performance. Due to the fact that terms are partitioned amongst shards, the shard level document frequencies of spelling corrections may not be precise. Increasing this will make these document frequencies more precise.
* `max_inspections` - A factor that is used to multiply with the `shards_size` in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
* `threshold_frequency` - The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.
* `max_query_frequency` - The maximum threshold in number of documents a sugges text token can exist in order to be included. Can be a relative percentage number (e.g 0.4) or an absolute number to represent document frequencies. If an value higher than 1 is specified then fractional can not be specified. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked. High frequency terms are usually spelled correctly on top of this this also improves the spellcheck performance. The shard level document frequencies are used for this option.
Closes#2585
* Removed CustmoMemoryIndex in favor of MemoryIndex which as of 4.1 supports adding the same field twice
* Replaced duplicated logic in X[*]FSDirectory for rate limiting with a RateLimitedFSDirectory wrapper
* Remove hacks to find out merge context in rate limiting in favor of IOContext
* replaced Scorer#freq() return type (from float to int)
* Upgraded FVHighlighter to new 'centered' highlighting
* Fixed RobinEngine to use seperate setCommitData
Embed the code now in our source, since jsr166e jar generation with 1.6 instead of 1.7 is complicated when doing it on its own as it relies on ThreadLocalRandom, and we have it in jsr166y