340 lines
10 KiB
Plaintext
340 lines
10 KiB
Plaintext
[[docs-termvectors]]
|
|
== Term Vectors
|
|
|
|
Returns information and statistics on terms in the fields of a particular
|
|
document. The document could be stored in the index or artificially provided
|
|
by the user. Term vectors are <<realtime,realtime>> by default, not near
|
|
realtime. This can be changed by setting `realtime` parameter to `false`.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'
|
|
--------------------------------------------------
|
|
|
|
Optionally, you can specify the fields for which the information is
|
|
retrieved either with a parameter in the url
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?fields=text,...'
|
|
--------------------------------------------------
|
|
|
|
or by adding the requested fields in the request body (see
|
|
example below). Fields can also be specified with wildcards
|
|
in similar way to the <<query-dsl-multi-match-query,multi match query>>
|
|
|
|
[WARNING]
|
|
Note that the usage of `/_termvector` is deprecated in 2.0, and replaced by `/_termvectors`.
|
|
|
|
[float]
|
|
=== Return values
|
|
|
|
Three types of values can be requested: _term information_, _term statistics_
|
|
and _field statistics_. By default, all term information and field
|
|
statistics are returned for all fields but no term statistics.
|
|
|
|
[float]
|
|
==== Term information
|
|
|
|
* term frequency in the field (always returned)
|
|
* term positions (`positions` : true)
|
|
* start and end offsets (`offsets` : true)
|
|
* term payloads (`payloads` : true), as base64 encoded bytes
|
|
|
|
If the requested information wasn't stored in the index, it will be
|
|
computed on the fly if possible. Additionally, term vectors could be computed
|
|
for documents not even existing in the index, but instead provided by the user.
|
|
|
|
[WARNING]
|
|
======
|
|
Start and end offsets assume UTF-16 encoding is being used. If you want to use
|
|
these offsets in order to get the original text that produced this token, you
|
|
should make sure that the string you are taking a sub-string of is also encoded
|
|
using UTF-16.
|
|
======
|
|
|
|
[float]
|
|
==== Term statistics
|
|
|
|
Setting `term_statistics` to `true` (default is `false`) will
|
|
return
|
|
|
|
* total term frequency (how often a term occurs in all documents) +
|
|
* document frequency (the number of documents containing the current
|
|
term)
|
|
|
|
By default these values are not returned since term statistics can
|
|
have a serious performance impact.
|
|
|
|
[float]
|
|
==== Field statistics
|
|
|
|
Setting `field_statistics` to `false` (default is `true`) will
|
|
omit :
|
|
|
|
* document count (how many documents contain this field)
|
|
* sum of document frequencies (the sum of document frequencies for all
|
|
terms in this field)
|
|
* sum of total term frequencies (the sum of total term frequencies of
|
|
each term in this field)
|
|
|
|
[float]
|
|
==== Distributed frequencies coming[2.0]
|
|
|
|
Setting `dfs` to `true` (default is `false`) will return the term statistics
|
|
or the field statistics of the entire index, and not just at the shard. Use it
|
|
with caution as distributed frequencies can have a serious performance impact.
|
|
|
|
[float]
|
|
=== Behaviour
|
|
|
|
The term and field statistics are not accurate. Deleted documents
|
|
are not taken into account. The information is only retrieved for the
|
|
shard the requested document resides in, unless `dfs` is set to `true`.
|
|
The term and field statistics are therefore only useful as relative measures
|
|
whereas the absolute numbers have no meaning in this context. By default,
|
|
when requesting term vectors of artificial documents, a shard to get the statistics
|
|
from is randomly selected. Use `routing` only to hit a particular shard.
|
|
|
|
[float]
|
|
=== Example 1
|
|
|
|
First, we create an index that stores term vectors, payloads etc. :
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
|
|
"mappings": {
|
|
"tweet": {
|
|
"properties": {
|
|
"text": {
|
|
"type": "string",
|
|
"term_vector": "with_positions_offsets_payloads",
|
|
"store" : true,
|
|
"analyzer" : "fulltext_analyzer"
|
|
},
|
|
"fullname": {
|
|
"type": "string",
|
|
"term_vector": "with_positions_offsets_payloads",
|
|
"analyzer" : "fulltext_analyzer"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"settings" : {
|
|
"index" : {
|
|
"number_of_shards" : 1,
|
|
"number_of_replicas" : 0
|
|
},
|
|
"analysis": {
|
|
"analyzer": {
|
|
"fulltext_analyzer": {
|
|
"type": "custom",
|
|
"tokenizer": "whitespace",
|
|
"filter": [
|
|
"lowercase",
|
|
"type_as_payload"
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
Second, we add some documents:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
|
|
"fullname" : "John Doe",
|
|
"text" : "twitter test test test "
|
|
}'
|
|
|
|
curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
|
|
"fullname" : "Jane Doe",
|
|
"text" : "Another twitter test ..."
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
The following request returns all information and statistics for field
|
|
`text` in document `1` (John Doe):
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
|
|
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true' -d '{
|
|
"fields" : ["text"],
|
|
"offsets" : true,
|
|
"payloads" : true,
|
|
"positions" : true,
|
|
"term_statistics" : true,
|
|
"field_statistics" : true
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
|
|
{
|
|
"_id": "1",
|
|
"_index": "twitter",
|
|
"_type": "tweet",
|
|
"_version": 1,
|
|
"found": true,
|
|
"term_vectors": {
|
|
"text": {
|
|
"field_statistics": {
|
|
"doc_count": 2,
|
|
"sum_doc_freq": 6,
|
|
"sum_ttf": 8
|
|
},
|
|
"terms": {
|
|
"test": {
|
|
"doc_freq": 2,
|
|
"term_freq": 3,
|
|
"tokens": [
|
|
{
|
|
"end_offset": 12,
|
|
"payload": "d29yZA==",
|
|
"position": 1,
|
|
"start_offset": 8
|
|
},
|
|
{
|
|
"end_offset": 17,
|
|
"payload": "d29yZA==",
|
|
"position": 2,
|
|
"start_offset": 13
|
|
},
|
|
{
|
|
"end_offset": 22,
|
|
"payload": "d29yZA==",
|
|
"position": 3,
|
|
"start_offset": 18
|
|
}
|
|
],
|
|
"ttf": 4
|
|
},
|
|
"twitter": {
|
|
"doc_freq": 2,
|
|
"term_freq": 1,
|
|
"tokens": [
|
|
{
|
|
"end_offset": 7,
|
|
"payload": "d29yZA==",
|
|
"position": 0,
|
|
"start_offset": 0
|
|
}
|
|
],
|
|
"ttf": 2
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
=== Example 2
|
|
|
|
Term vectors which are not explicitly stored in the index are automatically
|
|
computed on the fly. The following request returns all information and statistics for the
|
|
fields in document `1`, even though the terms haven't been explicitly stored in the index.
|
|
Note that for the field `text`, the terms are not re-generated.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true' -d '{
|
|
"fields" : ["text", "some_field_without_term_vectors"],
|
|
"offsets" : true,
|
|
"positions" : true,
|
|
"term_statistics" : true,
|
|
"field_statistics" : true
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
[[docs-termvectors-artificial-doc]]
|
|
=== Example 3
|
|
|
|
Term vectors can also be generated for artificial documents,
|
|
that is for documents not present in the index. The syntax is similar to the
|
|
<<search-percolate,percolator>> API. For example, the following request would
|
|
return the same results as in example 1. The mapping used is determined by the
|
|
`index` and `type`.
|
|
|
|
[WARNING]
|
|
======
|
|
If dynamic mapping is turned on (default), the document fields not in the original
|
|
mapping will be dynamically created.
|
|
======
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'http://localhost:9200/twitter/tweet/_termvectors' -d '{
|
|
"doc" : {
|
|
"fullname" : "John Doe",
|
|
"text" : "twitter test test test"
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
[[docs-termvectors-per-field-analyzer]]
|
|
=== Example 4
|
|
|
|
Additionally, a different analyzer than the one at the field may be provided
|
|
by using the `per_field_analyzer` parameter. This is useful in order to
|
|
generate term vectors in any fashion, especially when using artificial
|
|
documents. When providing an analyzer for a field that already stores term
|
|
vectors, the term vectors will be re-generated.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'http://localhost:9200/twitter/tweet/_termvectors' -d '{
|
|
"doc" : {
|
|
"fullname" : "John Doe",
|
|
"text" : "twitter test test test"
|
|
},
|
|
"fields": ["fullname"],
|
|
"per_field_analyzer" : {
|
|
"fullname": "keyword"
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"_index": "twitter",
|
|
"_type": "tweet",
|
|
"_version": 0,
|
|
"found": true,
|
|
"term_vectors": {
|
|
"fullname": {
|
|
"field_statistics": {
|
|
"sum_doc_freq": 1,
|
|
"doc_count": 1,
|
|
"sum_ttf": 1
|
|
},
|
|
"terms": {
|
|
"John Doe": {
|
|
"term_freq": 1,
|
|
"tokens": [
|
|
{
|
|
"position": 0,
|
|
"start_offset": 0,
|
|
"end_offset": 8
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|