[[token-graphs]]
=== Token graphs

When a <<analyzer-anatomy-tokenizer,tokenizer>> converts text into a stream of
tokens, it also records the following:

* The `position` of each token in the stream
* The `positionLength`, the number of positions that a token spans

Using these, you can create a
https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph],
called a _token graph_, for a stream. In a token graph, each position represents
a node. Each token represents an edge or arc, pointing to the next position.

image::images/analysis/token-graph-qbf-ex.svg[align="center"]
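
The mapping from tokens to graph edges can be sketched in a few lines of Python.
This is only an illustration, not an OpenSearch or Lucene API: the `Token` class
and the `build_graph` helper are hypothetical, and the sample text
`quick brown fox` is assumed. Each token becomes an edge that starts at its
`position` and ends at `position + positionLength`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for one token in an analyzed stream."""
    term: str
    position: int
    position_length: int = 1  # a token spans one position by default


def build_graph(tokens):
    """Map each start node (position) to its outgoing (term, end node) edges."""
    graph = defaultdict(list)
    for tok in tokens:
        end_node = tok.position + tok.position_length
        graph[tok.position].append((tok.term, end_node))
    return dict(graph)


# Assumed sample stream: three tokens, one per position.
tokens = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
graph = build_graph(tokens)

for node in sorted(graph):
    print(node, graph[node])
```

In this sketch, a synonym such as `fast` at position `0` would simply add a
second edge leaving node `0`.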

[[token-graphs-synonyms]]
==== Synonyms

Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like
synonyms, to an existing token stream. These synonyms often span the same
positions as existing tokens.

In the following graph, `quick` and its synonym `fast` both have a position of
`0`. They span the same positions.

image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]

[[token-graphs-multi-position-tokens]]
==== Multi-position tokens

Some token filters can add tokens that span multiple positions. These can
include tokens for multi-word synonyms, such as using "atm" as a synonym for
"automatic teller machine."

However, only some token filters, known as _graph token filters_, accurately
record the `positionLength` for multi-position tokens. These filters include:

* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>>
* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>>

In the following graph, `domain name system` and its synonym, `dns`, both have a
position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
the graph have a default `positionLength` of `1`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

[[token-graphs-token-graphs-search]]
===== Using token graphs for search 

<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute
and does not support token graphs containing multi-position tokens.

However, queries, such as the <<query-dsl-match-query,`match`>> or
<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to
generate multiple sub-queries from a single query string.
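
Because indexing ignores `positionLength`, one common arrangement is to apply a
graph token filter only at search time. The following Python sketch builds a
possible index creation body as a plain dictionary. The filter name
`dns_synonyms`, the analyzer name `dns_search`, and the field name `body` are
hypothetical, and the synonym rule is assumed for illustration. Adapt the body
to your own index before using it.

```python
import json

# Hypothetical names: "dns_synonyms", "dns_search", and the "body" field.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "dns_synonyms": {
                    # Graph token filter that records positionLength.
                    "type": "synonym_graph",
                    "synonyms": ["dns, domain name system"],
                }
            },
            "analyzer": {
                "dns_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "dns_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                # Index without synonyms; expand them only when searching.
                "analyzer": "standard",
                "search_analyzer": "dns_search",
            }
        }
    },
}

# Serialize the body so it can be sent with any HTTP client.
print(json.dumps(index_body, indent=2))
```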

.*Example*
[%collapsible]
====

A user runs a search for the following phrase using the `match_phrase` query:

`domain name system is fragile`

During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for
`domain name system`, is added to the query string's token stream. The `dns`
token has a `positionLength` of `3`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

The `match_phrase` query uses this graph to generate sub-queries for the
following phrases:

[source,text]
------
dns is fragile
domain name system is fragile
------

This means the query matches documents containing either `dns is fragile` _or_
`domain name system is fragile`.
====
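
The way a token graph yields several phrases can be sketched by walking every
path from the first node to the last. This is a simplified model for
illustration only, not the query-building code OpenSearch runs. The `phrases`
helper is hypothetical, and each token is written as an assumed
`(term, position, positionLength)` tuple.

```python
from collections import defaultdict


def phrases(tokens):
    """Return the phrase spelled by each start-to-end path in the token graph.

    tokens: iterable of (term, position, position_length) tuples.
    """
    edges = defaultdict(list)
    end = 0
    for term, position, length in tokens:
        edges[position].append((term, position + length))
        end = max(end, position + length)

    def walk(node):
        # Reaching the final node completes one path.
        if node == end:
            return [[]]
        return [
            [term] + rest
            for term, next_node in edges[node]
            for rest in walk(next_node)
        ]

    return [" ".join(path) for path in walk(0)]


tokens = [
    ("dns", 0, 3),  # multi-position synonym spanning positions 0 to 2
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]

result = phrases(tokens)
for phrase in result:
    print(phrase)
```

Under these assumptions, the two paths through the graph correspond to the two
phrases listed in the example above.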

[[token-graphs-invalid-token-graphs]]
===== Invalid token graphs

The following token filters can add tokens that span multiple positions but
only record a default `positionLength` of `1`:

* <<analysis-synonym-tokenfilter,`synonym`>>
* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>>

This means these filters will produce invalid token graphs for streams
containing such tokens.

In the following graph, `dns` is a multi-position synonym for `domain name
system`. However, `dns` has the default `positionLength` value of `1`, resulting
in an invalid graph.

image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]

Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
search results.
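
To see why an invalid graph is a problem, compare the phrases that a simple
path walk produces for the valid and invalid versions of the `dns` graph. This
is an illustrative model, not OpenSearch's actual query builder. The `phrases`
helper and the `(term, position, positionLength)` tuples are hypothetical.

```python
from collections import defaultdict


def phrases(tokens):
    """Return the phrase spelled by each start-to-end path in the token graph."""
    edges = defaultdict(list)
    end = 0
    for term, position, length in tokens:
        edges[position].append((term, position + length))
        end = max(end, position + length)

    def walk(node):
        if node == end:
            return [[]]
        return [
            [term] + rest
            for term, next_node in edges[node]
            for rest in walk(next_node)
        ]

    return [" ".join(path) for path in walk(0)]


shared = [
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]

# Graph token filter: `dns` spans all three positions of the original phrase.
valid = phrases([("dns", 0, 3)] + shared)

# Non-graph filter: `dns` keeps the default positionLength of 1.
invalid = phrases([("dns", 0, 1)] + shared)

print(valid)
print(invalid)
```

In this model, the invalid graph treats `dns` as a replacement for `domain`
alone, so the walk yields `dns name system is fragile` instead of
`dns is fragile`.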

// Commit: [DOCS] Add token graph concept docs (#53339), 2020-03-19 07:42:26 -04:00
// Adds conceptual docs for token graphs. These docs cover:
// * How a token graph is constructed from a token stream
// * How synonyms and multi-position tokens impact token graphs
// * How token graphs are used during search
// * Why some token filters produce invalid token graphs
// Also makes the following supporting changes:
// * Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
// * Adds several SVGs for token graph diagrams