Docs: Warning about the conflict with the Standard Tokenizer

The examples given require a specific tokenizer to work.

Closes: 10645
This commit is contained in:
Benoit Delbosc 2015-04-17 14:59:24 +02:00 committed by Clinton Gormley
parent 60721b2a17
commit 4a94e1f14b
1 changed file with 17 additions and 11 deletions


@@ -16,27 +16,27 @@ ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
Parameters include:
`generate_word_parts`::
If `true` causes parts of words to be
generated: "PowerShot" => "Power" "Shot". Defaults to `true`.
`generate_number_parts`::
If `true` causes number subwords to be
generated: "500-42" => "500" "42". Defaults to `true`.
`catenate_words`::
If `true` causes maximum runs of word parts to be
catenated: "wi-fi" => "wifi". Defaults to `false`.
`catenate_numbers`::
If `true` causes maximum runs of number parts to
be catenated: "500-42" => "50042". Defaults to `false`.
`catenate_all`::
If `true` causes all subword parts to be catenated:
"wi-fi-4000" => "wifi4000". Defaults to `false`.
`split_on_case_change`::
If `true` causes "PowerShot" to be two tokens;
("Power-Shot" remains two parts regards). Defaults to `true`.
@@ -44,29 +44,29 @@ Parameters include:
If `true` includes original words in subwords:
"500-42" => "500-42" "500" "42". Defaults to `false`.
`split_on_numerics`::
If `true` causes "j2se" to be three tokens; "j"
"2" "se". Defaults to `true`.
`stem_english_possessive`::
If `true` causes trailing "'s" to be
removed for each subword: "O'Neil's" => "O", "Neil". Defaults to `true`.
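
For illustration, several of these parameters can be combined in a single filter definition in the index settings. A minimal sketch (the filter name `my_word_delimiter` is only an example):

[source,js]
--------------------------------------------------
{
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "generate_word_parts": true,
                    "catenate_words": true,
                    "split_on_case_change": false,
                    "preserve_original": true
                }
            }
        }
    }
}
--------------------------------------------------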
Advanced settings include:
`protected_words`::
A list of words to protect from being delimited.
Either an array, or set `protected_words_path`, which resolves
to a file containing the protected words (one per line).
The path automatically resolves to a `config/`-based location if it exists.
`type_table`::
A custom type mapping table, for example (when configured
using `type_table_path`):
[source,js]
--------------------------------------------------
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
@@ -78,3 +78,9 @@ Advanced settings include:
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
--------------------------------------------------
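
The mapping above would typically be stored in a file and referenced through `type_table_path`. A rough sketch, assuming the table is saved as `analysis/type_table.txt` under the `config/` directory (both the path and the filter name are illustrative):

[source,js]
--------------------------------------------------
{
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "type_table_path": "analysis/type_table.txt"
                }
            }
        }
    }
}
--------------------------------------------------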
NOTE: Using a tokenizer like the `standard` tokenizer may interfere with
the `catenate_*` and `preserve_original` parameters, as the original
string may already have lost punctuation during tokenization. Instead,
you may want to use the `whitespace` tokenizer.
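
To avoid this interference, the filter can be wired into a custom analyzer that uses the `whitespace` tokenizer. A minimal sketch (the analyzer and filter names are illustrative):

[source,js]
--------------------------------------------------
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["my_word_delimiter"]
                }
            },
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "catenate_words": true,
                    "preserve_original": true
                }
            }
        }
    }
}
--------------------------------------------------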