Provide users with warning about special characters in query DSL and API query fields (#1255)

* fix#1196_Spesh-Chars

Signed-off-by: cwillum <cwmmoore@amazon.com>

Chris Moore 2022-09-27 13:29:19 -07:00 committed by GitHub
parent bbd0f1157b
commit 1b69f700f8
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 78 additions and 0 deletions

@ -120,3 +120,38 @@ With query DSL, however, you can include an HTTP request body to look for result
}
```
The OpenSearch query DSL comes in three varieties: term-level queries, full-text queries, and boolean queries. You can even perform more complicated searches by using different elements from each variety to find whatever data you need.
## A note on Unicode special characters in text fields
Because the standard analyzer splits text at the word boundaries defined by Unicode, it cannot index a [text field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/text/) value as a single whole value when that value contains a special character, such as a hyphen. Instead, the analyzer parses the value as multiple values separated by the special character, effectively tokenizing the elements on either side of it. This can lead to unintentional filtering of documents and potentially compromise control over their access.
The following examples show values containing a special character that the standard analyzer splits into separate tokens. Because of the hyphen/minus sign in each value, both values share the token `user`, so the analyzer does not reliably distinguish between the two users for `user.id` and a `match` query can treat them as one and the same:
```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-1"
      }
    }
  }
}
```
```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-2"
      }
    }
  }
}
```
To avoid this problem when using either query DSL or the REST API, you can use a custom analyzer or map the field as `keyword`, which indexes the value as a single token and performs an exact-match search. See [Keyword field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/keyword/) for the latter option.
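As a minimal sketch of the `keyword` option, the following request body maps `user.id` as a `keyword` field so that `User-1` and `User-2` are each indexed as a single, distinct value. It assumes the body is sent when the index is created, for example with `PUT <index-name>`; the index name is a placeholder and the field layout is hypothetical rather than taken from an existing mapping:

```json
{
  "mappings": {
    "properties": {
      "user": {
        "properties": {
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
```

With a mapping like this, a `match` or `term` query for `User-1` matches only documents whose `user.id` is exactly `User-1`, including letter case.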
For a list of characters that should be avoided when the field type is `text`, see [Word Boundaries](https://unicode.org/reports/tr29/#Word_Boundaries).
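To check how a particular value is tokenized before choosing a mapping, you can pass it to the `_analyze` API. The request body below is a sketch that assumes it is sent as `POST _analyze`; the sample value is the one used in the examples above:

```json
{
  "analyzer": "standard",
  "text": "User-1"
}
```

The response lists the tokens that the standard analyzer would index for a `text` field. For this value, they should be `user` and `1` rather than a single `User-1` token.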

@ -678,6 +678,15 @@ PUT _plugins/_security/api/roles/<role>
}
```
>Because the standard analyzer splits text at the word boundaries defined by Unicode, it cannot index a [text field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/text/) value as a single whole value when that value contains a special character, such as a hyphen. Instead, the analyzer parses the value as multiple values separated by the special character, effectively tokenizing the elements on either side of it.
>
>For example, the values in the fields `"user.id": "User-1"` and `"user.id": "User-2"` both contain the hyphen/minus sign. The analyzer splits each value at this special character, so both values share the token `user`, and a query can treat the two users for `user.id` as one and the same. This can lead to unintentional filtering of documents and potentially compromise control over their access.
>
>To avoid this problem, you can use a custom analyzer or map the field as `keyword`, which indexes the value as a single token and performs an exact-match search. See [Keyword field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/keyword/) for the latter option.
>
>For a list of characters that should be avoided when the field type is `text`, see [Word Boundaries](https://unicode.org/reports/tr29/#Word_Boundaries).
{: .warning}
### Patch role
Introduced 1.0

@ -55,6 +55,40 @@ PUT _plugins/_security/api/roles/public_data
These queries can be as complex as you want, but we recommend keeping them simple to minimize the performance impact that the document-level security feature has on the cluster.
{: .warning }
### A note on Unicode special characters in text fields
Because the standard analyzer splits text at the word boundaries defined by Unicode, it cannot index a [text field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/text/) value as a single whole value when that value contains a special character, such as a hyphen. Instead, the analyzer parses the value as multiple values separated by the special character, effectively tokenizing the elements on either side of it. This can lead to unintentional filtering of documents and potentially compromise control over their access.
The following examples show values containing a special character that the standard analyzer splits into separate tokens. Because of the hyphen/minus sign in each value, both values share the token `user`, so the analyzer does not reliably distinguish between the two users for `user.id` and a `match` query can treat them as one and the same:
```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-1"
      }
    }
  }
}
```
```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-2"
      }
    }
  }
}
```
To avoid this problem when using either query DSL or the REST API, you can use a custom analyzer or map the field as `keyword`, which indexes the value as a single token and performs an exact-match search. See [Keyword field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/keyword/) for the latter option.
For a list of characters that should be avoided when the field type is `text`, see [Word Boundaries](https://unicode.org/reports/tr29/#Word_Boundaries).
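As a sketch of the custom analyzer option, the following index settings define an analyzer that keeps the whole value as one token and lowercases it, then apply that analyzer to `user.id`. The analyzer name `whole_value_lowercase` and the field layout are hypothetical; the `keyword` tokenizer and the `lowercase` token filter are built in:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whole_value_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "user": {
        "properties": {
          "id": {
            "type": "text",
            "analyzer": "whole_value_lowercase"
          }
        }
      }
    }
  }
}
```

Unlike a plain `keyword` mapping, this keeps the field as `text` and makes matching case-insensitive, while `User-1` and `User-2` remain distinct tokens.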
## Parameter substitution