[role="xpack"]
[[ml-configuring-transform]]
=== Transforming data with script fields

If you use {dfeeds}, you can add scripts to transform your data before
it is analyzed. {dfeeds-cap} contain an optional `script_fields` property, where
you can specify scripts that evaluate custom expressions and return script
fields.

If your {dfeed} defines script fields, you can use those fields in your job.
For example, you can use the script fields in the analysis functions in one or
more detectors.

* <<ml-configuring-transform1>>
* <<ml-configuring-transform2>>
* <<ml-configuring-transform3>>
* <<ml-configuring-transform4>>
* <<ml-configuring-transform5>>
* <<ml-configuring-transform6>>
* <<ml-configuring-transform7>>
* <<ml-configuring-transform8>>
* <<ml-configuring-transform9>>

The following indices APIs create and add content to an index that is used in
subsequent examples:

[source,js]
----------------------------------
PUT /my_index?include_type_name=true
{
  "mappings":{
    "_doc":{
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "aborted_count": {
          "type": "long"
        },
        "another_field": {
          "type": "keyword" <1>
        },
        "clientip": {
          "type": "keyword"
        },
        "coords": {
          "properties": {
            "lat": {
              "type": "keyword"
            },
            "lon": {
              "type": "keyword"
            }
          }
        },
        "error_count": {
          "type": "long"
        },
        "query": {
          "type": "keyword"
        },
        "some_field": {
          "type": "keyword"
        },
        "tokenstring1":{
          "type":"keyword"
        },
        "tokenstring2":{
          "type":"keyword"
        },
        "tokenstring3":{
          "type":"keyword"
        }
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "@timestamp":"2017-03-23T13:00:00",
  "error_count":36320,
  "aborted_count":4156,
  "some_field":"JOE",
  "another_field":"SMITH ",
  "tokenstring1":"foo-bar-baz",
  "tokenstring2":"foo bar baz",
  "tokenstring3":"foo-bar-19",
  "query":"www.ml.elastic.co",
  "clientip":"123.456.78.900",
  "coords": {
    "lat" : 41.44,
    "lon":90.5
  }
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]
<1> In this example, string fields are mapped as `keyword` fields to support
aggregation. If you want both a full text (`text`) and a keyword (`keyword`)
version of the same field, use multi-fields. For more information, see
{ref}/multi-fields.html[fields].
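
For instance, a minimal multi-field mapping might look like the following
sketch (the index name `my_index_2` is hypothetical; the examples in this
section use plain `keyword` mappings rather than multi-fields). Aggregations
and script fields can then reference the keyword version of the field as
`another_field.keyword`:

[source,js]
----------------------------------
PUT /my_index_2?include_type_name=true
{
  "mappings": {
    "_doc": {
      "properties": {
        "another_field": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE
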
[[ml-configuring-transform1]]
.Example 1: Adding two numerical fields
[source,js]
----------------------------------
PUT _ml/anomaly_detectors/test1
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"mean",
        "field_name": "total_error_count", <1>
        "detector_description": "Custom script field transformation"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test1
{
  "job_id": "test1",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "total_error_count": { <2>
      "script": {
        "lang": "expression",
        "inline": "doc['error_count'].value + doc['aborted_count'].value"
      }
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:needs-licence]
<1> A script field named `total_error_count` is referenced in the detector
within the job.
<2> The script field is defined in the {dfeed}.

This `test1` job contains a detector that uses a script field in a `mean`
analysis function. The `datafeed-test1` {dfeed} defines the script field. It
contains a script that adds two fields in the document to produce a "total"
error count.

The syntax for the `script_fields` property is identical to that used by {es}.
For more information, see {ref}/search-request-script-fields.html[Script Fields].

You can preview the contents of the {dfeed} by using the following API:

[source,js]
----------------------------------
GET _ml/datafeeds/datafeed-test1/_preview
----------------------------------
// CONSOLE
// TEST[skip:continued]

In this example, the API returns the following results, which contain a sum of
the `error_count` and `aborted_count` values:

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "total_error_count": 40476
  }
]
----------------------------------

NOTE: This example demonstrates how to use script fields, but it contains
insufficient data to generate meaningful results. For a full demonstration of
how to create jobs with sample data, see <<ml-getting-started>>.

You can alternatively use {kib} to create an advanced job that uses script
fields. To add the `script_fields` property to your {dfeed}, you must use the
**Edit JSON** tab. For example:

[role="screenshot"]
image::images/ml-scriptfields.jpg[Adding script fields to a {dfeed} in {kib}]

[[ml-configuring-transform-examples]]
==== Common script field examples

While the possibilities are limitless, there are a number of common scenarios
where you might use script fields in your {dfeeds}.

[NOTE]
===============================
Some of these examples use regular expressions. By default, regular
expressions are disabled because they circumvent the protection that Painless
provides against long-running and memory-hungry scripts. For more information,
see {ref}/modules-scripting-painless.html[Painless Scripting Language]. A
configuration sketch for enabling them follows this note.

Machine learning analysis is case sensitive. For example, "John" is considered
to be different from "john". This is one reason you might consider using scripts
that convert your strings to uppercase or lowercase letters.
===============================
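
To run the regular expression examples below (Examples 6 and 7), regular
expression support must be turned on first. The following sketch shows one way
to do this, assuming the static `script.painless.regex.enabled` setting, which
goes in `elasticsearch.yml` on each node and takes effect after a node restart:

[source,yaml]
--------------------------------------------------
script.painless.regex.enabled: true
--------------------------------------------------
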
[[ml-configuring-transform2]]
.Example 2: Concatenating strings
[source,js]
--------------------------------------------------
PUT _ml/anomaly_detectors/test2
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"low_info_content",
        "field_name":"my_script_field", <1>
        "detector_description": "Custom script field transformation"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test2
{
  "job_id": "test2",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "doc['some_field'].value + '_' + doc['another_field'].value" <2>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:needs-licence]
<1> The script field has a rather generic name in this case, since it will
be used for various tests in the subsequent examples.
<2> The script field uses the plus (`+`) operator to concatenate strings.

The preview {dfeed} API returns the following results, which show that "JOE"
and "SMITH " have been concatenated and an underscore was added:

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "JOE_SMITH "
  }
]
----------------------------------
[[ml-configuring-transform3]]
.Example 3: Trimming strings
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "doc['another_field'].value.trim()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]
<1> This script field uses the `trim()` function to remove leading and trailing
white space from a string.

The preview {dfeed} API returns the following results, which show that "SMITH "
has been trimmed to "SMITH":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "SMITH"
  }
]
----------------------------------
[[ml-configuring-transform4]]
.Example 4: Converting strings to lowercase
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "doc['some_field'].value.toLowerCase()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]
<1> This script field uses the `toLowerCase()` function to convert a string to
all lowercase letters. Likewise, you can use the `toUpperCase()` function to
convert a string to all uppercase letters (see the sketch after the preview
results).

The preview {dfeed} API returns the following results, which show that "JOE"
has been converted to "joe":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "joe"
  }
]
----------------------------------
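
The uppercase variant is the same update with `toUpperCase()` substituted, as
in the following sketch, which reuses `datafeed-test2`. With this sample data
it would return "JOE" unchanged, because the stored value is already uppercase.

[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "doc['some_field'].value.toUpperCase()"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
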
[[ml-configuring-transform5]]
.Example 5: Converting strings to mixed case formats
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "doc['some_field'].value.substring(0, 1).toUpperCase() + doc['some_field'].value.substring(1).toLowerCase()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]
<1> This script field is a more complicated example of case manipulation. It
uses the `substring()` function to capitalize the first letter of a string and
convert the remaining characters to lowercase.

The preview {dfeed} API returns the following results, which show that "JOE"
has been converted to "Joe":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "Joe"
  }
]
----------------------------------
[[ml-configuring-transform6]]
.Example 6: Replacing tokens
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "/\\s/.matcher(doc['tokenstring2'].value).replaceAll('_')" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]
<1> This script field uses regular expressions to replace white
space with underscores.

The preview {dfeed} API returns the following results, which show that
"foo bar baz" has been converted to "foo_bar_baz":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "foo_bar_baz"
  }
]
----------------------------------
[[ml-configuring-transform7]]
.Example 7: Regular expression matching and concatenation
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "inline": "def m = /(.*)-bar-([0-9][0-9])/.matcher(doc['tokenstring3'].value); return m.find() ? m.group(1) + '_' + m.group(2) : '';" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]
<1> This script field looks for a specific regular expression pattern and emits
the matched groups as a concatenated string. If no match is found, it emits an
empty string.

The preview {dfeed} API returns the following results, which show that
"foo-bar-19" has been converted to "foo_19":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "foo_19"
  }
]
----------------------------------
[[ml-configuring-transform8]]
.Example 8: Splitting strings by domain name
[source,js]
--------------------------------------------------
PUT _ml/anomaly_detectors/test3
{
  "description":"DNS tunneling",
  "analysis_config":{
    "bucket_span": "30m",
    "influencers": ["clientip","hrd"],
    "detectors":[
      {
        "function":"high_info_content",
        "field_name": "sub",
        "over_field_name": "hrd",
        "exclude_frequent":"all"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test3
{
  "job_id": "test3",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields":{
    "sub":{
      "script":"return domainSplit(doc['query'].value).get(0);"
    },
    "hrd":{
      "script":"return domainSplit(doc['query'].value).get(1);"
    }
  }
}

GET _ml/datafeeds/datafeed-test3/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:needs-licence]
If you have a single field that contains a well-formed DNS domain name, you can
use the `domainSplit()` function to split the string into its highest registered
domain and the subdomain, which is everything to the left of the highest
registered domain. For example, the highest registered domain of
`www.ml.elastic.co` is `elastic.co` and the subdomain is `www.ml`. The
`domainSplit()` function returns an array of two values: the first value is the
subdomain; the second value is the highest registered domain.

The preview {dfeed} API returns the following results, which show that
"www.ml.elastic.co" has been split into "elastic.co" and "www.ml":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "clientip.keyword": "123.456.78.900",
    "hrd": "elastic.co",
    "sub": "www.ml"
  }
]
----------------------------------
[[ml-configuring-transform9]]
.Example 9: Transforming geo_point data
[source,js]
--------------------------------------------------
PUT _ml/anomaly_detectors/test4
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"lat_long",
        "field_name": "my_coordinates"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test4
{
  "job_id": "test4",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "my_coordinates": {
      "script": {
        "inline": "doc['coords.lat'].value + ',' + doc['coords.lon'].value",
        "lang": "painless"
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test4/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:needs-licence]
In {es}, location data can be stored in `geo_point` fields but this data type
is not supported natively in {ml} analytics. This example uses a script field
to transform the data into an appropriate format. For more information, see
<<ml-geo-functions>>.

The preview {dfeed} API returns the following results, which show that
`41.44` and `90.5` have been combined into "41.44,90.5":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_coordinates": "41.44,90.5"
  }
]
----------------------------------