Enforce that responses in docs are valid json (#26249)

All of the snippets in our docs marked with `// TESTRESPONSE` are
checked against the response from Elasticsearch but, due to the
way they are implemented, they are actually parsed as YAML instead
of JSON. Luckily, all valid JSON is valid YAML! Unfortunately,
that means that invalid JSON has snuck into the examples!
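
For example, a bare word like `noop` is invalid JSON but valid YAML, where it
is read as a plain scalar string, so the YAML path accepts it silently (one of
the diffs below fixes exactly this). A minimal sketch of the difference, using
`groovy.yaml.YamlSlurper` (Groovy 3+, purely for illustration; the docs
build's actual YAML path may differ):
```
import groovy.json.JsonException
import groovy.json.JsonSlurper
import groovy.yaml.YamlSlurper // Groovy 3+; illustration only

// `noop` is a bare word: a plain scalar in YAML, a parse error in JSON
String snippet = '{ "result": noop }'

// A YAML parser happily reads the bare word as the string "noop"...
assert new YamlSlurper().parseText(snippet).result == 'noop'

// ...while a JSON parser rejects the very same text
try {
    new JsonSlurper().parseText(snippet)
    assert false : 'expected invalid JSON to fail parsing'
} catch (JsonException expected) {
    // this is the kind of error the new build step surfaces
}
```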

This adds a step during the build to parse them as JSON and fail
the build if they don't parse.

But no! It isn't quite that simple. The displayed text of some of
these responses looks like:
```
{
    ...
    "aggregations": {
        "range": {
            "buckets": [
                {
                    "to": 1.4436576E12,
                    "to_as_string": "10-2015",
                    "doc_count": 7,
                    "key": "*-10-2015"
                },
                {
                    "from": 1.4436576E12,
                    "from_as_string": "10-2015",
                    "doc_count": 0,
                    "key": "10-2015-*"
                }
            ]
        }
    }
}
```

Note the `...`, which isn't valid JSON, but we like it anyway and want
it in the output. We use substitution rules to convert the `...`
into the response we expect. That yields a response that looks like:
```
{
    "took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,
    "aggregations": {
        "range": {
            "buckets": [
                {
                    "to": 1.4436576E12,
                    "to_as_string": "10-2015",
                    "doc_count": 7,
                    "key": "*-10-2015"
                },
                {
                    "from": 1.4436576E12,
                    "from_as_string": "10-2015",
                    "doc_count": 0,
                    "key": "10-2015-*"
                }
            ]
        }
    }
}
```

That is what the tests consume, but it isn't valid JSON! Oh no! We don't
want to go update all the substitution rules because that'd be huge and,
ultimately, wouldn't buy much. So we quote the `$body.took` bits before
parsing the JSON.
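
Concretely, the quoting is two `replaceAll` calls (the same ones added to
`SnippetsTask` in the diff below). A quick sketch of what they do to the
first line of the response above, runnable in a Groovy shell:
```
// The two quoting substitutions applied before JSON validation
String contents = '"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,'
String quoted = contents
    // quote values starting with $
    .replaceAll(/([:,])\s*(\$[^ ,\n}]+)/, '$1 "$2"')
    // quote fields starting with $
    .replaceAll(/(\$[^ ,\n}]+)\s*:/, '"$1":')
assert quoted == '"took": "$body.took","timed_out": false,"_shards": "$body._shards","hits": "$body.hits",'
```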

Note that the responses we use for the `_cat` APIs are all converted into
regexes, so there is no expectation that they are valid JSON.

Closes #26233
Nik Everett authored on 2017-08-17 09:02:10 -04:00, committed via GitHub
commit 6d2c40e546, parent 15b7aeeb0f
12 changed files with 51 additions and 31 deletions

@@ -19,6 +19,10 @@
 package org.elasticsearch.gradle.doc
+import groovy.json.JsonException
+import groovy.json.JsonParserType
+import groovy.json.JsonSlurper
 import org.gradle.api.DefaultTask
 import org.gradle.api.InvalidUserDataException
 import org.gradle.api.file.ConfigurableFileTree
@@ -117,6 +121,23 @@ public class SnippetsTask extends DefaultTask {
                         + "contain `curl`.")
                 }
             }
+            if (snippet.testResponse && snippet.language == 'js') {
+                String quoted = snippet.contents
+                    // quote values starting with $
+                    .replaceAll(/([:,])\s*(\$[^ ,\n}]+)/, '$1 "$2"')
+                    // quote fields starting with $
+                    .replaceAll(/(\$[^ ,\n}]+)\s*:/, '"$1":')
+                JsonSlurper slurper =
+                    new JsonSlurper(type: JsonParserType.INDEX_OVERLAY)
+                try {
+                    slurper.parseText(quoted)
+                } catch (JsonException e) {
+                    throw new InvalidUserDataException("Invalid json "
+                        + "in $snippet. The error is:\n${e.message}.\n"
+                        + "After substitutions and munging, the json "
+                        + "looks like:\n$quoted", e)
+                }
+            }
             perSnippet(snippet)
             snippet = null
         }

@@ -160,7 +160,6 @@ The above `analyze` request returns the following:
 [source,js]
 --------------------------------------------------
-# Result
 {
   "tokens" : [ {
     "token" : "東京",

@@ -6,16 +6,16 @@ The `diversified_sampler` aggregation adds the ability to limit the number of ma
 NOTE: Any good market researcher will tell you that when working with samples of data it is important
 that the sample represents a healthy variety of opinions rather than being skewed by any single voice.
 The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography,
 a large spike in a timeline or an over-active forum spammer).
 .Example use cases:
 * Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
 * Removing bias from analytics by ensuring fair representation of content from different sources
 * Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms`
 A choice of `field` or `script` setting is used to provide values used for de-duplication and the `max_docs_per_value` setting controls the maximum
 number of documents collected on any one shard which share a common value. The default setting for `max_docs_per_value` is 1.
 The aggregation will throw an error if the choice of `field` or `script` produces multiple values for a single document (de-duplication using multi-valued fields is not supported due to efficiency concerns).
@@ -39,7 +39,7 @@ POST /stackoverflow/_search?size=0
         "my_unbiased_sample": {
             "diversified_sampler": {
                 "shard_size": 200,
                 "field" : "author"
             },
             "aggs": {
                 "keywords": {
@@ -89,7 +89,7 @@ Response:
 ==== Scripted example:
 In this scenario we might want to diversify on a combination of field values. We can use a `script` to produce a hash of the
 multiple values in a tags field to ensure we don't have a sample that consists of the same repeated combinations of tags.
 [source,js]
@@ -109,7 +109,7 @@ POST /stackoverflow/_search?size=0
             "script" : {
                 "lang": "painless",
                 "source": "doc['tags'].values.hashCode()"
             }
         },
         "aggs": {
             "keywords": {
@@ -150,7 +150,7 @@ Response:
                     "doc_count": 3,
                     "score": 1.34,
                     "bg_count": 200
-                },
+                }
             ]
         }
     }
@@ -175,11 +175,11 @@ The default setting is "1".
 The optional `execution_hint` setting can influence the management of the values used for de-duplication.
 Each option will hold up to `shard_size` values in memory while performing de-duplication but the type of value held can be controlled as follows:
 - hold field values directly (`map`)
 - hold ordinals of the field as determined by the Lucene index (`global_ordinals`)
 - hold hashes of the field values - with potential for hash collisions (`bytes_hash`)
 The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not.
 The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
 Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.

@@ -35,7 +35,7 @@ GET /_cat/master?v
 Might respond with:
-[source,js]
+[source,txt]
 --------------------------------------------------
 id host ip node
 u_n93zwxThWHi1PDBJAGAg 127.0.0.1 127.0.0.1 u_n93zw
@@ -57,7 +57,7 @@ GET /_cat/master?help
 Might respond respond with:
-[source,js]
+[source,txt]
 --------------------------------------------------
 id | | node id
 host | h | host name
@@ -81,7 +81,7 @@ GET /_cat/nodes?h=ip,port,heapPercent,name
 Responds with:
-[source,js]
+[source,txt]
 --------------------------------------------------
 127.0.0.1 9300 27 sLBaIGK
 --------------------------------------------------
@@ -197,7 +197,7 @@ GET _cat/templates?v&s=order:desc,index_patterns
 returns:
-[source,sh]
+[source,txt]
 --------------------------------------------------
 name index_patterns order version
 pizza_pepperoni [*pepperoni*] 2

@@ -97,7 +97,7 @@ GET /_cat/indices/twitter?pri&v&h=health,index,pri,rep,docs.count,mt
 Might look like:
-[source,js]
+[source,txt]
 --------------------------------------------------
 health index pri rep docs.count mt pri.mt
 yellow twitter 1 1 1200 16 16
@@ -115,7 +115,7 @@ GET /_cat/indices?v&h=i,tm&s=tm:desc
 Might look like:
-[source,js]
+[source,txt]
 --------------------------------------------------
 i tm
 twitter 8.1gb

@@ -49,7 +49,7 @@ GET /_cat/nodeattrs?v&h=name,pid,attr,value
 Might look like:
-[source,js]
+[source,txt]
 --------------------------------------------------
 name pid attr value
 EK_AsJb 19566 testattr test

@@ -53,7 +53,7 @@ GET /_cat/nodes?v&h=id,ip,port,v,m
 Might look like:
-["source","js",subs="attributes,callouts"]
+["source","txt",subs="attributes,callouts"]
 --------------------------------------------------
 id ip port v m
 veJR 127.0.0.1 59938 {version} *

@@ -21,7 +21,7 @@ GET _cat/recovery?v
 The response of this request will be something like:
-[source,js]
+[source,txt]
 ---------------------------------------------------------------------------
 index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
 twitter 0 13ms store done n/a n/a 127.0.0.1 node-0 n/a n/a 0 0 100% 13 0 0 100% 9928 0 0 100.0%
@@ -48,7 +48,7 @@ GET _cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp
 This will return a line like:
-[source,js]
+[source,txt]
 ----------------------------------------------------------------------------
 i s t ty st shost thost f fp b bp
 twitter 0 1252ms peer done 192.168.1.1 192.168.1.2 0 100.0% 0 100.0%
@@ -76,7 +76,7 @@ GET _cat/recovery?v&h=i,s,t,ty,st,rep,snap,f,fp,b,bp
 This will show a recovery of type snapshot in the response
-[source,js]
+[source,txt]
 --------------------------------------------------------------------------------
 i s t ty st rep snap f fp b bp
 twitter 0 1978ms snapshot done twitter snap_1 79 8.0% 12086 9.0%

@@ -16,7 +16,7 @@ GET _cat/shards
 This will return
-[source,js]
+[source,txt]
 ---------------------------------------------------------------------------
 twitter 0 p STARTED 3014 31.1mb 192.168.56.10 H5dfFeA
 ---------------------------------------------------------------------------
@@ -42,7 +42,7 @@ GET _cat/shards/twitt*
 Which will return the following
-[source,js]
+[source,txt]
 ---------------------------------------------------------------------------
 twitter 0 p STARTED 3014 31.1mb 192.168.56.10 H5dfFeA
 ---------------------------------------------------------------------------
@@ -68,7 +68,7 @@ GET _cat/shards
 A relocating shard will be shown as follows
-[source,js]
+[source,txt]
 ---------------------------------------------------------------------------
 twitter 0 p RELOCATING 3014 31.1mb 192.168.56.10 H5dfFeA -> -> 192.168.56.30 bGG90GE
 ---------------------------------------------------------------------------
@@ -90,7 +90,7 @@ GET _cat/shards
 You can get the initializing state in the response like this
-[source,js]
+[source,txt]
 ---------------------------------------------------------------------------
 twitter 0 p STARTED 3014 31.1mb 192.168.56.10 H5dfFeA
 twitter 0 r INITIALIZING 0 14.3mb 192.168.56.30 bGG90GE
@@ -112,7 +112,7 @@ GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
 The reason for an unassigned shard will be listed as the last field
-[source,js]
+[source,txt]
 ---------------------------------------------------------------------------
 twitter 0 p STARTED 3014 31.1mb 192.168.56.10 H5dfFeA
 twitter 0 r STARTED 3014 31.1mb 192.168.56.30 bGG90GE

@@ -92,7 +92,7 @@ GET /_cat/thread_pool/generic?v&h=id,name,active,rejected,completed
 which looks like:
-[source,js]
+[source,txt]
 --------------------------------------------------
 id name active rejected completed
 0EWUhXeBQtaVGlexUeVwMg generic 0 0 70

@@ -170,7 +170,7 @@ the request was ignored.
     "_type": "type1",
     "_id": "1",
     "_version": 6,
-    "result": noop
+    "result": "noop"
 }
 --------------------------------------------------
 // TESTRESPONSE

@@ -373,7 +373,7 @@ PUT /customer/doc/1?pretty
 And the response:
-[source,sh]
+[source,js]
 --------------------------------------------------
 {
   "_index" : "customer",
@@ -672,7 +672,7 @@ GET /_cat/indices?v
 And the response:
-[source,js]
+[source,txt]
 --------------------------------------------------
 health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
 yellow open bank l7sSYV2cQXmu6_4rJWVIww 5 1 1000 0 128.6kb 128.6kb