Docusaurus build framework + ingestion doc refresh. (#8311)

* Docusaurus build framework + ingestion doc refresh.

* stick to npm instead of yarn

* fix typos

* restore some _bin

* Adjustments.

* detect and fix redirect anchors

* update anchor lint

* Web-console: remove specific column filters (#8343)

* add clear filter

* update tool kit

* remove useless check

* auto run

* add %

* Fix resource leak (#8337)

* Fix resource leak

* Patch comments

* Enable Spotbugs NP_NONNULL_RETURN_VIOLATION (#8234)

* Fixes from PR review.

* Fix more anchors.

* Preamble nix.

* Fix more anchors, headers

* clean up placeholder page

* add to website lint to travis config

* better broken link checking

* travis fix

* Fixed more broken links

* better redirects

* unfancy catch

* fix LGTM error

* link fixes

* fix md issues

* Addl fixes
Gian Merlino 2019-08-20 21:48:59 -07:00 committed by GitHub
parent c4db83608f
commit d007477742
GPG Key ID: 4AEE18F83AFDEB23
285 changed files with 15967 additions and 30680 deletions

@ -174,6 +174,9 @@ matrix:
      script:
        - ${MVN} test -pl 'web-console'
    - name: "docs"
      script: cd website && npm install && npm run lint
    - name: "batch index integration test"
      services: &integration_test_services
        - docker

@ -1,65 +0,0 @@
#!/usr/bin/env python3
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import sys
def normalize_target(redirect):
    dirname = os.path.dirname(redirect["source"])
    normalized = os.path.normpath(os.path.join(dirname, redirect["target"]))
    return normalized
if len(sys.argv) != 3:
    sys.stderr.write('usage: program <docs dir> <redirect.json>\n')
    sys.exit(1)
docs_directory = sys.argv[1]
redirect_json = sys.argv[2]
with open(redirect_json, 'r') as f:
    redirects = json.loads(f.read())
all_sources = {}
# Index all redirect sources
for redirect in redirects:
    all_sources[redirect["source"]] = 1
# Create redirects
for redirect in redirects:
    source = redirect["source"]
    target = redirect["target"]
    source_file = os.path.join(docs_directory, source)
    # Ensure redirect source doesn't exist yet.
    if os.path.exists(source_file):
        raise Exception('Redirect source is an actual file: ' + source)
    # Ensure target *does* exist, if relative.
    if not target.startswith("/"):
        target_file = os.path.join(docs_directory, normalize_target(redirect))
        if not os.path.exists(target_file) and source not in all_sources:
            raise Exception('Redirect target does not exist for source: ' + source)
    # Write redirect file
    os.makedirs(os.path.dirname(source_file), exist_ok=True)
    with open(source_file, 'w') as f:
        f.write("---\n")
        f.write("layout: redirect_page\n")
        f.write("redirect_target: " + target + "\n")
        f.write("---\n")
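
As a minimal sketch of how `normalize_target` in the removed script resolves a relative redirect target against the directory of its source, consider the snippet below. The two redirect entries are hypothetical and not taken from the real redirect JSON, and `posixpath` is used instead of `os.path` only so that the result does not depend on the platform's path separator.

```python
import posixpath


def normalize_target(redirect):
    # Resolve the target relative to the directory that holds the source page.
    dirname = posixpath.dirname(redirect["source"])
    return posixpath.normpath(posixpath.join(dirname, redirect["target"]))


# Hypothetical redirect entries, for illustration only.
sibling = {"source": "ingestion/batch.html", "target": "native.html"}
parent = {"source": "ingestion/batch.html", "target": "../querying/sql.html"}

print(normalize_target(sibling))
print(normalize_target(parent))
```

A target in the same directory stays under `ingestion/`, while a `../` target is collapsed to a path relative to the docs root, which is the path the script then checks for existence.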

@ -1,54 +0,0 @@
#!/usr/bin/env python3
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import subprocess
import sys
deleted_paths_dict = {}
# assumes docs/latest in the doc repo has the current files for the next release
# deletes docs for old versions and copies docs/latest into the old versions
# run `git status | grep deleted:` on the doc repo to see what pages were deleted and feed that into
# missing-redirect-finder2.py
def main():
    if len(sys.argv) != 2:
        sys.stderr.write('usage: program <druid-docs-repo-path>\n')
        sys.exit(1)
    druid_docs_path = sys.argv[1]
    druid_docs_path = "{}/docs".format(druid_docs_path)
    prev_release_doc_paths = os.listdir(druid_docs_path)
    for doc_path in prev_release_doc_paths:
        if (doc_path != "img" and doc_path != "latest"):
            print("DOC PATH: " + doc_path)
            try:
                command = "rm -rf {}/{}/*".format(druid_docs_path, doc_path)
                outstr = subprocess.check_output(command, shell=True).decode('UTF-8')
                command = "cp -r {}/latest/* {}/{}/".format(druid_docs_path, druid_docs_path, doc_path)
                outstr = subprocess.check_output(command, shell=True).decode('UTF-8')
            except:
                print("error in path: " + doc_path)
                continue
if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print('Interrupted, closing.')

@ -1,49 +0,0 @@
#!/usr/bin/env python3
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import sys
# Takes the output of `git status | grep deleted:` on the doc repo
# and cross references deleted pages with the _redirects.json file
if len(sys.argv) != 3:
    sys.stderr.write('usage: program <del_paths_file> <redirect.json file>\n')
    sys.exit(1)
del_paths = sys.argv[1]
redirect_json_path = sys.argv[2]
dep_dict = {}
with open(del_paths, 'r') as del_paths_file:
    for line in del_paths_file.readlines():
        subidx = line.index("/", 0)
        line2 = line[subidx+1:]
        subidx = line2.index("/", 0)
        line3 = line2[subidx+1:]
        dep_dict[line3.strip("\n")] = True
existing_redirects = {}
with open(redirect_json_path, 'r') as redirect_json_file:
    redirect_json = json.load(redirect_json_file)
    for redirect_entry in redirect_json:
        redirect_source = redirect_entry["source"]
        redirect_source = redirect_source.replace(".html", ".md")
        existing_redirects[redirect_source] = True
for dep in dep_dict:
    if dep not in existing_redirects:
        print("MISSING REDIRECT: " + dep)
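
The path handling in the removed script above can be sketched in isolation. The snippet below is illustrative only: the helper name `strip_two_components` and the sample `git status` line are assumptions, chosen to show how everything up to the second `/` is dropped so that the remainder can be compared with a redirect source whose `.html` suffix has been rewritten to `.md`.

```python
def strip_two_components(line):
    # Drop everything up to and including the first "/", then repeat once,
    # mirroring the two index("/") calls in the removed script.
    rest = line[line.index("/") + 1:]
    rest = rest[rest.index("/") + 1:]
    return rest.strip("\n")


# Hypothetical line in the style of `git status | grep deleted:` output.
sample = "\tdeleted:    docs/latest/ingestion/batch.md\n"
deleted_page = strip_two_components(sample)

# Hypothetical redirect source, normalized the same way the script does.
redirect_source = "ingestion/batch.html".replace(".html", ".md")

print(deleted_page)
print(deleted_page == redirect_source)
```

Note that `str.index` raises `ValueError` when a line contains fewer than two `/` characters, so input lines that are not nested at least two directories deep would stop the original script.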

@ -1,73 +0,0 @@
#!/usr/bin/env python3
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import re
import shutil
import sys
# Helper program for generating LICENSE contents for dependencies under web-console.
# Generates entries for MIT-licensed deps and dumps info for non-MIT deps.
# Uses JSON output from https://www.npmjs.com/package/license-checker.
if len(sys.argv) != 3:
    sys.stderr.write('usage: program <license-report-path> <license-output-path>\n')
    sys.stderr.write('Run the following command in web-console/ to generate the input license report:\n')
    sys.stderr.write(' license-checker --production --json\n')
    sys.exit(1)
license_report_path = sys.argv[1]
license_output_path = sys.argv[2]
non_mit_licenses = []
license_entry_template = "This product bundles {} version {}, copyright {},\n which is available under an MIT license. For details, see licenses/{}.MIT.\n"
with open(license_report_path, 'r') as license_report_file:
    license_report = json.load(license_report_file)
    for dependency_name_version in license_report:
        dependency = license_report[dependency_name_version]
        match_result = re.match("(.+)@(.+)", dependency_name_version)
        dependency_name = match_result.group(1)
        nice_dependency_name = dependency_name.replace("/", "-")
        dependency_ver = match_result.group(2)
        try:
            licenseType = dependency["licenses"]
            licenseFile = dependency["licenseFile"]
        except:
            print("No license file for {}".format(dependency_name_version))
        try:
            publisher = dependency["publisher"]
        except:
            publisher = ""
        if licenseType != "MIT":
            non_mit_licenses.append(dependency)
            continue
        fullDependencyPath = dependency["path"]
        partialDependencyPath = re.match(".*/(web-console.*)", fullDependencyPath).group(1)
        print(license_entry_template.format(dependency_name, dependency_ver, publisher, nice_dependency_name))
        shutil.copy2(licenseFile, license_output_path + "/" + nice_dependency_name + ".MIT")
print("\nNon-MIT licenses:\n--------------------\n")
for non_mit_license in non_mit_licenses:
    print(non_mit_license)
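
The `name@version` parsing in the removed script above relies on the greedy `(.+)@(.+)` pattern, which also copes with scoped npm packages whose names begin with `@`. A minimal sketch follows; the helper name `split_name_version` and the sample keys are hypothetical and only stand in for keys of the license report.

```python
import re


def split_name_version(key):
    # Greedy (.+) backtracks to the last "@", so a leading "@" in a scoped
    # package name stays part of the name rather than splitting it.
    match = re.match("(.+)@(.+)", key)
    if match is None:
        raise ValueError("not a name@version key: " + key)
    return match.group(1), match.group(2)


# Hypothetical report keys, for illustration only.
plain_name, plain_version = split_name_version("react@16.8.6")
scoped_name, scoped_version = split_name_version("@blueprintjs/core@3.15.1")

# The script replaces "/" so the name is usable as a license file name.
nice_scoped_name = scoped_name.replace("/", "-")

print(plain_name, plain_version)
print(scoped_name, scoped_version, nice_scoped_name)
```

Unlike the original, this sketch raises a `ValueError` on a key without `@` instead of failing on `match_result.group(1)` with an `AttributeError`; that check is an addition for illustration, not part of the removed script.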

@ -1,23 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: Druid Documentation
markdown: redcarpet
redcarpet:
  extensions: ["tables", "no_intra_emphasis", "fenced_code_blocks", "with_toc_data"]
highlighter: pygments

File diff suppressed because one or more lines are too long

@ -1,743 +0,0 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
viewBox="40 36 466.7041 318.99965"
width="583.38013"
height="398.74957"
id="svg2"
inkscape:version="0.48.2 r9819"
sodipodi:docname="druid-dataflow.svg"
inkscape:export-filename="/Users/xavier/mmx/druid/docs/_graphics/druid-dataflow@2x.png"
inkscape:export-xdpi="244.5267"
inkscape:export-ydpi="244.5267">
<sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="1616"
inkscape:window-height="949"
id="namedview318"
showgrid="false"
fit-margin-top="0"
fit-margin-left="0"
fit-margin-right="0"
fit-margin-bottom="0"
inkscape:zoom="1.3366372"
inkscape:cx="248.71225"
inkscape:cy="133.85839"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="0"
inkscape:current-layer="layer1" />
<metadata
id="metadata4">
<dc:date>2013-07-10 16:52Z</dc:date>
<!-- Produced by OmniGraffle Professional 5.4.4 -->
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<defs
id="defs6">
<filter
id="Shadow"
filterUnits="userSpaceOnUse"
color-interpolation-filters="sRGB">
<feOffset
in="SourceAlpha"
result="offset"
dx="0"
dy="2"
id="feOffset9" />
<feFlood
flood-color="black"
flood-opacity=".12"
result="flood"
id="feFlood11" />
<feComposite
in="flood"
in2="offset"
operator="in"
id="feComposite13" />
</filter>
<font-face
font-family="Open Sans"
font-size="12"
panose-1="2 11 7 6 3 8 4 2 2 4"
units-per-em="1000"
underline-position="-75.195312"
underline-thickness="49.804688"
slope="0"
x-height="549.8047"
cap-height="724.1211"
ascent="1068.8477"
descent="-292.96875"
font-weight="bold"
id="font-face15"
stemv="0"
stemh="0"
accent-height="0"
ideographic="0"
alphabetic="0"
mathematical="0"
hanging="0"
v-ideographic="0"
v-alphabetic="0"
v-mathematical="0"
v-hanging="0"
strikethrough-position="0"
strikethrough-thickness="0"
overline-position="0"
overline-thickness="0">
<font-face-src
id="font-face-src17">
<font-face-name
name="OpenSans-Semibold"
id="font-face-name19" />
</font-face-src>
</font-face>
<marker
orient="auto"
overflow="visible"
markerUnits="strokeWidth"
id="FilledArrow_Marker"
viewBox="-1 -3 5 6"
markerWidth="5"
markerHeight="6"
style="color:#7f95a7;overflow:visible">
<g
id="g22">
<path
d="M 2.8800001,0 0,-1.08 0,1.08 z"
id="path24"
inkscape:connector-curvature="0"
style="fill:currentColor;stroke:currentColor;stroke-width:1" />
</g>
</marker>
<marker
orient="auto"
overflow="visible"
markerUnits="strokeWidth"
id="FilledArrow_Marker_2"
viewBox="-4 -3 5 6"
markerWidth="5"
markerHeight="6"
style="color:#7f95a7;overflow:visible">
<g
id="g27">
<path
d="M -2.8800001,0 0,1.08 0,-1.08 z"
id="path29"
inkscape:connector-curvature="0"
style="fill:currentColor;stroke:currentColor;stroke-width:1" />
</g>
</marker>
<font-face
font-family="Open Sans"
font-size="18"
panose-1="2 11 6 6 3 5 4 2 2 4"
units-per-em="1000"
underline-position="-75.195312"
underline-thickness="49.804688"
slope="0"
x-height="544.92188"
cap-height="724.1211"
ascent="1068.8477"
descent="-292.96875"
font-weight="500"
id="font-face31"
stemv="0"
stemh="0"
accent-height="0"
ideographic="0"
alphabetic="0"
mathematical="0"
hanging="0"
v-ideographic="0"
v-alphabetic="0"
v-mathematical="0"
v-hanging="0"
strikethrough-position="0"
strikethrough-thickness="0"
overline-position="0"
overline-thickness="0">
<font-face-src
id="font-face-src33">
<font-face-name
name="OpenSans"
id="font-face-name35" />
</font-face-src>
</font-face>
<font-face
font-family="Open Sans"
font-size="10"
panose-1="2 11 6 6 3 5 4 2 2 4"
units-per-em="1000"
underline-position="-75.195312"
underline-thickness="49.804688"
slope="0"
x-height="544.92188"
cap-height="724.1211"
ascent="1068.8477"
descent="-292.96875"
font-weight="500"
id="font-face37"
stemv="0"
stemh="0"
accent-height="0"
ideographic="0"
alphabetic="0"
mathematical="0"
hanging="0"
v-ideographic="0"
v-alphabetic="0"
v-mathematical="0"
v-hanging="0"
strikethrough-position="0"
strikethrough-thickness="0"
overline-position="0"
overline-thickness="0">
<font-face-src
id="font-face-src39">
<font-face-name
name="OpenSans"
id="font-face-name41" />
</font-face-src>
</font-face>
<filter
inkscape:collect="always"
id="filter5034"
x="-0.12"
width="1.24"
y="-0.12"
height="1.24"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="1.8"
id="feGaussianBlur5036" />
</filter>
<filter
inkscape:collect="always"
id="filter5038"
x="-0.12"
width="1.24"
y="-0.12"
height="1.24"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="1.8"
id="feGaussianBlur5040" />
</filter>
<filter
inkscape:collect="always"
id="filter5042"
x="-0.12"
width="1.24"
y="-0.12"
height="1.24"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="1.8"
id="feGaussianBlur5044" />
</filter>
<filter
inkscape:collect="always"
id="filter5046"
x="-0.12"
width="1.24"
y="-0.12"
height="1.24"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="1.8"
id="feGaussianBlur5048" />
</filter>
<filter
inkscape:collect="always"
id="filter5050"
x="-0.12"
width="1.24"
y="-0.12"
height="1.24"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="1.8"
id="feGaussianBlur5052" />
</filter>
<filter
inkscape:collect="always"
id="filter5054"
x="-0.12"
width="1.24"
y="-0.12"
height="1.24"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="1.8"
id="feGaussianBlur5056" />
</filter>
<filter
inkscape:collect="always"
id="filter5058"
x="-0.18000001"
width="1.36"
y="-0.18000001"
height="1.36"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="2.7"
id="feGaussianBlur5060" />
</filter>
<filter
inkscape:collect="always"
id="filter5062"
x="-0.18000001"
width="1.36"
y="-0.18000001"
height="1.36"
color-interpolation-filters="sRGB">
<feGaussianBlur
inkscape:collect="always"
stdDeviation="2.7"
id="feGaussianBlur5064" />
</filter>
</defs>
<g
inkscape:groupmode="layer"
id="layer1"
inkscape:label="Layer2"
style="display:inline"
transform="translate(-1.2070312,-2.461015e-6)">
<text
sodipodi:linespacing="125%"
y="53.362823"
x="68.675781"
id="text109"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;writing-mode:lr-tb;text-anchor:middle;fill:#33424f;stroke:none;font-family:Open Sans;-inkscape-font-specification:Open Sans">
<tspan
y="53.362823"
x="68.675781"
id="tspan4550"
sodipodi:role="line">streaming</tspan>
<tspan
y="68.362823"
x="68.675781"
id="tspan4552"
sodipodi:role="line">data</tspan>
</text>
<text
sodipodi:linespacing="125%"
y="145.23154"
x="487.89328"
id="text115"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;writing-mode:lr-tb;text-anchor:middle;fill:#33424f;stroke:none;font-family:Open Sans;-inkscape-font-specification:Open Sans">
<tspan
font-size="12"
font-weight="bold"
x="493.09863"
y="158.23154"
textLength="43.283203"
id="tspan117"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;writing-mode:lr-tb;text-anchor:middle;fill:#33424f;font-family:Open Sans;-inkscape-font-specification:Open Sans">client</tspan>
</text>
<line
x1="106.93243"
y1="56.438995"
x2="132.23264"
y2="56.438995"
id="line119"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<text
sodipodi:linespacing="125%"
y="253.6384"
x="64.488861"
id="text121"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;writing-mode:lr-tb;text-anchor:middle;fill:#33424f;stroke:none;font-family:Open Sans;-inkscape-font-specification:Open Sans">
<tspan
font-size="12"
font-weight="bold"
x="68.760742"
y="266.6384"
textLength="32.71289"
id="tspan123"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;writing-mode:lr-tb;text-anchor:middle;fill:#33424f;font-family:Open Sans;-inkscape-font-specification:Open Sans">batch</tspan>
<tspan
font-size="12"
font-weight="bold"
x="72.05957"
y="283.6384"
textLength="26.115234"
id="tspan125"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;writing-mode:lr-tb;text-anchor:middle;fill:#33424f;font-family:Open Sans;-inkscape-font-specification:Open Sans">data</tspan>
</text>
<line
x1="107.43243"
y1="270.6384"
x2="132.73264"
y2="270.6384"
id="line127"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<line
x1="159.79263"
y1="74.938965"
x2="159.79263"
y2="243.0784"
id="line129"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<path
sodipodi:nodetypes="cc"
inkscape:connector-curvature="0"
id="line131"
d="m 185.06378,57.75783 214.03664,87.3856"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-start:url(#FilledArrow_Marker_2);marker-end:url(#FilledArrow_Marker)" />
<path
sodipodi:nodetypes="cc"
inkscape:connector-curvature="0"
id="line133"
d="m 178.29264,270.63837 121.01422,0"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<line
x1="355.03683"
y1="245.13406"
x2="405.93555"
y2="180.64258"
id="line135"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-start:url(#FilledArrow_Marker_2);marker-end:url(#FilledArrow_Marker)" />
<line
x1="353.40796"
y1="155.13821"
x2="398.50443"
y2="155.13821"
id="line137"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<line
x1="243.03348"
y1="155.9173"
x2="243.0354"
y2="167.45596"
id="line139"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<path
sodipodi:nodetypes="cc"
inkscape:connector-curvature="0"
id="line141"
d="m 277.8411,176.62586 30.29726,-9.86829"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-start:url(#FilledArrow_Marker_2);marker-end:url(#FilledArrow_Marker)" />
<path
sodipodi:nodetypes="cc"
inkscape:connector-curvature="0"
id="line143"
d="M 184.42804,68.73894 308.5797,140.29898"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-start:url(#FilledArrow_Marker_2);marker-end:url(#FilledArrow_Marker)" />
<line
x1="177.49472"
y1="73.660522"
x2="218.83446"
y2="113.87802"
id="line145"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-end:url(#FilledArrow_Marker)" />
<line
x1="334.90796"
y1="243.0784"
x2="334.90796"
y2="182.69818"
id="line147"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-start:url(#FilledArrow_Marker_2);marker-end:url(#FilledArrow_Marker)" />
<path
style="fill:#89d735;fill-opacity:1;stroke:none"
inkscape:connector-curvature="0"
id="path177"
d="m 142.63817,36.49612 30,0 c 1.65685,0 3,1.34314 3,3 l 0,30 c 0,1.65686 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34314 -3,-3 l 0,-30 c 0,-1.65686 1.34315,-3 3,-3 z" />
<path
style="fill:none;stroke:#60902c;stroke-width:1;stroke-linecap:round;stroke-linejoin:round;stroke-opacity:1"
inkscape:connector-curvature="0"
id="path179"
d="m 142.63817,36.49611 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65686 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34314 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z" />
<text
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Open Sans;-inkscape-font-specification:Open Sans"
sodipodi:linespacing="125%"
id="text3436"
x="142.0069"
y="59.447937">
<tspan
y="59.447937"
x="142.0069"
id="tspan4273"
sodipodi:role="line">realtime</tspan>
<tspan
y="69.447937"
x="142.0069"
id="tspan4275"
sodipodi:role="line">nodes</tspan>
</text>
<path
sodipodi:nodetypes="sssssssss"
style="fill:#687de9;stroke:none"
inkscape:connector-curvature="0"
id="path186"
d="m 311.90795,252.6384 46.0002,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -46.0002,0 c -1.65686,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65685 1.34314,-3 3,-3 z" />
<path
sodipodi:nodetypes="sssssssss"
style="fill:none;stroke:#3446b0;stroke-width:1;stroke-linecap:round;stroke-linejoin:round"
inkscape:connector-curvature="0"
id="path188"
d="m 311.90795,252.6384 46.0002,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -46.0002,0 c -1.65686,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65685 1.34314,-3 3,-3 z" />
<text
sodipodi:linespacing="125%"
y="274.90717"
x="311.92163"
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Sans;-inkscape-font-specification:Sans"
id="text190">
<tspan
y="274.90717"
x="311.92163"
id="tspan4477"
sodipodi:role="line">historical</tspan>
<tspan
y="284.90717"
x="311.92163"
id="tspan4479"
sodipodi:role="line">nodes</tspan>
</text>
<path
style="opacity:0.5;fill:#fd664a;stroke:none;filter:url(#filter5046)"
inkscape:connector-curvature="0"
id="path195"
d="m 228.0304,119.4173 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
transform="translate(-2.6747754e-6,-3.30513e-8)" />
<path
style="opacity:0.5;fill:none;stroke:#f3472c;stroke-width:1;stroke-linecap:round;stroke-linejoin:round;filter:url(#filter5042)"
inkscape:connector-curvature="0"
id="path197"
d="m 228.0304,119.4173 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
transform="translate(-2.6747754e-6,-3.30513e-8)" />
<text
sodipodi:linespacing="125%"
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Sans;-inkscape-font-specification:Sans"
id="text199"
x="227.10298"
y="151.71964">
<tspan
y="151.71964"
x="227.10298"
id="tspan4291"
sodipodi:role="line">MySQL</tspan>
</text>
<path
sodipodi:nodetypes="sssssssss"
style="fill:#3fbab2;stroke:none"
inkscape:connector-curvature="0"
id="path204"
d="m 220.03055,177.01595 45.99985,0 c 1.65685,0 3,1.34314 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -45.99985,0 c -1.65685,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65686 1.34315,-3 3,-3 z" />
<path
sodipodi:nodetypes="sssssssscsss"
style="fill:none;stroke:#1e9189;stroke-width:1;stroke-linecap:round;stroke-linejoin:round"
inkscape:connector-curvature="0"
id="path206"
d="m 220.03055,177.01595 45.99985,0 c 1.65685,0 3,1.34314 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -45.99985,0 c -0.82843,0 -1.57843,-0.33579 -2.12132,-0.87868 -0.54289,-0.54289 -0.87868,-1.29289 -0.87868,-2.12132 l 0,-15 0,-15 c 0,-0.82843 0.33579,-1.57843 0.87868,-2.12132 0.54289,-0.5429 1.29289,-0.87868 2.12132,-0.87868 z" />
<text
y="200.61603"
x="218.88231"
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Sans;-inkscape-font-specification:Sans"
id="text208"
sodipodi:linespacing="125%">
<tspan
y="200.61603"
x="218.88231"
id="tspan4489"
sodipodi:role="line">coordinator</tspan>
<tspan
y="210.61603"
x="218.88231"
id="tspan4491"
sodipodi:role="line">nodes</tspan>
</text>
<path
style="opacity:0.5;fill:#191f7d;fill-opacity:1;stroke:none;filter:url(#filter5038)"
inkscape:connector-curvature="0"
id="path213"
d="m 144.79263,252.63839 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65686 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34314 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
transform="translate(-2.6747754e-6,-3.30513e-8)" />
<path
style="opacity:0.5;fill:none;stroke:#080b3e;stroke-width:1;stroke-linecap:round;stroke-linejoin:round;stroke-opacity:1;filter:url(#filter5034)"
inkscape:connector-curvature="0"
id="path215"
d="m 144.79263,252.63839 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65686 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34314 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
transform="translate(-2.6747754e-6,-3.30513e-8)" />
<text
sodipodi:linespacing="125%"
y="274.63846"
x="144.0856"
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Sans;-inkscape-font-specification:Sans"
id="text217">
<tspan
y="274.63846"
x="144.0856"
id="tspan4485"
sodipodi:role="line">deep</tspan>
<tspan
y="284.63846"
x="144.0856"
id="tspan4487"
sodipodi:role="line">storage</tspan>
</text>
<path
style="opacity:0.5;fill:#d443a4;stroke:none;filter:url(#filter5054)"
inkscape:connector-curvature="0"
id="path222"
d="m 319.90796,137.13819 30,0 c 1.65686,0 3,1.34315 3,3 l 0,30 c 0,1.65686 -1.34314,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34314 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
transform="translate(-2.6747754e-6,-3.30513e-8)" />
<path
style="opacity:0.5;fill:none;stroke:#ad3184;stroke-width:1;stroke-linecap:round;stroke-linejoin:round;filter:url(#filter5050)"
inkscape:connector-curvature="0"
id="path224"
d="m 319.90796,137.13819 30,0 c 1.65686,0 3,1.34315 3,3 l 0,30 c 0,1.65686 -1.34314,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34314 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
transform="translate(-2.6747754e-6,-3.30513e-8)" />
<text
y="159.69537"
x="318.52225"
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Sans;-inkscape-font-specification:Sans"
id="text226"
sodipodi:linespacing="125%">
<tspan
y="159.69537"
x="318.52225"
id="tspan4309"
sodipodi:role="line">Zoo</tspan>
<tspan
y="169.69537"
x="318.52225"
id="tspan4311"
sodipodi:role="line">Keeper</tspan>
</text>
<line
x1="453.12442"
y1="154.63824"
x2="467.89215"
y2="154.63828"
id="line293"
style="fill:none;stroke:#7f95a7;stroke-width:2;stroke-linecap:round;stroke-linejoin:round;marker-start:url(#FilledArrow_Marker_2);marker-end:url(#FilledArrow_Marker)" />
<path
style="fill:#fbae4e;stroke:none"
inkscape:connector-curvature="0"
id="path296"
d="m 411.06442,137.13821 30,0 c 1.65685,0 3,1.34314 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -30,0 c -1.65687,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65686 1.34313,-3 3,-3 z" />
<path
style="fill:none;stroke:#e48819;stroke-width:1;stroke-linecap:round;stroke-linejoin:round"
inkscape:connector-curvature="0"
id="path298"
d="m 411.06442,137.13821 30,0 c 1.65685,0 3,1.34314 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -30,0 c -1.65687,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65686 1.34313,-3 3,-3 z" />
<text
sodipodi:linespacing="125%"
y="159.77753"
x="409.95114"
style="font-size:8px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#ffffff;stroke:none;font-family:Sans;-inkscape-font-specification:Sans"
id="text300">
<tspan
y="159.77753"
x="409.95114"
id="tspan4481"
sodipodi:role="line">broker</tspan>
<tspan
y="169.77753"
x="409.95114"
id="tspan4483"
sodipodi:role="line">nodes</tspan>
</text>
<path
d="m 424.40234,240.03369 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
id="path5004"
inkscape:connector-curvature="0"
style="opacity:0.5;fill:#a3a3a3;fill-opacity:1;stroke:none;filter:url(#filter5062)"
transform="matrix(0.5,0,0,0.5,170.45068,213.49402)" />
<path
d="m 424.40234,240.03369 30,0 c 1.65685,0 3,1.34315 3,3 l 0,30 c 0,1.65685 -1.34315,3 -3,3 l -30,0 c -1.65685,0 -3,-1.34315 -3,-3 l 0,-30 c 0,-1.65685 1.34315,-3 3,-3 z"
id="path5006"
inkscape:connector-curvature="0"
style="opacity:0.5;fill:none;stroke:#8f8f8f;stroke-width:1;stroke-linecap:round;stroke-linejoin:round;stroke-opacity:1;filter:url(#filter5058)"
transform="matrix(0.5,0,0,0.5,170.45068,213.49402)" />
<text
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:600;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#33424f;stroke:none;font-family:Open Sans;-inkscape-font-specification:Open Sans Semi-Bold"
id="text5012"
x="399.03949"
y="315.00461"
sodipodi:linespacing="125%">
<tspan
style="font-size:9.60000038px;font-style:normal;font-variant:normal;font-weight:300;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#33424f;font-family:Open Sans;-inkscape-font-specification:Open Sans Light"
id="tspan5016"
textLength="26.115234"
y="345.00461"
x="406.6102"
font-weight="bold"
font-size="12">external dependencies</tspan>
</text>
<path
d="m 382.65185,305.62966 15,0 c 0.82843,0 1.5,0.67157 1.5,1.5 l 0,15 c 0,0.82842 -0.67157,1.5 -1.5,1.5 l -15,0 c -0.82844,0 -1.5,-0.67158 -1.5,-1.5 l 0,-15 c 0,-0.82843 0.67156,-1.5 1.5,-1.5 z"
id="path5026"
inkscape:connector-curvature="0"
style="fill:#a5a5a5;fill-opacity:1;stroke:none" />
<path
d="m 382.65185,305.62966 15,0 c 0.82843,0 1.5,0.67157 1.5,1.5 l 0,15 c 0,0.82842 -0.67157,1.5 -1.5,1.5 l -15,0 c -0.82844,0 -1.5,-0.67158 -1.5,-1.5 l 0,-15 c 0,-0.82843 0.67156,-1.5 1.5,-1.5 z"
id="path5028"
inkscape:connector-curvature="0"
style="fill:none;stroke:#7f7f7f;stroke-width:0.5;stroke-linecap:round;stroke-linejoin:round;stroke-opacity:1" />
<text
sodipodi:linespacing="125%"
y="287.12341"
x="399.03949"
id="text5030"
style="font-size:12px;font-style:normal;font-variant:normal;font-weight:600;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#33424f;stroke:none;font-family:Open Sans;-inkscape-font-specification:Open Sans Semi-Bold">
<tspan
font-size="12"
font-weight="bold"
x="406.6102"
y="317.12341"
textLength="26.115234"
id="tspan5032"
style="font-size:9.60000038px;font-style:normal;font-variant:normal;font-weight:300;font-stretch:normal;text-align:start;line-height:125%;writing-mode:lr-tb;text-anchor:start;fill:#33424f;font-family:Open Sans;-inkscape-font-specification:Open Sans Light">druid components</tspan>
</text>
</g>
</svg>

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

Binary file not shown.

View File

@ -1,57 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// dot -Gnewrank -Tpng indexing_service.dot > indexing_service.png
digraph g {
node [ fontname = "Helvetica Neue" ]
edge [ fontname = "Helvetica Neue Light Italic" fontsize = 12]
new_task[shape="plaintext" fontname="Helvetica Neue Light Italic"]
overlord[shape="box" label="Overlord"]
new_task -> overlord
overlord -> zk_tasks:mm1:n [label = "new_task"]
zk_tasks:mm1 -> mm1 [label = "new_task"]
subgraph cluster_0 {
style = "dotted"
label = "ZooKeeper"
fontname = "Helvetica Neue"
zk_status -> zk_tasks [style="invis"]
zk_status [fontname="Source Code Pro" shape = record label = "<status> /status | { <new_task> /new_task }"]
zk_tasks [fontname="Source Code Pro" shape=record label="<tasks> /tasks | {<mm1> /mm1 | <mm2> /mm2 | <mm3> /mm3}"]
{ rank = same; zk_status zk_tasks }
}
subgraph cluster_mm1 {
style="dotted"
mm1 [shape = "box" label = "Middle Manager 1" ]
peon_11[label = "peon"]
peon_12[label = "peon"]
peon_13[label = "peon"]
mm1 -> {peon_11;peon_12}
mm1 -> peon_13 [label = "new_task"]
mm1 -> peon_13:e [label = "new_task_status" dir=back]
}
zk_status:new_task:s -> mm1:e [label = "new_task_status" dir = back]
overlord:e -> zk_status:new_task:n [dir=back label="new_task_status"]
}

View File

@ -1,29 +0,0 @@
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<!-- Start page_footer include -->
<div class="container">
<footer>
<div class="container">
</div>
</footer>
</div>
<script src="http://code.jquery.com/jquery.min.js"></script>
<!-- stop page_footer include -->

View File

@ -1,37 +0,0 @@
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<!-- Start page_header include -->
<div class="navbar navbar-inverse navbar-static-top druid-nav">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li {% if page.sectionid == 'docs' %} class="active"{% endif %}><a href="https://github.com/apache/incubator-druid/wiki">Documentation</a></li>
</ul>
</div>
</div>
</div>
<!-- Stop page_header include -->

View File

@ -1,51 +0,0 @@
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="druid">
<title>Druid | {{page.title}}</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="//use.fontawesome.com/releases/v5.2.0/css/all.css" integrity="sha384-hWVjflwFxL6sNzntih27bfxkr27PmbbK/iSvJ+a4+0owXq79v+lsFkW54bOGbiDQ" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.0">
<link rel="stylesheet" href="/css/main.css?v=1.0">
<link rel="stylesheet" href="/css/header.css?v=1.0">
<link rel="stylesheet" href="/css/footer.css?v=1.0">
<link rel="stylesheet" href="/css/syntax.css?v=1.0">
<link rel="stylesheet" href="/css/docs.css?v=1.0">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>

View File

@ -1,50 +0,0 @@
<!DOCTYPE html>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<html lang="en">
<head>
{% include site_head.html %}
</head>
<body>
{% include page_header.html %}
<div class="druid-header">
<div class="container">
<h1>Documentation</h1>
<h4></h4>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-md-9 doc-content">
{{ content }}
</div>
<div class="col-md-3 toc" id="toc">
</div>
</div>
</div>
{% include page_footer.html %}
<script>$(function() { $(".toc").load("toc.html"); });</script>
</body>
</html>

View File

@ -1,178 +0,0 @@
[
{"source": "About-Experimental-Features.html", "target": "development/experimental.html"},
{"source": "Aggregations.html", "target": "querying/aggregations.html"},
{"source": "ApproxHisto.html", "target": "development/extensions-core/approximate-histograms.html"},
{"source": "Batch-ingestion.html", "target": "ingestion/batch-ingestion.html"},
{"source": "Booting-a-production-cluster.html", "target": "tutorials/cluster.html"},
{"source": "Broker-Config.html", "target": "configuration/index.html#broker"},
{"source": "Broker.html", "target": "design/broker.html"},
{"source": "Build-from-source.html", "target": "development/build.html"},
{"source": "Cassandra-Deep-Storage.html", "target": "dependencies/cassandra-deep-storage.html"},
{"source": "Cluster-setup.html", "target": "tutorials/cluster.html"},
{"source": "Concepts-and-Terminology.html", "target": "design/index.html"},
{"source": "Configuration.html", "target": "configuration/index.html"},
{"source": "Coordinator-Config.html", "target": "configuration/index.html#coordinator"},
{"source": "Coordinator.html", "target": "design/coordinator.html"},
{"source": "DataSource.html", "target": "querying/datasource.html"},
{"source": "DataSourceMetadataQuery.html", "target": "querying/datasourcemetadataquery.html"},
{"source": "Data_formats.html", "target": "ingestion/data-formats.html"},
{"source": "Deep-Storage.html", "target": "dependencies/deep-storage.html"},
{"source": "Design.html", "target": "design/index.html"},
{"source": "DimensionSpecs.html", "target": "querying/dimensionspecs.html"},
{"source": "Druid-vs-Cassandra.html", "target": "comparisons/druid-vs-key-value.html"},
{"source": "Druid-vs-Elasticsearch.html", "target": "comparisons/druid-vs-elasticsearch.html"},
{"source": "Druid-vs-Hadoop.html", "target": "comparisons/druid-vs-sql-on-hadoop.html"},
{"source": "Druid-vs-Impala-or-Shark.html", "target": "comparisons/druid-vs-sql-on-hadoop.html"},
{"source": "Druid-vs-Redshift.html", "target": "comparisons/druid-vs-redshift.html"},
{"source": "Druid-vs-Spark.html", "target": "comparisons/druid-vs-spark.html"},
{"source": "Druid-vs-Vertica.html", "target": "comparisons/druid-vs-redshift.html"},
{"source": "Evaluate.html", "target": "tutorials/cluster.html"},
{"source": "Examples.html", "target": "tutorials/index.html"},
{"source": "Filters.html", "target": "querying/filters.html"},
{"source": "Firehose.html", "target": "ingestion/firehose.html"},
{"source": "GeographicQueries.html", "target": "development/geo.html"},
{"source": "Granularities.html", "target": "querying/granularities.html"},
{"source": "GroupByQuery.html", "target": "querying/groupbyquery.html"},
{"source": "Hadoop-Configuration.html", "target": "ingestion/hadoop.html"},
{"source": "Having.html", "target": "querying/having.html"},
{"source": "Historical-Config.html", "target": "configuration/index.html#historical"},
{"source": "Historical.html", "target": "design/historical.html"},
{"source": "Including-Extensions.html", "target": "operations/including-extensions.html"},
{"source": "Indexing-Service-Config.html", "target": "configuration/index.html#overlord"},
{"source": "Indexing-Service.html", "target": "design/indexing-service.html"},
{"source": "Ingestion-FAQ.html", "target": "ingestion/faq.html"},
{"source": "Ingestion-overview.html", "target": "tutorials/index.html"},
{"source": "Ingestion.html", "target": "ingestion/index.html"},
{"source": "Integrating-Druid-With-Other-Technologies.html", "target": "development/integrating-druid-with-other-technologies.html"},
{"source": "Libraries.html", "target": "/libraries.html"},
{"source": "LimitSpec.html", "target": "querying/limitspec.html"},
{"source": "Logging.html", "target": "configuration/logging.html"},
{"source": "Metadata-storage.html", "target": "dependencies/metadata-storage.html"},
{"source": "Metrics.html", "target": "operations/metrics.html"},
{"source": "Middlemanager.html", "target": "design/middlemanager.html"},
{"source": "Modules.html", "target": "development/modules.html"},
{"source": "Other-Hadoop.html", "target": "operations/other-hadoop.html"},
{"source": "Papers-and-talks.html", "target": "misc/papers-and-talks.html"},
{"source": "Peons.html", "target": "design/peons.html"},
{"source": "Performance-FAQ.html", "target": "operations/basic-cluster-tuning.html"},
{"source": "Plumber.html", "target": "design/plumber.html"},
{"source": "Post-aggregations.html", "target": "querying/post-aggregations.html"},
{"source": "Query-Context.html", "target": "querying/query-context.html"},
{"source": "Querying.html", "target": "querying/querying.html"},
{"source": "Realtime-Config.html", "target": "ingestion/standalone-realtime.html"},
{"source": "Realtime.html", "target": "ingestion/standalone-realtime.html"},
{"source": "Realtime-ingestion.html", "target": "ingestion/stream-ingestion.html"},
{"source": "Recommendations.html", "target": "operations/recommendations.html"},
{"source": "Rolling-Updates.html", "target": "operations/rolling-updates.html"},
{"source": "Router.html", "target": "development/router.html"},
{"source": "Rule-Configuration.html", "target": "operations/rule-configuration.html"},
{"source": "SearchQuery.html", "target": "querying/searchquery.html"},
{"source": "SearchQuerySpec.html", "target": "querying/searchqueryspec.html"},
{"source": "SegmentMetadataQuery.html", "target": "querying/segmentmetadataquery.html"},
{"source": "Segments.html", "target": "design/segments.html"},
{"source": "SelectQuery.html", "target": "querying/select-query.html"},
{"source": "Simple-Cluster-Configuration.html", "target": "tutorials/cluster.html"},
{"source": "Tasks.html", "target": "ingestion/tasks.html"},
{"source": "TimeBoundaryQuery.html", "target": "querying/timeboundaryquery.html"},
{"source": "TimeseriesQuery.html", "target": "querying/timeseriesquery.html"},
{"source": "TopNMetricSpec.html", "target": "querying/topnmetricspec.html"},
{"source": "TopNQuery.html", "target": "querying/topnquery.html"},
{"source": "Tutorial:-A-First-Look-at-Druid.html", "target": "tutorials/index.html"},
{"source": "Tutorial:-All-About-Queries.html", "target": "tutorials/index.html"},
{"source": "Tutorial:-Loading-Batch-Data.html", "target": "tutorials/tutorial-batch.html"},
{"source": "Tutorial:-Loading-Streaming-Data.html", "target": "tutorials/tutorial-kafka.html"},
{"source": "Tutorial:-The-Druid-Cluster.html", "target": "tutorials/cluster.html"},
{"source": "Tutorials.html", "target": "tutorials/index.html"},
{"source": "Versioning.html", "target": "development/versioning.html"},
{"source": "ZooKeeper.html", "target": "dependencies/zookeeper.html"},
{"source": "alerts.html", "target": "operations/alerts.html"},
{"source": "comparisons/druid-vs-cassandra.html", "target": "druid-vs-key-value.html"},
{"source": "comparisons/druid-vs-hadoop.html", "target": "druid-vs-sql-on-hadoop.html"},
{"source": "comparisons/druid-vs-impala-or-shark.html", "target": "druid-vs-sql-on-hadoop.html"},
{"source": "comparisons/druid-vs-vertica.html", "target": "druid-vs-redshift.html"},
{"source": "configuration/auth.html", "target": "../design/auth.html"},
{"source": "configuration/broker.html", "target": "../configuration/index.html#broker"},
{"source": "configuration/caching.html", "target": "../configuration/index.html#cache-configuration"},
{"source": "configuration/coordinator.html", "target": "../configuration/index.html#coordinator"},
{"source": "configuration/historical.html", "target": "../configuration/index.html#historical"},
{"source": "configuration/indexing-service.html", "target": "../configuration/index.html#overlord"},
{"source": "configuration/simple-cluster.html", "target": "../tutorials/cluster.html"},
{"source": "design/concepts-and-terminology.html", "target": "index.html"},
{"source": "design/design.html", "target": "index.html"},
{"source": "development/approximate-histograms.html", "target": "extensions-core/approximate-histograms.html"},
{"source": "development/datasketches-aggregators.html", "target": "extensions-core/datasketches-extension.html"},
{"source": "development/extensions-core/datasketches-aggregators.html", "target": "datasketches-extension.html"},
{"source": "development/libraries.html", "target": "/libraries.html"},
{"source": "development/kafka-simple-consumer-firehose.html", "target": "extensions-contrib/kafka-simple.html"},
{"source": "development/select-query.html", "target": "../querying/select-query.html"},
{"source": "index.html", "target": "design/index.html"},
{"source": "ingestion/overview.html", "target": "index.html"},
{"source": "ingestion/ingestion.html", "target": "index.html"},
{"source": "ingestion/realtime-ingestion.html", "target": "stream-ingestion.html"},
{"source": "misc/cluster-setup.html", "target": "../tutorials/cluster.html"},
{"source": "misc/evaluate.html", "target": "../tutorials/cluster.html"},
{"source": "misc/tasks.html", "target": "../ingestion/tasks.html"},
{"source": "operations/multitenancy.html", "target": "../querying/multitenancy.html"},
{"source": "tutorials/booting-a-production-cluster.html", "target": "cluster.html"},
{"source": "tutorials/examples.html", "target": "index.html"},
{"source": "tutorials/firewall.html", "target": "cluster.html"},
{"source": "tutorials/quickstart.html", "target": "index.html"},
{"source": "tutorials/tutorial-a-first-look-at-druid.html", "target": "index.html"},
{"source": "tutorials/tutorial-all-about-queries.html", "target": "index.html"},
{"source": "tutorials/tutorial-loading-batch-data.html", "target": "tutorial-batch.html"},
{"source": "tutorials/tutorial-loading-streaming-data.html", "target": "tutorial-kafka.html"},
{"source": "tutorials/tutorial-the-druid-cluster.html", "target": "cluster.html"},
{"source": "development/extensions-core/caffeine-cache.html", "target":"../../configuration/index.html#cache-configuration"},
{"source": "Production-Cluster-Configuration.html", "target": "tutorials/cluster.html"},
{"source": "development/extensions-contrib/parquet.html", "target":"../../development/extensions-core/parquet.html"},
{"source": "development/extensions-contrib/scan-query.html", "target":"../../querying/scan-query.html"},
{"source": "tutorials/ingestion.html", "target": "index.html"},
{"source": "tutorials/ingestion-streams.html", "target": "index.html"},
{"source": "ingestion/native-batch.html", "target": "native_tasks.html"},
{"source": "Compute.html", "target": "design/processes.html"},
{"source": "Contribute.html", "target": "/community/"},
{"source": "Download.html", "target": "/downloads.html"},
{"source": "Druid-Personal-Demo-Cluster.html", "target": "tutorials/index.html"},
{"source": "Home.html", "target": "design/index.html"},
{"source": "Loading-Your-Data.html", "target": "ingestion/index.html"},
{"source": "Master.html", "target": "design/processes.html"},
{"source": "MySQL.html", "target": "development/extensions-core/mysql.html"},
{"source": "OrderBy.html", "target": "querying/limitspec.html"},
{"source": "Querying-your-data.html", "target": "querying/querying.html"},
{"source": "Spatial-Filters.html", "target": "development/geo.html"},
{"source": "Spatial-Indexing.html", "target": "development/geo.html"},
{"source": "Stand-Alone-With-Riak-CS.html", "target": "design/index.html"},
{"source": "Support.html", "target": "/community/"},
{"source": "Tutorial:-Webstream.html", "target": "tutorials/index.html"},
{"source": "Twitter-Tutorial.html", "target": "tutorials/index.html"},
{"source": "Tutorial:-Loading-Your-Data-Part-1.html", "target": "tutorials/index.html"},
{"source": "Tutorial:-Loading-Your-Data-Part-2.html", "target": "tutorials/index.html"},
{"source": "Kafka-Eight.html", "target": "development/extensions-core/kafka-eight-firehose.html"},
{"source": "Thanks.html", "target": "/community/"},
{"source": "Tutorial-A-First-Look-at-Druid.html", "target": "tutorials/index.html"},
{"source": "Tutorial-All-About-Queries.html", "target": "tutorials/index.html"},
{"source": "Tutorial-Loading-Batch-Data.html", "target": "tutorials/index.html"},
{"source": "Tutorial-Loading-Streaming-Data.html", "target": "tutorials/index.html"},
{"source": "Tutorial-The-Druid-Cluster.html", "target": "tutorials/index.html"},
{"source": "configuration/hadoop.html", "target": "../ingestion/hadoop.html"},
{"source": "configuration/production-cluster.html", "target": "../tutorials/cluster.html"},
{"source": "configuration/zookeeper.html", "target": "../dependencies/zookeeper.html"},
{"source": "querying/optimizations.html", "target": "multi-value-dimensions.html"},
{"source": "development/community-extensions/azure.html", "target": "../extensions-contrib/azure.html"},
{"source": "development/community-extensions/cassandra.html", "target": "../extensions-contrib/cassandra.html"},
{"source": "development/community-extensions/cloudfiles.html", "target": "../extensions-contrib/cloudfiles.html"},
{"source": "development/community-extensions/graphite.html", "target": "../extensions-contrib/graphite.html"},
{"source": "development/community-extensions/kafka-simple.html", "target": "../extensions-contrib/kafka-simple.html"},
{"source": "development/community-extensions/rabbitmq.html", "target": "../extensions-contrib/rabbitmq.html"},
{"source": "development/extensions-core/namespaced-lookup.html", "target": "lookups-cached-global.html"},
{"source": "operations/performance-faq.html", "target": "../operations/basic-cluster-tuning.html"},
{"source": "development/extensions-contrib/orc.html", "target": "../extensions-core/orc.html"},
{"source": "configuration/realtime.md", "target": "../ingestion/standalone-realtime.html"},
{"source": "design/realtime.md", "target": "../ingestion/standalone-realtime.html"},
{"source": "ingestion/stream-pull.md", "target": "../ingestion/standalone-realtime.html"},
{"source": "development/extensions-core/kafka-eight-firehose.md", "target": "../../ingestion/standalone-realtime.html"},
{"source": "development/extensions-contrib/kafka-simple.md", "target": "../../ingestion/standalone-realtime.html"},
{"source": "development/extensions-contrib/rabbitmq.md", "target": "../../ingestion/standalone-realtime.html"},
{"source": "development/extensions-contrib/rocketmq.md", "target": "../../ingestion/standalone-realtime.html"}
]

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) vs Elasticsearch"
id: druid-vs-elasticsearch
title: "Apache Druid vs Elasticsearch"
---
<!--
@ -22,19 +22,18 @@ title: "Apache Druid (incubating) vs Elasticsearch"
~ under the License.
-->
# Apache Druid (incubating) vs Elasticsearch
We are not experts on search systems. If anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents
and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out
Elasticsearch is a search system based on Apache Lucene. It provides full text search for schema-free documents
and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out
that the resource requirements for data ingestion and aggregation in Elasticsearch are much higher than those of Druid.
Elasticsearch also does not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
Elasticsearch also does not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
stored up to 100x with real-world data sets. This leads to Elasticsearch having greater storage requirements.
Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that
Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that
summarization/roll-up can be done.

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)"
id: druid-vs-key-value
title: "Apache Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)"
---
<!--
@ -22,26 +22,25 @@ title: "Apache Druid (incubating) vs. Key/Value Stores (HBase/Cassandra/OpenTSDB
~ under the License.
-->
# Apache Druid (incubating) vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
Druid is highly optimized for scans and aggregations, and it supports arbitrarily deep drill downs into data sets. This same functionality
Druid is highly optimized for scans and aggregations, and it supports arbitrarily deep drill downs into data sets. This same functionality
is supported in key/value stores in 2 ways:
1. Pre-compute all permutations of possible user queries
2. Range scans on event data
When pre-computing results, the key is the exact parameters of the query, and the value is the result of the query.
The queries return extremely quickly, but at the cost of flexibility, as ad-hoc exploratory queries are not possible with
pre-computing every possible query permutation. Pre-computing all permutations of all ad-hoc queries leads to result sets
that grow exponentially with the number of columns of a data set, and pre-computing queries for complex real-world data sets
When pre-computing results, the key is the exact parameters of the query, and the value is the result of the query.
The queries return extremely quickly, but at the cost of flexibility, as ad-hoc exploratory queries are not possible with
pre-computing every possible query permutation. Pre-computing all permutations of all ad-hoc queries leads to result sets
that grow exponentially with the number of columns of a data set, and pre-computing queries for complex real-world data sets
can require hours of pre-processing time.
The other approach to using key/value stores for aggregations is to use the dimensions of an event as the key and the event measures as the value.
Aggregations are done by issuing range scans on this data. Timeseries specific databases such as OpenTSDB use this approach.
One of the limitations here is that the key/value storage model does not have indexes for any kind of filtering other than prefix ranges,
which can be used to filter a query down to a metric and time range, but cannot resolve complex predicates to narrow the exact data to scan.
When the number of rows to scan gets large, this limitation can greatly reduce performance. It is also harder to achieve good
The other approach to using key/value stores for aggregations is to use the dimensions of an event as the key and the event measures as the value.
Aggregations are done by issuing range scans on this data. Timeseries specific databases such as OpenTSDB use this approach.
One of the limitations here is that the key/value storage model does not have indexes for any kind of filtering other than prefix ranges,
which can be used to filter a query down to a metric and time range, but cannot resolve complex predicates to narrow the exact data to scan.
When the number of rows to scan gets large, this limitation can greatly reduce performance. It is also harder to achieve good
locality with key/value stores because most don't support pushing down aggregates to the storage layer.
For arbitrary exploration of data (flexible data filtering), Druid's custom column format enables ad-hoc queries without pre-computation. The format
For arbitrary exploration of data (flexible data filtering), Druid's custom column format enables ad-hoc queries without pre-computation. The format
also enables fast scans on columns, which is important for good aggregation performance.

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) vs Kudu"
id: druid-vs-kudu
title: "Apache Druid vs Kudu"
---
<!--
@ -22,19 +22,18 @@ title: "Apache Druid (incubating) vs Kudu"
~ under the License.
-->
# Apache Druid (incubating) vs Apache Kudu
Kudu's storage format enables single row updates, whereas updating existing Druid segments requires recreating the segment, so theoretically
the process for updating old values should be higher latency in Druid. However, the requirements in Kudu for maintaining extra head space to store
updates as well as organizing data by id instead of time have the potential to introduce some extra latency and access
to data that is not needed to answer a query at query time.
Kudu's storage format enables single row updates, whereas updating existing Druid segments requires recreating the segment, so theoretically
the process for updating old values should be higher latency in Druid. However, the requirements in Kudu for maintaining extra head space to store
updates as well as organizing data by id instead of time have the potential to introduce some extra latency and access
to data that is not needed to answer a query at query time.
Druid summarizes/rolls up data at ingestion time, which in practice reduces the raw data that needs to be
stored significantly (up to 40 times on average), and increases performance of scanning raw data significantly.
Druid segments also contain bitmap indexes for fast filtering, which Kudu does not currently support.
Druid's segment architecture is heavily geared towards fast aggregates and filters, and for OLAP workflows. Appends are very
fast in Druid, whereas updates of older data have higher latency. This is by design, as the data Druid is good for is typically event data,
and does not need to be updated too frequently. Kudu supports arbitrary primary keys with uniqueness constraints, and
efficient lookup by ranges of those keys. Kudu chooses not to include the execution engine, but supports sufficient
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (eg MR, Spark, and SQL).
Druid summarizes/rolls up data at ingestion time, which in practice reduces the raw data that needs to be
stored significantly (up to 40 times on average), and increases performance of scanning raw data significantly.
Druid segments also contain bitmap indexes for fast filtering, which Kudu does not currently support.
Druid's segment architecture is heavily geared towards fast aggregates and filters, and for OLAP workflows. Appends are very
fast in Druid, whereas updates of older data have higher latency. This is by design, as the data Druid is good for is typically event data,
and does not need to be updated too frequently. Kudu supports arbitrary primary keys with uniqueness constraints, and
efficient lookup by ranges of those keys. Kudu chooses not to include the execution engine, but supports sufficient
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (eg MR, Spark, and SQL).
Druid includes its own query layer that allows it to push down aggregations and computations directly to data processes for faster query processing.

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) vs Redshift"
id: druid-vs-redshift
title: "Apache Druid vs Redshift"
---
<!--
@ -22,7 +22,6 @@ title: "Apache Druid (incubating) vs Redshift"
~ under the License.
-->
# Apache Druid (incubating) vs Redshift
### How does Druid compare to Redshift?

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) vs Spark"
id: druid-vs-spark
title: "Apache Druid vs Spark"
---
<!--
@ -22,20 +22,19 @@ title: "Apache Druid (incubating) vs Spark"
~ under the License.
-->
# Apache Druid (incubating) vs Apache Spark
Druid and Spark are complementary solutions as Druid can be used to accelerate OLAP queries in Spark.
Spark is a general cluster computing framework initially designed around the concept of Resilient Distributed Datasets (RDDs).
RDDs enable data reuse by persisting intermediate results
Spark is a general cluster computing framework initially designed around the concept of Resilient Distributed Datasets (RDDs).
RDDs enable data reuse by persisting intermediate results
in memory and enable Spark to provide fast computations for iterative algorithms.
This is especially beneficial for certain work flows such as machine
learning, where the same operation may be applied over and over
again until some result is converged upon. The generality of Spark makes it very suitable as an engine to process (clean or transform) data.
again until some result is converged upon. The generality of Spark makes it very suitable as an engine to process (clean or transform) data.
Although Spark provides the ability to query data through Spark SQL, much like Hadoop, the query latencies are not specifically targeted to be interactive (sub-second).
Druid's focus is on extremely low latency queries, and is ideal for powering applications used by thousands of users, and where each query must
return fast enough such that users can interactively explore through data. Druid fully indexes all data, and can act as a middle layer between Spark and your application.
Druid's focus is on extremely low latency queries, and is ideal for powering applications used by thousands of users, and where each query must
return fast enough such that users can interactively explore through data. Druid fully indexes all data, and can act as a middle layer between Spark and your application.
One typical setup seen in production is to process data in Spark, and load the processed data into Druid for faster access.
For more information about using Druid and Spark together, including benchmarks of the two systems, please see:

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) vs SQL-on-Hadoop"
id: druid-vs-sql-on-hadoop
title: "Apache Druid vs SQL-on-Hadoop"
---
<!--
@ -22,7 +22,6 @@ title: "Apache Druid (incubating) vs SQL-on-Hadoop"
~ under the License.
-->
# Apache Druid (incubating) vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
SQL-on-Hadoop engines provide an
execution engine for various data formats and data stores, and

View File

@ -1,6 +1,6 @@
---
layout: doc_page
title: "Apache Druid (incubating) Configuration Reference"
id: index
title: "Configuration reference"
---
<!--
@ -22,72 +22,9 @@ title: "Apache Druid (incubating) Configuration Reference"
~ under the License.
-->
# Apache Druid (incubating) Configuration Reference
This page documents all of the configuration properties for each Druid service type.
## Table of Contents
* [Recommended Configuration File Organization](#recommended-configuration-file-organization)
* [Common configurations](#common-configurations)
* [JVM Configuration Best Practices](#jvm-configuration-best-practices)
* [Extensions](#extensions)
* [Modules](#modules)
* [Zookeeper](#zookeeper)
* [Exhibitor](#exhibitor)
* [TLS](#tls)
* [Authentication & Authorization](#authentication-and-authorization)
* [Startup Logging](#startup-logging)
* [Request Logging](#request-logging)
* [Enabling Metrics](#enabling-metrics)
* [Emitting Metrics](#emitting-metrics)
* [Metadata Storage](#metadata-storage)
* [Deep Storage](#deep-storage)
* [Task Logging](#task-logging)
* [Overlord Discovery](#overlord-discovery)
* [Coordinator Discovery](#coordinator-discovery)
* [Announcing Segments](#announcing-segments)
* [JavaScript](#javascript)
* [Double Column Storage](#double-column-storage)
* [Master Server](#master-server)
* [Coordinator](#coordinator)
* [Static Configuration](#static-configuration)
* [Process Config](#coordinator-process-config)
* [Coordinator Operation](#coordinator-operation)
* [Segment Management](#segment-management)
* [Metadata Retrieval](#metadata-retrieval)
* [Dynamic Configuration](#dynamic-configuration)
* [Lookups](#lookups-dynamic-configuration)
* [Compaction](#compaction-dynamic-configuration)
* [Overlord](#overlord)
* [Static Configuration](#overlord-static-configuration)
* [Process Config](#overlord-process-config)
* [Overlord Operations](#overlord-operations)
* [Dynamic Configuration](#overlord-dynamic-configuration)
* [Worker Select Strategy](#worker-select-strategy)
* [Autoscaler](#autoscaler)
* [Data Server](#data-server)
* [MiddleManager & Peons](#middlemanager-and-peons)
* [Process Config](#middlemanager-process-config)
* [MiddleManager Configuration](#middlemanager-configuration)
* [Peon Processing](#peon-processing)
* [Peon Query Configuration](#peon-query-configuration)
* [Caching](#peon-caching)
* [Additional Peon Configuration](#additional-peon-configuration)
* [Historical](#historical)
* [Process Configuration](#historical-process-config)
* [General Configuration](#historical-general-configuration)
* [Query Configs](#historical-query-configs)
* [Caching](#historical-caching)
* [Query Server](#query-server)
* [Broker](#broker)
* [Process Config](#broker-process-configs)
* [Query Configuration](#broker-query-configuration)
* [SQL](#sql)
* [Caching](#broker-caching)
* [Segment Discovery](#segment-discovery)
* [Caching](#cache-configuration)
* [General Query Configuration](#general-query-configuration)
## Recommended Configuration File Organization
A recommended way of organizing Druid configuration files can be seen in the `conf` directory in the Druid package root, shown below:
@ -168,7 +105,7 @@ We recommend just setting the base ZK path and the ZK service host, but all ZK p
|`druid.zk.paths.base`|Base Zookeeper path.|`/druid`|
|`druid.zk.service.host`|The ZooKeeper hosts to connect to. This is a REQUIRED property and therefore a host address must be supplied.|none|
|`druid.zk.service.user`|The username to authenticate with ZooKeeper. This is an optional property.|none|
|`druid.zk.service.pwd`|The [Password Provider](../operations/password-provider.html) or the string password to authenticate with ZooKeeper. This is an optional property.|none|
|`druid.zk.service.pwd`|The [Password Provider](../operations/password-provider.md) or the string password to authenticate with ZooKeeper. This is an optional property.|none|
|`druid.zk.service.authScheme`|digest is the only authentication scheme supported. |digest|
|`druid.zk.service.terminateDruidProcessOnConnectFail`|If set to 'true' and the connection to ZooKeeper fails (after exhausting all potential backoff retries), the Druid process terminates itself with exit code 1.|false|
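As a rough illustration of the recommendation above (set only the base ZK path and the ZK service host), a runtime properties fragment might look like the following sketch. The host names are hypothetical placeholders, and `/druid` is simply the default listed in the table.

```properties
# Hypothetical ZooKeeper ensemble; replace with your own hosts.
druid.zk.service.host=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# Default base path from the table; set it explicitly only if you need a different one.
druid.zk.paths.base=/druid
```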
@ -254,14 +191,14 @@ values for the below mentioned configs among others provided by Java implementat
|`druid.server.https.keyStorePath`|The file path or URL of the TLS/SSL Key store.|none|yes|
|`druid.server.https.keyStoreType`|The type of the key store.|none|yes|
|`druid.server.https.certAlias`|Alias of TLS/SSL certificate for the connector.|none|yes|
|`druid.server.https.keyStorePassword`|The [Password Provider](../operations/password-provider.html) or String password for the Key Store.|none|yes|
|`druid.server.https.keyStorePassword`|The [Password Provider](../operations/password-provider.md) or String password for the Key Store.|none|yes|
The following table contains non-mandatory advanced configuration options; use them with caution.
|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|`druid.server.https.keyManagerFactoryAlgorithm`|Algorithm to use for creating KeyManager, more details [here](https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html#KeyManager).|`javax.net.ssl.KeyManagerFactory.getDefaultAlgorithm()`|no|
|`druid.server.https.keyManagerPassword`|The [Password Provider](../operations/password-provider.html) or String password for the Key Manager.|none|no|
|`druid.server.https.keyManagerPassword`|The [Password Provider](../operations/password-provider.md) or String password for the Key Manager.|none|no|
|`druid.server.https.includeCipherSuites`|List of cipher suite names to include. You can either use the exact cipher suite name or a regular expression.|Jetty's default include cipher list|no|
|`druid.server.https.excludeCipherSuites`|List of cipher suite names to exclude. You can either use the exact cipher suite name or a regular expression.|Jetty's default exclude cipher list|no|
|`druid.server.https.includeProtocols`|List of exact protocol names to include.|Jetty's default include protocol list|no|
@ -278,7 +215,7 @@ These properties apply to the SSLContext that will be provided to the internal H
|`druid.client.https.trustStoreType`|The type of the key store where trusted root certificates are stored.|`java.security.KeyStore.getDefaultType()`|no|
|`druid.client.https.trustStorePath`|The file path or URL of the TLS/SSL Key store where trusted root certificates are stored.|none|yes|
|`druid.client.https.trustStoreAlgorithm`|Algorithm to be used by TrustManager to validate certificate chains|`javax.net.ssl.TrustManagerFactory.getDefaultAlgorithm()`|no|
|`druid.client.https.trustStorePassword`|The [Password Provider](../operations/password-provider.html) or String password for the Trust Store.|none|yes|
|`druid.client.https.trustStorePassword`|The [Password Provider](../operations/password-provider.md) or String password for the Trust Store.|none|yes|
This [document](http://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html) lists all the possible
values for the above mentioned configs among others provided by Java implementation.
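A minimal sketch of the required client-side truststore settings from the table above. The path and password are hypothetical placeholders, `jks` is assumed here as the store type, and the password could instead be supplied through a Password Provider.

```properties
# Hypothetical truststore location and password; replace with your own.
druid.client.https.trustStorePath=/opt/druid/conf/tls/truststore.jks
druid.client.https.trustStorePassword=changeit
# Optional: defaults to java.security.KeyStore.getDefaultType() when omitted.
druid.client.https.trustStoreType=jks
```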
@ -293,7 +230,7 @@ values for the above mentioned configs among others provided by Java implementat
|`druid.auth.unsecuredPaths`| List of Strings|List of paths for which security checks will not be performed. All requests to these paths will be allowed.|[]|no|
|`druid.auth.allowUnauthenticatedHttpOptions`|Boolean|If true, skip authentication checks for HTTP OPTIONS requests. This is needed for certain use cases, such as supporting CORS pre-flight requests. Note that disabling authentication checks for OPTIONS requests will allow unauthenticated users to determine what Druid endpoints are valid (by checking if the OPTIONS request returns a 200 instead of 404), so enabling this option may reveal information about server configuration, including information about what extensions are loaded (if those extensions add endpoints).|false|no|
For more information, please see [Authentication and Authorization](../design/auth.html).
For more information, please see [Authentication and Authorization](../design/auth.md).
For configuration options for specific auth extensions, please refer to the extension documentation.
@ -316,7 +253,7 @@ All processes that can serve queries can also log the query requests they see. B
|--------|-----------|-------|
|`druid.request.logging.type`|Choices: noop, file, emitter, slf4j, filtered, composing, switching. How to log every query request.|[required to configure request logging]|
Note that you can enable sending all the HTTP requests to the log by setting "org.apache.druid.jetty.RequestLog" to DEBUG level. See [Logging](../configuration/logging.html).
Note that you can enable sending all the HTTP requests to the log by setting "org.apache.druid.jetty.RequestLog" to DEBUG level. See [Logging](../configuration/logging.md).
#### File Request Logging
@ -424,7 +361,7 @@ The following monitors are available:
### Emitting Metrics
The Druid servers [emit various metrics](../operations/metrics.html) and alerts via something we call an Emitter. There are three emitter implementations included with the code, a "noop" emitter (the default if none is specified), one that just logs to log4j ("logging"), and one that does POSTs of JSON events to a server ("http"). The properties for using the logging emitter are described below.
The Druid servers [emit various metrics](../operations/metrics.md) and alerts via something we call an Emitter. There are three emitter implementations included with the code, a "noop" emitter (the default if none is specified), one that just logs to log4j ("logging"), and one that does POSTs of JSON events to a server ("http"). The properties for using the logging emitter are described below.
|Property|Description|Default|
|--------|-----------|-------|
@ -453,7 +390,9 @@ The Druid servers [emit various metrics](../operations/metrics.html) and alerts
#### Http Emitter Module TLS Overrides
When emitting events to a TLS-enabled receiver, the Http Emitter will by default use an SSLContext obtained via the process described at [Druid's internal communication over TLS](../operations/tls-support.html#druids-internal-communication-over-tls), i.e., the same SSLContext that would be used for internal communications between Druid processes.
When emitting events to a TLS-enabled receiver, the Http Emitter will by default use an SSLContext obtained via the
process described at [Druid's internal communication over TLS](../operations/tls-support.html), i.e., the same
SSLContext that would be used for internal communications between Druid processes.
In some use cases it may be desirable to have the Http Emitter use its own separate truststore configuration. For example, there may be organizational policies that prevent the TLS-enabled metrics receiver's certificate from being added to the same truststore used by Druid's internal HTTP client.
@ -465,7 +404,7 @@ The following properties allow the Http Emitter to use its own truststore config
|`druid.emitter.http.ssl.trustStorePath`|The file path or URL of the TLS/SSL Key store where trusted root certificates are stored. If this is unspecified, the Http Emitter will use the same SSLContext as Druid's internal HTTP client, as described in the beginning of this section, and all other properties below are ignored.|null|
|`druid.emitter.http.ssl.trustStoreType`|The type of the key store where trusted root certificates are stored.|`java.security.KeyStore.getDefaultType()`|
|`druid.emitter.http.ssl.trustStoreAlgorithm`|Algorithm to be used by TrustManager to validate certificate chains|`javax.net.ssl.TrustManagerFactory.getDefaultAlgorithm()`|
|`druid.emitter.http.ssl.trustStorePassword`|The [Password Provider](../operations/password-provider.html) or String password for the Trust Store.|none|
|`druid.emitter.http.ssl.trustStorePassword`|The [Password Provider](../operations/password-provider.md) or String password for the Trust Store.|none|
|`druid.emitter.http.ssl.protocol`|TLS protocol to use.|"TLSv1.2"|
#### Parametrized Http Emitter Module
@ -488,22 +427,22 @@ The additional configs are:
#### Graphite Emitter
To use graphite as emitter set `druid.emitter=graphite`. For configuration details please follow this [link](../development/extensions-contrib/graphite.html).
To use graphite as emitter set `druid.emitter=graphite`. For configuration details please follow this [link](../development/extensions-contrib/graphite.md).
### Metadata Storage
### Metadata storage
These properties specify the jdbc connection and other configuration around the metadata storage. The only processes that connect to the metadata storage with these properties are the [Coordinator](../design/coordinator.html) and [Overlord](../design/overlord.html).
These properties specify the jdbc connection and other configuration around the metadata storage. The only processes that connect to the metadata storage with these properties are the [Coordinator](../design/coordinator.md) and [Overlord](../design/overlord.md).
|Property|Description|Default|
|--------|-----------|-------|
|`druid.metadata.storage.type`|The type of metadata storage to use. Choose from "mysql", "postgresql", or "derby".|derby|
|`druid.metadata.storage.connector.connectURI`|The jdbc uri for the database to connect to|none|
|`druid.metadata.storage.connector.user`|The username to connect with.|none|
|`druid.metadata.storage.connector.password`|The [Password Provider](../operations/password-provider.html) or String password used to connect with.|none|
|`druid.metadata.storage.connector.password`|The [Password Provider](../operations/password-provider.md) or String password used to connect with.|none|
|`druid.metadata.storage.connector.createTables`|If Druid requires a table and it doesn't exist, create it?|true|
|`druid.metadata.storage.tables.base`|The base name for tables.|druid|
|`druid.metadata.storage.tables.dataSource`|The table to use to look for dataSources which are created by [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html).|druid_dataSource|
|`druid.metadata.storage.tables.dataSource`|The table to use to look for dataSources which are created by [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md).|druid_dataSource|
|`druid.metadata.storage.tables.pendingSegments`|The table to use to look for pending segments.|druid_pendingSegments|
|`druid.metadata.storage.tables.segments`|The table to use to look for segments.|druid_segments|
|`druid.metadata.storage.tables.rules`|The table to use to look for segment load/drop rules.|druid_rules|
@ -514,9 +453,9 @@ These properties specify the jdbc connection and other configuration around the
|`druid.metadata.storage.tables.supervisors`|Used by the indexing service to store supervisor configurations.|druid_supervisors|
|`druid.metadata.storage.tables.audit`|The table to use for audit history of configuration changes e.g. Coordinator rules.|druid_audit|
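For illustration, a metadata storage configuration for MySQL might look like the sketch below. The host, database name, and credentials are hypothetical, and using MySQL is assumed to require loading the corresponding metadata storage extension.

```properties
druid.metadata.storage.type=mysql
# Hypothetical host, database, and credentials; replace with your own.
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=changeme
# Defaults from the table above, listed only for clarity.
druid.metadata.storage.connector.createTables=true
druid.metadata.storage.tables.base=druid
```

The password value may also be given as a Password Provider instead of a plain string, as noted in the table.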
### Deep Storage
### Deep storage
The configurations concern how to push and pull [Segments](../design/segments.html) from deep storage.
The configurations concern how to push and pull [Segments](../design/segments.md) from deep storage.
|Property|Description|Default|
|--------|-----------|-------|
@ -537,7 +476,7 @@ This deep storage doesn't do anything. There are no configs.
#### S3 Deep Storage
This deep storage is used to interface with Amazon's S3. Note that the `druid-s3-extensions` extension must be loaded.
The below table shows some important configurations for S3. See [S3 Deep Storage](../development/extensions-core/s3.html) for full configurations.
The below table shows some important configurations for S3. See [S3 Deep Storage](../development/extensions-core/s3.md) for full configurations.
|Property|Description|Default|
|--------|-----------|-------|
@ -635,7 +574,7 @@ Store task logs in HDFS. Note that the `druid-hdfs-storage` extension must be lo
### Overlord Discovery
This config is used to find the [Overlord](../design/overlord.html) using Curator service discovery. Only required if you are actually running an Overlord.
This config is used to find the [Overlord](../design/overlord.md) using Curator service discovery. Only required if you are actually running an Overlord.
|Property|Description|Default|
|--------|-----------|-------|
@ -644,7 +583,7 @@ This config is used to find the [Overlord](../design/overlord.html) using Curato
### Coordinator Discovery
This config is used to find the [Coordinator](../design/coordinator.html) using Curator service discovery. This config is used by the realtime indexing processes to get information about the segments loaded in the cluster.
This config is used to find the [Coordinator](../design/coordinator.md) using Curator service discovery. This config is used by the realtime indexing processes to get information about the segments loaded in the cluster.
|Property|Description|Default|
|--------|-----------|-------|
@ -675,13 +614,11 @@ the following properties.
|--------|-----------|-------|
|`druid.javascript.enabled`|Set to "true" to enable JavaScript functionality. This affects the JavaScript parser, filter, extractionFn, aggregator, post-aggregator, router strategy, and worker selection strategy.|false|
<div class="note info">
JavaScript-based functionality is disabled by default. Please refer to the Druid <a href="../development/javascript.html">JavaScript programming guide</a> for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
</div>
> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
### Double Column storage
Prior to version 0.13.0 Druid's storage layer used a 32-bit float representation to store columns created by the
Prior to version 0.13.0 Druid's storage layer used a 32-bit float representation to store columns created by the
doubleSum, doubleMin, and doubleMax aggregators at indexing time.
Starting from version 0.13.0 the default will be 64-bit floats for Double columns.
Using a 64-bit representation for double columns avoids precision loss at the cost of doubling the storage size of such columns.
@ -699,7 +636,7 @@ This section contains the configuration options for the processes that reside on
### Coordinator
For general Coordinator Process information, see [here](../design/coordinator.html).
For general Coordinator Process information, see [here](../design/coordinator.md).
#### Static Configuration
@ -712,7 +649,7 @@ These Coordinator static configurations can be defined in the `coordinator/runti
|`druid.host`|The host for the current process. This is used to advertise the current process's location as reachable from another process and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.bindOnHost`|Indicates whether the process's internal Jetty server binds to `druid.host`. Default is false, which means binding to all interfaces.|false|
|`druid.plaintextPort`|This is the port to actually listen on; unless port mapping is used, this will be the same port as is on `druid.host`|8081|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.html) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8281|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.md) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8281|
|`druid.service`|The name of the service. This is used as a dimension when emitting metrics and alerts to differentiate between the various services|druid/coordinator|
##### Coordinator Operation
@ -758,7 +695,7 @@ These Coordinator static configurations can be defined in the `coordinator/runti
#### Dynamic Configuration
The Coordinator has dynamic configuration to change certain behaviour on the fly. The Coordinator uses a JSON spec object from the Druid [metadata storage](../dependencies/metadata-storage.html) config table. This object is detailed below:
The Coordinator has dynamic configuration to change certain behaviour on the fly. The Coordinator uses a JSON spec object from the Druid [metadata storage](../dependencies/metadata-storage.md) config table. This object is detailed below:
It is recommended that you use the Coordinator Console to configure these parameters. However, if you need to do it via HTTP, the JSON object can be submitted to the Coordinator via a POST request at:
@ -796,7 +733,7 @@ Issuing a GET request at the same URL will return the spec that is currently in
|--------|-----------|-------|
|`millisToWaitBeforeDeleting`|How long the Coordinator needs to be active before it can start removing (marking unused) segments in metadata storage.|900000 (15 mins)|
|`mergeBytesLimit`|The maximum total uncompressed size in bytes of segments to merge.|524288000L|
|`mergeSegmentsLimit`|The maximum number of segments that can be in a single [append task](../ingestion/tasks.html).|100|
|`mergeSegmentsLimit`|The maximum number of segments that can be in a single [append task](../ingestion/tasks.md).|100|
|`maxSegmentsToMove`|The maximum number of segments that can be moved at any given time.|5|
|`replicantLifetime`|The maximum number of Coordinator runs for a segment to be replicated before we start alerting.|15|
|`replicationThrottleLimit`|The maximum number of segments that can be replicated at one time.|10|
@ -823,8 +760,8 @@ To view last <n> entries of the audit history of Coordinator dynamic config issu
http://<COORDINATOR_IP>:<PORT>/druid/coordinator/v1/config/history?count=<n>
```
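As a rough illustration of the audit-history endpoint above, the following Python sketch assembles the request URL for a given Coordinator address and entry count. The function name `config_history_url` and the example host are assumptions made for illustration; only the URL path and the `count` parameter come from the text above.

```python
from urllib.parse import urlencode


def config_history_url(coordinator_ip: str, port: int, count: int) -> str:
    """Hypothetical helper: build the URL that returns the last `count` audit entries."""
    if count <= 0:
        raise ValueError("count must be a positive integer")
    base = f"http://{coordinator_ip}:{port}/druid/coordinator/v1/config/history"
    # urlencode keeps the query string well-formed.
    return f"{base}?{urlencode({'count': count})}"


if __name__ == "__main__":
    # 10.0.0.5 is a made-up address; 8081 is the Coordinator's default plaintext port.
    print(config_history_url("10.0.0.5", 8081, 3))
```

Issuing a GET request against the printed URL should return the most recent entries, assuming the Coordinator is reachable at that address.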
##### Lookups Dynamic Configuration (EXPERIMENTAL)<a id="lookups-dynamic-configuration"></a>
These configuration options control the behavior of the Lookup dynamic configuration described in the [lookups page](../querying/lookups.html)
##### Lookups Dynamic Configuration
These configuration options control the behavior of the Lookup dynamic configuration described in the [lookups page](../querying/lookups.md)
|Property|Description|Default|
|--------|-----------|-------|
@ -840,21 +777,21 @@ These configuration options control the behavior of the Lookup dynamic configura
Compaction configurations can also be set or updated dynamically using
[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators.
For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.html).
For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.md).
A description of the compaction config is:
|Property|Description|Required|
|--------|-----------|--------|
|`dataSource`|dataSource name to be compacted.|yes|
|`taskPriority`|[Priority](../ingestion/tasks.html#task-priorities) of compaction task.|no (default = 25)|
|`taskPriority`|[Priority](../ingestion/tasks.html#priority) of compaction task.|no (default = 25)|
|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 12GB will result in compaction tasks taking an excessive amount of time.|no (default = 419430400)|
|`targetCompactionSizeBytes`|The target segment size, for each segment, after compaction. The actual sizes of compacted segments might be slightly larger or smaller than this value. Each compaction task may generate more than one output segment, and it will try to keep each output segment close to this configured size. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400)|
|`maxRowsPerSegment`|Max number of rows per segment after compaction. This configuration cannot be used together with `targetCompactionSizeBytes`.|no|
|`maxNumSegmentsToCompact`|Maximum number of segments to compact together per compaction task. Since a time chunk must be processed in its entirety, if a time chunk has a total number of segments greater than this parameter, compaction will not run for that time chunk.|no (default = 150)|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted. Strongly recommended to set for realtime dataSources. |no (default = "P1D")|
|`tuningConfig`|Tuning config for compaction tasks. See below [Compaction Task TuningConfig](#compact-task-tuningconfig).|no|
|`taskContext`|[Task context](../ingestion/tasks.html#task-context) for compaction tasks.|no|
|`tuningConfig`|Tuning config for compaction tasks. See below [Compaction Task TuningConfig](#compaction-tuningconfig).|no|
|`taskContext`|[Task context](../ingestion/tasks.html#context) for compaction tasks.|no|
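The constraints in the table above can be checked before a config is submitted. The following Python sketch is only an illustration: the helper name `build_compaction_config` and the `wikipedia` dataSource are assumptions, and the defaults are copied from the table rather than read from Druid itself.

```python
import json
from typing import Optional


def build_compaction_config(
    data_source: str,
    target_compaction_size_bytes: Optional[int] = None,
    max_rows_per_segment: Optional[int] = None,
    skip_offset_from_latest: str = "P1D",
    task_priority: int = 25,
) -> dict:
    """Hypothetical helper: assemble a compaction config dict from the documented properties."""
    if not data_source:
        raise ValueError("dataSource is required")
    # The table states these two properties cannot be used together.
    if target_compaction_size_bytes is not None and max_rows_per_segment is not None:
        raise ValueError(
            "targetCompactionSizeBytes cannot be used together with maxRowsPerSegment"
        )
    config = {
        "dataSource": data_source,
        "taskPriority": task_priority,
        "skipOffsetFromLatest": skip_offset_from_latest,
    }
    if target_compaction_size_bytes is not None:
        config["targetCompactionSizeBytes"] = target_compaction_size_bytes
    if max_rows_per_segment is not None:
        config["maxRowsPerSegment"] = max_rows_per_segment
    return config


if __name__ == "__main__":
    # "wikipedia" is a placeholder dataSource name.
    print(json.dumps(build_compaction_config("wikipedia", max_rows_per_segment=5000000), indent=2))
```

Validating on the client side like this simply surfaces the mutual-exclusion rule earlier; the Coordinator remains the authority on what it accepts.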
An example of compaction config is:
@ -875,16 +812,16 @@ If you see this problem, it's recommended to set `skipOffsetFromLatest` to some
|Property|Description|Required|
|--------|-----------|--------|
|`maxRowsInMemory`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 1000000)|
|`maxBytesInMemory`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (1/6 of max JVM memory)|
|`maxTotalRows`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 20000000)|
|`indexSpec`|See [IndexSpec](../ingestion/native_tasks.html#indexspec)|no|
|`maxPendingPersists`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 0 (meaning one persist can be running concurrently with ingestion, and none can be queued up))|
|`pushTimeout`|See [tuningConfig for indexTask](../ingestion/native_tasks.html#tuningconfig)|no (default = 0)|
|`maxRowsInMemory`|See [tuningConfig for indexTask](../ingestion/native-batch.md#tuningconfig)|no (default = 1000000)|
|`maxBytesInMemory`|See [tuningConfig for indexTask](../ingestion/native-batch.md#tuningconfig)|no (1/6 of max JVM memory)|
|`maxTotalRows`|See [tuningConfig for indexTask](../ingestion/native-batch.md#tuningconfig)|no (default = 20000000)|
|`indexSpec`|See [IndexSpec](../ingestion/index.md#indexspec)|no|
|`maxPendingPersists`|See [tuningConfig for indexTask](../ingestion/native-batch.md#tuningconfig)|no (default = 0 (meaning one persist can be running concurrently with ingestion, and none can be queued up))|
|`pushTimeout`|See [tuningConfig for indexTask](../ingestion/native-batch.md#tuningconfig)|no (default = 0)|
### Overlord
For general Overlord Process information, see [here](../design/overlord.html).
For general Overlord Process information, see [here](../design/overlord.md).
#### Overlord Static Configuration
@ -897,7 +834,7 @@ These Overlord static configurations can be defined in the `overlord/runtime.pro
|`druid.host`|The host for the current process. This is used to advertise the current process's location as reachable from another process and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.bindOnHost`|Indicates whether the process's internal Jetty server binds on `druid.host`. Default is false, which means binding to all interfaces.|false|
|`druid.plaintextPort`|This is the port to actually listen on; unless port mapping is used, this will be the same port as is on `druid.host`|8090|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.html) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8290|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.md) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8290|
|`druid.service`|The name of the service. This is used as a dimension when emitting metrics and alerts to differentiate between the various services|druid/overlord|
##### Overlord Operations
@ -907,7 +844,7 @@ These Overlord static configurations can be defined in the `overlord/runtime.pro
|`druid.indexer.runner.type`|Choices are "local" or "remote". Indicates whether tasks should be run locally or in a distributed environment. An experimental task runner, "httpRemote", is also available; it is the same as "remote" but uses HTTP to interact with Middle Managers instead of Zookeeper.|local|
|`druid.indexer.storage.type`|Choices are "local" or "metadata". Indicates whether incoming tasks should be stored locally (in heap) or in metadata storage. Storing incoming tasks in metadata storage allows for tasks to be resumed if the Overlord should fail.|local|
|`druid.indexer.storage.recentlyFinishedThreshold`|A duration of time to store task results.|PT24H|
|`druid.indexer.tasklock.forceTimeChunkLock`|_**Setting this to false is still experimental**_<br/> If set, all tasks are enforced to use time chunk lock. If not set, each task automatically chooses a lock type to use. This configuration can be overwritten by setting `forceTimeChunkLock` in the [task context](../ingestion/locking-and-priority.html#task-context). See [Task Locking & Priority](../ingestion/locking-and-priority.html) for more details about locking in tasks.|true|
|`druid.indexer.tasklock.forceTimeChunkLock`|_**Setting this to false is still experimental**_<br/> If set, all tasks are enforced to use time chunk lock. If not set, each task automatically chooses a lock type to use. This configuration can be overwritten by setting `forceTimeChunkLock` in the [task context](../ingestion/tasks.md#context). See [Task Locking & Priority](../ingestion/tasks.md#context) for more details about locking in tasks.|true|
|`druid.indexer.queue.maxSize`|Maximum number of active tasks at one time.|Integer.MAX_VALUE|
|`druid.indexer.queue.startDelay`|Sleep this long before starting Overlord queue management. This can be useful to give a cluster time to re-orient itself after e.g. a widespread network issue.|PT1M|
|`druid.indexer.queue.restartDelay`|Sleep this long when Overlord queue management throws an exception before trying again.|PT30S|
@ -1060,7 +997,9 @@ middleManagers up to capacity simultaneously, rather than a single middleManager
|`type`|`fillCapacity`.|required; must be `fillCapacity`|
|`affinityConfig`|[Affinity config](#affinity) object|null (no affinity)|
###### Javascript<a id="javascript-worker-select-strategy"></a>
<a name="javascript-worker-select-strategy"></a>
###### Javascript
Allows defining arbitrary logic for selecting workers to run tasks using a JavaScript function.
The function is passed remoteTaskRunnerConfig, a map of workerId to available workers, and the task to be executed. It returns the workerId on which the task should be run, or null if the task cannot be run.
@ -1082,9 +1021,7 @@ Example: a function that sends batch_index_task to workers 10.0.0.1 and 10.0.0.2
}
```
<div class="note info">
JavaScript-based functionality is disabled by default. Please refer to the Druid <a href="../development/javascript.html">JavaScript programming guide</a> for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
</div>
> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
###### Affinity
@ -1123,7 +1060,7 @@ These MiddleManager and Peon configurations can be defined in the `middleManager
|`druid.host`|The host for the current process. This is used to advertise the current process's location as reachable from another process and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.bindOnHost`|Indicates whether the process's internal Jetty server binds on `druid.host`. Default is false, which means binding to all interfaces.|false|
|`druid.plaintextPort`|This is the port to actually listen on; unless port mapping is used, this will be the same port as is on `druid.host`|8091|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.html) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8291|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.md) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8291|
|`druid.service`|The name of the service. This is used as a dimension when emitting metrics and alerts to differentiate between the various services|druid/middlemanager|
#### MiddleManager Configuration
@ -1166,7 +1103,7 @@ The amount of direct memory needed by Druid is at least
ensure at least this amount of direct memory is available by providing `-XX:MaxDirectMemorySize=<VALUE>` in
`druid.indexer.runner.javaOptsArray` as documented above.
#### Peon Query Configuration
#### Peon query configuration
See [general query configuration](#general-query-configuration).
@ -1228,7 +1165,7 @@ This type of medium is preferred, but it may require to allow the JVM to have mo
to the size of the segments being created. However, it doesn't make sense to add more extra off-heap memory
than the configured maximum *heap* size (`-Xmx`) for the same JVM.
For most types of tasks SegmentWriteOutMediumFactory could be configured per-task (see [Tasks](../ingestion/tasks.html)
For most types of tasks SegmentWriteOutMediumFactory could be configured per-task (see [Tasks](../ingestion/tasks.md)
page, "TuningConfig" section), but if it's not specified for a task, or it's not supported for a particular task type,
then the value from the configuration below is used:
@ -1238,7 +1175,7 @@ then the value from the configuration below is used:
### Historical
For general Historical Process information, see [here](../design/historical.html).
For general Historical Process information, see [here](../design/historical.md).
These Historical configurations can be defined in the `historical/runtime.properties` file.
@ -1248,7 +1185,7 @@ These Historical configurations can be defined in the `historical/runtime.proper
|`druid.host`|The host for the current process. This is used to advertise the current process's location as reachable from another process and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.bindOnHost`|Indicates whether the process's internal Jetty server binds on `druid.host`. Default is false, which means binding to all interfaces.|false|
|`druid.plaintextPort`|This is the port to actually listen on; unless port mapping is used, this will be the same port as is on `druid.host`|8083|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.html) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8283|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.md) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8283|
|`druid.service`|The name of the service. This is used as a dimension when emitting metrics and alerts to differentiate between the various services|druid/historical|
@ -1257,7 +1194,7 @@ These Historical configurations can be defined in the `historical/runtime.proper
|Property|Description|Default|
|--------|-----------|-------|
|`druid.server.maxSize`|The maximum number of bytes-worth of segments that the process wants assigned to it. This is not a limit that Historical processes actually enforce, just a value published to the Coordinator process so it can plan accordingly.|0|
|`druid.server.tier`| A string to name the distribution tier that the storage process belongs to. Many of the [rules Coordinator processes use](../operations/rule-configuration.html) to manage segments can be keyed on tiers. | `_default_tier` |
|`druid.server.tier`| A string to name the distribution tier that the storage process belongs to. Many of the [rules Coordinator processes use](../operations/rule-configuration.md) to manage segments can be keyed on tiers. | `_default_tier` |
|`druid.server.priority`|In a tiered architecture, the priority of the tier, thus allowing control over which processes are queried. Higher numbers mean higher priority. The default (no priority) works for architecture with no cross replication (tiers that have no data-storage overlap). Data centers typically have equal priority. | 0 |
#### Storing Segments
@ -1274,7 +1211,7 @@ These Historical configurations can be defined in the `historical/runtime.proper
In `druid.segmentCache.locations`, *freeSpacePercent* was added because the *maxSize* setting is only a theoretical limit and assumes that much space will always be available for storing segments. If a Druid bug leaves unaccounted segment files on disk, or some other process writes to the same disk, this check can start failing segment loading early, before the disk fills up completely, leaving the host otherwise usable.
#### Historical Query Configs
#### Historical query configs
##### Concurrent Requests
@ -1289,7 +1226,7 @@ Druid uses Jetty to serve HTTP requests.
|`druid.server.http.defaultQueryTimeout`|Query timeout in millis, beyond which unfinished queries will be cancelled|300000|
|`druid.server.http.gracefulShutdownTimeout`|The maximum amount of time Jetty waits after receiving a shutdown signal. After this timeout the threads will be forcefully shut down. This allows any queries that are executing to complete.|`PT0S` (do not wait)|
|`druid.server.http.unannouncePropagationDelay`|How long to wait for zookeeper unannouncements to propagate before shutting down Jetty. This is a minimum and `druid.server.http.gracefulShutdownTimeout` does not start counting down until after this period elapses.|`PT0S` (do not wait)|
|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.html) to know more about `timeout`. Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE|
|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.md) to know more about `timeout`. Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE|
|`druid.server.http.maxRequestHeaderSize`|Maximum size of a request header in bytes. Larger headers consume more memory and can make a server more vulnerable to denial of service attacks.|8 * 1024|
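The interaction between `druid.server.http.defaultQueryTimeout`, `druid.server.http.maxQueryTimeout`, and the query context `timeout` can be pictured with a small Python sketch. This is an illustration of the rule as documented in the table above, not Druid's actual implementation; the function name `effective_timeout_ms` is an assumption.

```python
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, the documented default for maxQueryTimeout


def effective_timeout_ms(
    query_context: dict,
    default_query_timeout_ms: int = 300000,
    max_query_timeout_ms: int = LONG_MAX,
) -> int:
    """Hypothetical helper: pick the timeout for a query, rejecting over-limit requests."""
    if "timeout" in query_context:
        requested = query_context["timeout"]
        # Documented rule: reject the query if the context timeout exceeds the maximum.
        if requested > max_query_timeout_ms:
            raise ValueError(
                f"timeout {requested} ms exceeds maxQueryTimeout {max_query_timeout_ms} ms"
            )
        return requested
    # No timeout in the context: fall back to the server-side default.
    return default_query_timeout_ms


if __name__ == "__main__":
    print(effective_timeout_ms({}))
    print(effective_timeout_ms({"timeout": 60000}, max_query_timeout_ms=120000))
```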
##### Processing
@ -1310,7 +1247,7 @@ The amount of direct memory needed by Druid is at least
ensure at least this amount of direct memory is available by providing `-XX:MaxDirectMemorySize=<VALUE>` at the command
line.
##### Historical Query Configuration
##### Historical query configuration
See [general query configuration](#general-query-configuration).
@ -1333,7 +1270,7 @@ This section contains the configuration options for the processes that reside on
### Broker
For general Broker process information, see [here](../design/broker.html).
For general Broker process information, see [here](../design/broker.md).
These Broker configurations can be defined in the `broker/runtime.properties` file.
@ -1344,12 +1281,12 @@ These Broker configurations can be defined in the `broker/runtime.properties` fi
|`druid.host`|The host for the current process. This is used to advertise the current process's location as reachable from another process and should generally be specified such that `http://${druid.host}/` could actually talk to this process|InetAddress.getLocalHost().getCanonicalHostName()|
|`druid.bindOnHost`|Indicates whether the process's internal Jetty server binds on `druid.host`. Default is false, which means binding to all interfaces.|false|
|`druid.plaintextPort`|This is the port to actually listen on; unless port mapping is used, this will be the same port as is on `druid.host`|8082|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.html) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8282|
|`druid.tlsPort`|TLS port for HTTPS connector, if [druid.enableTlsPort](../operations/tls-support.md) is set then this config will be used. If `druid.host` contains port then that port will be ignored. This should be a non-negative Integer.|8282|
|`druid.service`|The name of the service. This is used as a dimension when emitting metrics and alerts to differentiate between the various services|druid/broker|
#### Query Configuration
#### Query configuration
##### Query Prioritization
##### Query prioritization
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
@ -1371,7 +1308,7 @@ Druid uses Jetty to serve HTTP requests.
|`druid.server.http.maxScatterGatherBytes`|Maximum number of bytes gathered from data processes such as Historicals and realtime processes to execute a query. Queries that exceed this limit will fail. This is an advanced configuration that protects the Broker when it is under heavy load and is not processing the data gathered in memory fast enough, which can lead to OOMs. This limit can be further reduced at query time using `maxScatterGatherBytes` in the context. Note that a large limit is not necessarily bad if the Broker is never under heavy concurrent load, in which case the gathered data is processed quickly and the memory it used is freed.|Long.MAX_VALUE|
|`druid.server.http.gracefulShutdownTimeout`|The maximum amount of time Jetty waits after receiving a shutdown signal. After this timeout the threads will be forcefully shut down. This allows any queries that are executing to complete.|`PT0S` (do not wait)|
|`druid.server.http.unannouncePropagationDelay`|How long to wait for zookeeper unannouncements to propagate before shutting down Jetty. This is a minimum and `druid.server.http.gracefulShutdownTimeout` does not start counting down until after this period elapses.|`PT0S` (do not wait)|
|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.html) to know more about `timeout`. Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE|
|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.md) to know more about `timeout`. Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE|
|`druid.server.http.maxRequestHeaderSize`|Maximum size of a request header in bytes. Larger headers consume more memory and can make a server more vulnerable to denial of service attacks. |8 * 1024|
##### Client Configuration
@ -1385,7 +1322,7 @@ client has the following configuration options.
|`druid.broker.http.compressionCodec`|Compression codec the Broker uses to communicate with Historical and real-time processes. May be "gzip" or "identity".|gzip|
|`druid.broker.http.readTimeout`|The timeout for data reads from Historical servers and real-time tasks.|PT15M|
|`druid.broker.http.unusedConnectionTimeout`|The timeout for idle connections in the connection pool. This timeout should be less than `druid.broker.http.readTimeout`. Set this timeout to roughly 90% of `druid.broker.http.readTimeout`.|`PT4M`|
|`druid.broker.http.maxQueuedBytes`|Maximum number of bytes queued per query before exerting backpressure on the channel to the data server. Similar to `druid.server.http.maxScatterGatherBytes`, except unlike that configuration, this one will trigger backpressure rather than query failure. Zero means disabled. Can be overridden by the ["maxQueuedBytes" query context parameter](../querying/query-context.html).|0 (disabled)|
|`druid.broker.http.maxQueuedBytes`|Maximum number of bytes queued per query before exerting backpressure on the channel to the data server. Similar to `druid.server.http.maxScatterGatherBytes`, except unlike that configuration, this one will trigger backpressure rather than query failure. Zero means disabled. Can be overridden by the ["maxQueuedBytes" query context parameter](../querying/query-context.md).|0 (disabled)|
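The guidance for `druid.broker.http.unusedConnectionTimeout` (about 90% of `druid.broker.http.readTimeout`) can be turned into a quick calculation. The Python sketch below is an assumption-laden illustration: the helper name is made up, it takes the read timeout in seconds rather than parsing an ISO 8601 period, and it emits a seconds-only period string.

```python
def suggested_unused_connection_timeout(read_timeout_seconds: int, percent: int = 90) -> str:
    """Hypothetical helper: derive an idle-connection timeout from the read timeout."""
    if read_timeout_seconds <= 0:
        raise ValueError("read timeout must be positive")
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99 so the result stays below the read timeout")
    # Integer arithmetic avoids floating-point rounding surprises.
    seconds = read_timeout_seconds * percent // 100
    return f"PT{seconds}S"


if __name__ == "__main__":
    # The documented default readTimeout is PT15M, i.e. 900 seconds.
    print(suggested_unused_connection_timeout(900))
```

For the default read timeout of PT15M this yields a period of 810 seconds, which is below the read timeout as the table requires.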
##### Retry Policy
@ -1397,7 +1334,7 @@ Druid broker can optionally retry queries internally for transient errors.
##### Processing
The broker uses processing configs for nested groupBy queries. And, if you use groupBy v1, long-interval queries (of any type) can be broken into shorter interval queries and processed in parallel inside this thread pool. For more details, see "chunkPeriod" in [Query Context](../querying/query-context.html) doc.
The broker uses processing configs for nested groupBy queries. And, if you use groupBy v1, long-interval queries (of any type) can be broken into shorter interval queries and processed in parallel inside this thread pool. For more details, see "chunkPeriod" in the [query context](../querying/query-context.md) doc.
|Property|Description|Default|
|--------|-----------|-------|
@ -1415,7 +1352,7 @@ The amount of direct memory needed by Druid is at least
ensure at least this amount of direct memory is available by providing `-XX:MaxDirectMemorySize=<VALUE>` at the command
line.
##### Broker Query Configuration
##### Broker query configuration
See [general query configuration](#general-query-configuration).
@ -1435,11 +1372,11 @@ The Druid SQL server is configured through the following properties on the Broke
|`druid.sql.planner.awaitInitializationOnStart`|Boolean|Whether the Broker will wait for its SQL metadata view to fully initialize before starting up. If set to 'true', the Broker's HTTP server will not start up, and the Broker will not announce itself as available, until the server view is initialized. See also `druid.broker.segment.awaitInitializationOnStart`, a related setting.|true|
|`druid.sql.planner.maxQueryCount`|Maximum number of queries to issue, including nested queries. Set to 1 to disable sub-queries, or set to 0 for unlimited.|8|
|`druid.sql.planner.maxSemiJoinRowsInMemory`|Maximum number of rows to keep in memory for executing two-stage semi-join queries like `SELECT * FROM Employee WHERE DeptName IN (SELECT DeptName FROM Dept)`.|100000|
|`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.html). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.html) instead.|100000|
|`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
|`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
|`druid.sql.planner.selectThreshold`|Page size threshold for [Select queries](../querying/select-query.html). Select queries for larger resultsets will be issued back-to-back using pagination.|1000|
|`druid.sql.planner.selectThreshold`|Page size threshold for [Select queries](../querying/select-query.md). Select queries for larger resultsets will be issued back-to-back using pagination.|1000|
|`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.html) will be used instead.|true|
|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.md) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.md) will be used instead.|true|
|`druid.sql.planner.requireTimeCondition`|Whether to require SQL queries to have filter conditions on the __time column so that all generated native queries will have user-specified intervals. If true, all queries without a filter condition on the __time column will fail.|false|
|`druid.sql.planner.sqlTimeZone`|Sets the default time zone for the server, which will affect how time functions and timestamp literals behave. Should be a time zone name like "America/Los_Angeles" or offset like "-08:00".|UTC|
|`druid.sql.planner.serializeComplexValues`|Whether to serialize "complex" output values, false will return the class name instead of the serialized value.|true|
@ -1472,9 +1409,9 @@ See [cache configuration](#cache-configuration) for how to configure cache setti
## Cache Configuration
This section describes caching configuration that is common to Broker, Historical, and MiddleManager/Peon processes.
Caching can optionally be enabled on the Broker, Historical, and MiddleManager/Peon processes. See [Broker](#broker-caching),
[Historical](#Historical-caching), and [Peon](#peon-caching) configuration options for how to enable it for different processes.
[Historical](#historical-caching), and [Peon](#peon-caching) configuration options for how to enable it for different processes.
Druid uses a local in-memory cache by default, unless a different type of cache is specified.
Use the `druid.cache.type` configuration to set a different kind of cache.
@ -1491,9 +1428,7 @@ for both Broker and Historical processes, when defined in the common properties
#### Local Cache
<div class="note caution">
DEPRECATED: Use caffeine (default as of v0.12.0) instead
</div>
> DEPRECATED: Use caffeine (default as of v0.12.0) instead
The local cache is deprecated in favor of the Caffeine cache, and may be removed in a future version of Druid. The Caffeine cache affords significantly better performance and control over eviction behavior compared to `local` cache, and is recommended in any situation where you are using JRE 8u60 or higher.
@ -1572,34 +1507,34 @@ If there is an L1 miss and L2 hit, it will also populate L1.
|`druid.cache.useL2`|A boolean indicating whether to query the L2 cache if there is a miss in L1. It makes sense to set this to `false` on Historical processes if L2 is a remote cache like `memcached` and this cache is also used on Brokers. In that case, a query that reaches a Historical means a Broker did not find the corresponding results in the same remote cache, so a query to the remote cache from the Historical is guaranteed to be a miss.|`true`|
|`druid.cache.populateL2`|A boolean indicating whether to put results into L2 cache.|`true`|
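
The interaction between the two flags above can be illustrated with a minimal sketch. This is not Druid's implementation: the `HybridCache` class and its dict-backed levels are hypothetical, and writing to L1 on every `put` is an assumption made for the example. It only shows the lookup order and the rule that an L1 miss followed by an L2 hit also populates L1.

```python
class HybridCache:
    """Toy two-level cache; plain dicts stand in for the real L1 and L2 caches."""

    def __init__(self, use_l2=True, populate_l2=True):
        self.l1 = {}
        self.l2 = {}
        self.use_l2 = use_l2            # mirrors druid.cache.useL2
        self.populate_l2 = populate_l2  # mirrors druid.cache.populateL2

    def get(self, key):
        # Always try L1 first.
        if key in self.l1:
            return self.l1[key]
        # Only consult L2 when it is enabled for lookups.
        if self.use_l2 and key in self.l2:
            value = self.l2[key]
            self.l1[key] = value  # an L1 miss with an L2 hit also populates L1
            return value
        return None

    def put(self, key, value):
        # Assumption for this sketch: L1 is always populated.
        self.l1[key] = value
        if self.populate_l2:
            self.l2[key] = value
```

With `use_l2=False`, a lookup never touches L2, which matches the Historical-with-remote-cache scenario described in the table.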
## General Query Configuration
## General query configuration
This section describes configurations that control behavior of Druid's query types, applicable to Broker, Historical, and MiddleManager processes.
### TopN Query config
### TopN query config
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.topN.minTopNThreshold`|See [TopN Aliasing](../querying/topnquery.html#aliasing) for details.|1000|
### Search Query Config
### Search query config
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.search.maxSearchLimit`|Maximum number of search results to return.|1000|
|`druid.query.search.searchStrategy`|Default search query strategy.|useIndexes|
### Segment Metadata Query Config
### SegmentMetadata query config
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.segmentMetadata.defaultHistory`|When no interval is specified in the query, use a default interval of defaultHistory before the end time of the most recent segment, specified in ISO8601 format. This property also controls the duration of the default interval used by GET /druid/v2/datasources/{dataSourceName} interactions for retrieving datasource dimensions/metrics.|P1W|
|`druid.query.segmentMetadata.defaultAnalysisTypes`|Sets the default analysis types for all segment metadata queries. This can be overridden when making the query.|["cardinality", "interval", "minmax"]|
### GroupBy Query Config
### GroupBy query config
This section describes the configurations for groupBy queries. You can set the runtime properties in the `runtime.properties` file on Broker, Historical, and MiddleManager processes. You can set the query context parameters through the [query context](../querying/query-context.md).
This section describes the configurations for groupBy queries. You can set the runtime properties in the `runtime.properties` file on Broker, Historical, and MiddleManager processes. You can set the query context parameters through the [query context](../querying/query-context.html).
#### Configurations for groupBy v2
Supported runtime properties:
@ -1677,4 +1612,3 @@ Supported query contexts:
|`maxIntermediateRows`|Can be used to lower the value of `druid.query.groupBy.maxIntermediateRows` for this query.|None|
|`maxResults`|Can be used to lower the value of `druid.query.groupBy.maxResults` for this query.|None|
|`useOffheap`|Set to true to store aggregations off-heap when merging results.|false|

View File

@ -1,5 +1,5 @@
---
layout: doc_page
id: logging
title: "Logging"
---
@ -22,7 +22,6 @@ title: "Logging"
~ under the License.
-->
# Logging
Apache Druid (incubating) processes will emit logs that are useful for debugging to the console. Druid processes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.

View File

@ -1,62 +0,0 @@
---
layout: doc_page
title: "Cassandra Deep Storage"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Cassandra Deep Storage
## Introduction
Apache Druid (incubating) can use Apache Cassandra as a deep storage mechanism. Segments and their metadata are stored in Cassandra in two tables:
`index_storage` and `descriptor_storage`. Underneath the hood, the Cassandra integration leverages Astyanax. The
index storage table is a [Chunked Object](https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store) repository. It contains
compressed segments for distribution to Historical processes. Since segments can be large, the Chunked Object storage allows the integration to multi-thread
the write to Cassandra, and spreads the data across all the processes in a cluster. The descriptor storage table is a normal C* table that
stores the segment metadata.
## Schema
Below are the create statements for each:
```sql
CREATE TABLE index_storage(key text,
chunk text,
value blob,
PRIMARY KEY (key, chunk)) WITH COMPACT STORAGE;
CREATE TABLE descriptor_storage(key varchar,
lastModified timestamp,
descriptor varchar,
PRIMARY KEY (key)) WITH COMPACT STORAGE;
```
## Getting Started
First create the schema above. I use a new keyspace called `druid` for this purpose, which can be created using the
[Cassandra CQL `CREATE KEYSPACE`](http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_keyspace_r.html) command.
Then, add the following to your Historical and realtime runtime properties files to enable a Cassandra backend.
```properties
druid.extensions.loadList=["druid-cassandra-storage"]
druid.storage.type=c*
druid.storage.host=localhost:9160
druid.storage.keyspace=druid
```

View File

@ -1,212 +0,0 @@
---
layout: doc_page
title: "Apache Druid (incubating) Design"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# What is Druid?<a id="what-is-druid"></a>
Apache Druid (incubating) is a real-time analytics database designed for fast slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often
used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs
that need fast aggregations. Druid works best with event-oriented data.
Common application areas for Druid include:
- Clickstream analytics (web and mobile analytics)
- Network telemetry analytics (network performance monitoring)
- Server metrics storage
- Supply chain analytics (manufacturing metrics)
- Application performance metrics
- Digital marketing/advertising analytics
- Business intelligence / OLAP
Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems. Some of
Druid's key features are:
1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
needed for a particular query. This gives a huge speed boost to queries that only hit a few columns. In addition, each
column is stored optimized for its particular data type, which supports fast scans and aggregations.
2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can
offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
few seconds.
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
4. **Realtime or batch ingestion.** Druid can ingest data either real-time (ingested data is immediately available for
querying) or in batches.
5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or
remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
Druid servers fail, the system will automatically route around the damage until those servers can be replaced. Druid
is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
stored safely in [deep storage](#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be
recovered from deep storage even if every single Druid server fails. For more limited failures affecting just a few
Druid servers, replication ensures that queries are still possible while the system recovers.
7. **Indexes for quick filtering.** Druid uses [CONCISE](https://arxiv.org/pdf/1004.0403) or
[Roaring](https://roaringbitmap.org/) compressed bitmap indexes to create indexes that power fast filtering and
searching across multiple columns.
8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
This means time-based queries will only access the partitions that match the time range of the query. This leads to
significant performance improvements for time-based data.
9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also
offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
summarization partially pre-aggregates your data, and can lead to big costs savings and performance boosts.
# When should I use Druid?<a id="when-to-use-druid"></a>
Druid is likely a good choice if your use case fits a few of the following descriptors:
- Insert rates are very high, but updates are less common.
- Most of your queries are aggregation and reporting queries ("group by" queries). You may also have searching and
scanning queries.
- You are targeting query latencies of 100ms to a few seconds.
- Your data has a time component (Druid includes optimizations and design choices specifically related to time).
- You may have more than one table, but each query hits just one big distributed table. Queries may potentially hit more
than one smaller "lookup" table.
- You have high cardinality data columns (e.g. URLs, user IDs) and need fast counting and ranking over them.
- You want to load data from Kafka, HDFS, flat files, or object storage like Amazon S3.
Situations where you would likely _not_ want to use Druid include:
- You need low-latency updates of _existing_ records using a primary key. Druid supports streaming inserts, but not streaming updates (updates are done using
background batch jobs).
- You are building an offline reporting system where query latency is not very important.
- You want to do "big" joins (joining one big fact table to another big fact table) and you are okay with these queries
taking up to hours to complete.
# Architecture
Druid has a multi-process, distributed architecture that is designed to be cloud-friendly and easy to operate. Each
Druid process type can be configured and scaled independently, giving you maximum flexibility over your cluster. This
design also provides enhanced fault tolerance: an outage of one component will not immediately affect other components.
## Processes and Servers
Druid has several process types, briefly described below:
* [**Coordinator**](../design/coordinator.html) processes manage data availability on the cluster.
* [**Overlord**](../design/overlord.html) processes control the assignment of data ingestion workloads.
* [**Broker**](../design/broker.html) processes handle queries from external clients.
* [**Router**](../development/router.html) processes are optional processes that can route requests to Brokers, Coordinators, and Overlords.
* [**Historical**](../design/historical.html) processes store queryable data.
* [**MiddleManager**](../design/middlemanager.html) processes are responsible for ingesting data.
Druid processes can be deployed any way you like, but for ease of deployment we suggest organizing them into three server types: Master, Query, and Data.
* **Master**: Runs Coordinator and Overlord processes, manages data availability and ingestion.
* **Query**: Runs Broker and optional Router processes, handles queries from external clients.
* **Data**: Runs Historical and MiddleManager processes, executes ingestion workloads and stores all queryable data.
For more details on process and server organization, please see [Druid Processes and Servers](../design/processes.html).
### External dependencies
In addition to its built-in process types, Druid also has three external dependencies. These are intended to be able to
leverage existing infrastructure, where present.
#### Deep storage
Shared file storage accessible by every Druid server. This is typically going to
be a distributed object store like S3 or HDFS, or a network mounted filesystem. Druid uses this to store any data that
has been ingested into the system.
Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between
Druid processes. To respond to queries, Historical processes do not read from deep storage, but instead read pre-fetched
segments from their local disks before any queries are served. This means that Druid never needs to access deep storage
during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space
both in deep storage and across your Historical processes for the data you plan to load.
For more details, please see [Deep storage dependency](../dependencies/deep-storage.html).
#### Metadata storage
The metadata storage holds various shared system metadata such as segment availability information and task information. This is typically going to be a traditional RDBMS
like PostgreSQL or MySQL.
For more details, please see [Metadata storage dependency](../dependencies/metadata-storage.html).
#### Zookeeper
Used for internal service discovery, coordination, and leader election.
For more details, please see [Zookeeper dependency](../dependencies/zookeeper.html).
The idea behind this architecture is to make a Druid cluster simple to operate in production at scale. For example, the
separation of deep storage and the metadata store from the rest of the cluster means that Druid processes are radically
fault tolerant: even if every single Druid server fails, you can still relaunch your cluster from data stored in deep
storage and the metadata store.
### Architecture diagram
The following diagram shows how queries and data flow through this architecture, using the suggested Master/Query/Data server organization:
<img src="../../img/druid-architecture.png" width="800"/>
# Datasources and segments
Druid data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
["segments"](../design/segments.html). Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:
<img src="../../img/druid-timeline.png" width="800" />
A datasource may have anywhere from just a few segments, up to hundreds of thousands and even millions of segments. Each
segment starts life off being created on a MiddleManager, and at that point, is mutable and uncommitted. The segment
building process includes the following steps, designed to produce a data file that is compact and supports fast
queries:
- Conversion to columnar format
- Indexing with bitmap indexes
- Compression using various algorithms
- Dictionary encoding with id storage minimization for String columns
- Bitmap compression for bitmap indexes
- Type-aware compression for all columns
Periodically, segments are committed and published. At this point, they are written to [deep storage](#deep-storage),
become immutable, and move from MiddleManagers to the Historical processes (see [Architecture](#architecture) above
for details). An entry about the segment is also written to the [metadata store](#metadata-storage). This entry is a
self-describing bit of metadata about the segment, including things like the schema of the segment, its size, and its
location on deep storage. These entries are what the Coordinator uses to know what data *should* be available on the
cluster.
# Query processing
Queries first enter the [Broker](../design/broker.html), where the Broker will identify which segments have data that may pertain to that query.
The list of segments is always pruned by time, and may also be pruned by other attributes depending on how your
datasource is partitioned. The Broker will then identify which [Historicals](../design/historical.html) and
[MiddleManagers](../design/middlemanager.html) are serving those segments and send a rewritten subquery to each of those processes. The Historical/MiddleManager processes will take in the
queries, process them and return results. The Broker receives results and merges them together to get the final answer,
which it returns to the original caller.
Broker pruning is an important way that Druid limits the amount of data that must be scanned for each query, but it is
not the only way. For filters at a more granular level than what the Broker can use for pruning, indexing structures
inside each segment allow Druid to figure out which (if any) rows match the filter set before looking at any row of
data. Once Druid knows which rows match a particular query, it only accesses the specific columns it needs for that
query. Within those columns, Druid can skip from row to row, avoiding reading data that doesn't match the query filter.
So Druid uses three different techniques to maximize query performance:
- Pruning which segments are accessed for each query.
- Within each segment, using indexes to identify which rows must be accessed.
- Within each segment, only reading the specific rows and columns that are relevant to a particular query.
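
The first of these techniques, pruning segments by time, can be sketched as follows. This is a simplified illustration rather than the Broker's actual code: the segment dictionaries, the `prune_segments` helper, and the half-open interval convention are assumptions made for the example.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def prune_segments(segments, query_start, query_end):
    """Keep only the segments whose time range overlaps the query interval."""
    return [
        s for s in segments
        if overlaps(s["start"], s["end"], query_start, query_end)
    ]


# Hypothetical daily segments for one datasource.
segments = [
    {"id": "seg-2019-01-01", "start": datetime(2019, 1, 1), "end": datetime(2019, 1, 2)},
    {"id": "seg-2019-01-02", "start": datetime(2019, 1, 2), "end": datetime(2019, 1, 3)},
    {"id": "seg-2019-01-03", "start": datetime(2019, 1, 3), "end": datetime(2019, 1, 4)},
]

# A query over 2019-01-02 only needs the middle segment.
selected = prune_segments(segments, datetime(2019, 1, 2), datetime(2019, 1, 3))
```

Segments that fall outside the query interval are never sent to a Historical or MiddleManager, so the work they would have required is skipped entirely.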

View File

@ -1,65 +0,0 @@
---
layout: doc_page
title: "Indexing Service"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Indexing Service
The Apache Druid (incubating) indexing service is a highly-available, distributed service that runs indexing related tasks.
Indexing [tasks](../ingestion/tasks.html) create (and sometimes destroy) Druid [segments](../design/segments.html). The indexing service has a master/slave like architecture.
The indexing service is composed of three main components: a [Peon](../design/peons.html) component that can run a single task, a [Middle Manager](../design/middlemanager.html) component that manages Peons, and an [Overlord](../design/overlord.html) component that manages task distribution to MiddleManagers.
Overlords and MiddleManagers may run on the same process or across multiple processes while MiddleManagers and Peons always run on the same process.
Tasks are managed using API endpoints on the Overlord service. Please see [Overlord Task API](../operations/api-reference.html#overlord-tasks) for more information.
![Indexing Service](../../img/indexing_service.png "Indexing Service")
<!--
Preamble
--------
The truth is, the indexing service is an experience that is difficult to characterize with words. When they asked me to write this preamble, I was taken aback. I wasnt quite sure what exactly to write or how to describe this… entity. I accepted the job, as much for the challenge and inner growth as the money, and took to the mountains for reflection. Six months later, I knew I had it, I was done and had achieved the next euphoric victory in the continuous struggle that plagues my life. But, enough about me. This is about the indexing service.
The indexing service is philosophical transcendence, an infallible truth that will shape your soul, mold your character, and define your reality. The indexing service is creating world peace, playing with puppies, unwrapping presents on Christmas morning, cradling a loved one, and beating Goro in Mortal Kombat for the first time. The indexing service is sustainable economic growth, global propensity, and a world of transparent financial transactions. The indexing service is a true belieber. The indexing service is panicking because you forgot you signed up for a course and the big exam is in a few minutes, only to wake up and realize it was all a dream. What is the indexing service? More like what isnt the indexing service. The indexing service is here and it is ready, but are you?
-->
Overlord
--------------
See [Overlord](../design/overlord.html).
Middle Managers
---------------
See [Middle Manager](../design/middlemanager.html).
Peons
-----
See [Peon](../design/peons.html).
Tasks
-----
See [Tasks](../ingestion/tasks.html).

View File

@ -1,38 +0,0 @@
---
layout: doc_page
title: "Apache Druid (incubating) Plumbers"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Apache Druid (incubating) Plumbers
The plumber handles generated segments both while they are being generated and when they are "done". This is also technically a pluggable interface and there are multiple implementations. However, plumbers handle numerous complex details, and therefore an advanced understanding of Druid is recommended before implementing your own.
Available Plumbers
------------------
#### YeOldePlumber
This plumber creates single historical segments.
#### RealtimePlumber
This plumber creates real-time/mutable segments.

View File

@ -1,39 +0,0 @@
---
layout: doc_page
title: "Batch Data Ingestion"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Batch Data Ingestion
Apache Druid (incubating) can load data from static files through a variety of methods described here.
## Native Batch Ingestion
Druid has built-in batch ingestion functionality. See [here](../ingestion/native_tasks.html) for more info.
## Hadoop Batch Ingestion
Hadoop can be used for batch ingestion. The Hadoop-based batch ingestion will be faster and more scalable than the native batch ingestion. See [here](../ingestion/hadoop.html) for more details.
Having Problems?
----------------
Getting data into Druid can definitely be difficult for first time users. Please don't hesitate to ask questions in our IRC channel or on our [google groups page](https://groups.google.com/forum/#!forum/druid-user).

View File

@ -1,95 +0,0 @@
---
layout: doc_page
title: "Command Line Hadoop Indexer"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Command Line Hadoop Indexer
To run:
```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> org.apache.druid.cli.Main index hadoop <spec_file>
```
## Options
- "--coordinate" - provide a version of Apache Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Apache Druid (incubating) will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`.
- "--no-default-hadoop" - don't pull down the default hadoop version
## Spec file
The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task. See [Hadoop Batch Ingestion](../ingestion/hadoop.html) for details on the spec format.
In addition, the `metadataUpdateSpec` and `segmentOutputPath` fields need to be added to the ioConfig:
```
"ioConfig" : {
...
"metadataUpdateSpec" : {
"type":"mysql",
"connectURI" : "jdbc:mysql://localhost:3306/druid",
"password" : "diurd",
"segmentTable" : "druid_segments",
"user" : "druid"
},
"segmentOutputPath" : "/MyDirectory/data/index/output"
},
```
and a `workingPath` field needs to be added to the tuningConfig:
```
"tuningConfig" : {
...
"workingPath": "/tmp",
...
}
```
#### Metadata Update Job Spec
This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|"metadata" is the only value available.|yes|
|connectURI|String|A valid JDBC url to metadata storage.|yes|
|user|String|Username for db.|yes|
|password|String|password for db.|yes|
|segmentTable|String|Table to use in DB.|yes|
These properties should parrot what you have configured for your [Coordinator](../design/coordinator.html).
#### segmentOutputPath Config
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|segmentOutputPath|String|the path to dump segments into.|yes|
#### workingPath Config
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|workingPath|String|the working path to use for intermediate results (results between Hadoop jobs).|no (default == '/tmp/druid-indexing')|
Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it,
you must take care not to overwrite segments created by real-time processing (if you have a real-time pipeline set up).

View File

@ -1,90 +0,0 @@
---
layout: doc_page
title: "Compaction Task"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Compaction Task
Compaction tasks merge all segments of the given interval. The syntax is:
```json
{
"type": "compact",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval": <interval to specify segments to be merged>,
"dimensionsSpec": <custom dimensionsSpec>,
"segmentGranularity": <segment granularity after compaction>,
"targetCompactionSizeBytes": <target size of compacted segments>,
"tuningConfig": <index task tuningConfig>,
"context": <task context>
}
```
|Field|Description|Required|
|-----|-----------|--------|
|`type`|Task type. Should be `compact`|Yes|
|`id`|Task id|No|
|`dataSource`|DataSource name to be compacted|Yes|
|`interval`|Interval of segments to be compacted|Yes|
|`dimensionsSpec`|Custom dimensionsSpec. If specified, the compaction task uses this dimensionsSpec instead of generating one. See below for more details.|No|
|`metricsSpec`|Custom metricsSpec. Compaction task will use this metricsSpec if specified rather than generating one.|No|
|`segmentGranularity`|If this is set, compactionTask will change the segment granularity for the given interval. See [segmentGranularity of Uniform Granularity Spec](./ingestion-spec.html#uniform-granularity-spec) for more details. See the below table for the behavior.|No|
|`targetCompactionSizeBytes`|Target segment size after compaction. Cannot be used with `maxRowsPerSegment`, `maxTotalRows`, and `numShards` in tuningConfig.|No|
|`tuningConfig`|[Index task tuningConfig](../ingestion/native_tasks.html#tuningconfig)|No|
|`context`|[Task context](../ingestion/locking-and-priority.html#task-context)|No|
An example compaction task:
```json
{
"type" : "compact",
"dataSource" : "wikipedia",
"interval" : "2017-01-01/2018-01-01"
}
```
This compaction task reads _all segments_ of the interval `2017-01-01/2018-01-01` and results in new segments.
Since `segmentGranularity` is null, the original segment granularity is retained after compaction.
To control the number of result segments per time chunk, you can set [maxRowsPerSegment](../configuration/index.html#compaction-dynamic-configuration) or [numShards](../ingestion/native_tasks.html#tuningconfig).
Note that you can run multiple compaction tasks at the same time. For example, you can run 12 compaction tasks, one per month, instead of running a single task for the entire year.
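
As a rough illustration of splitting one year into monthly tasks, the following Python sketch generates one compaction spec per month. It uses only the `type`, `dataSource`, and `interval` fields described above. The helper name `monthly_compaction_specs` is hypothetical, and submitting the specs to Druid is left out.

```python
import json
from datetime import date


def monthly_compaction_specs(data_source, year):
    """Return 12 compaction task specs, one per calendar month of `year`."""
    specs = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # Use the first day of the following month as the interval end.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        specs.append({
            "type": "compact",
            "dataSource": data_source,
            "interval": f"{start.isoformat()}/{end.isoformat()}",
        })
    return specs


specs = monthly_compaction_specs("wikipedia", 2017)
print(json.dumps(specs[0], indent=2))
```

Taken together, the twelve intervals cover the same range as the single `2017-01-01/2018-01-01` task shown earlier.
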
A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters.
For example, its `firehose` is always the [ingestSegmentFirehose](./firehose.html#ingestsegmentfirehose), and `dimensionsSpec` and `metricsSpec`
include all dimensions and metrics of the input segments by default.
Compaction tasks will exit with a failure status code, without doing anything, if the interval you specify has no
data segments loaded in it (or if the interval you specify is empty).
The output segment can have different metadata from the input segments unless all input segments have the same metadata.
- Dimensions: since Apache Druid (incubating) supports schema change, the dimensions can be different across segments even if they are a part of the same dataSource.
If the input segments have different dimensions, the output segment basically includes all dimensions of the input segments.
However, even if the input segments have the same set of dimensions, the dimension order or the data type of dimensions can be different. For example, the data type of some dimensions can be
changed from `string` to primitive types, or the order of dimensions can be changed for better locality.
In this case, the dimensions of more recent segments take precedence over those of older segments in terms of data types and ordering.
This is because more recent segments are more likely to have the new desired order and data types. If you want to use
your own ordering and types, you can specify a custom `dimensionsSpec` in the compaction task spec.
- Roll-up: the output segment is rolled up only when `rollup` is set for all input segments.
See [Roll-up](../ingestion/index.html#rollup) for more details.
You can check whether your segments are rolled up by using [Segment Metadata Queries](../querying/segmentmetadataquery.html#analysistypes).

@ -1,51 +0,0 @@
---
layout: doc_page
title: "Deleting Data"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Deleting Data
Permanent deletion of a segment in Apache Druid (incubating) has two steps:
1. The segment must first be marked as "unused". This occurs when a segment is dropped by retention rules, or when a user manually disables a segment through the Coordinator API.
2. After segments have been marked as "unused", a Kill Task will delete any "unused" segments from Druid's metadata store as well as deep storage.
For documentation on retention rules, please see [Data Retention](../operations/rule-configuration.html).
For documentation on disabling segments using the Coordinator API, please see [Coordinator Delete API](../operations/api-reference.html#coordinator-delete).
A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.html).
## Kill Task
Kill tasks delete all information about a segment and remove it from deep storage. Segments to kill must be unused
(used==0) in the Druid segment table. The available grammar is:
```json
{
"type": "kill",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval" : <all_segments_in_this_interval_will_die!>,
"context": <task context>
}
```
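
A concrete kill task spec can be assembled from the grammar above. The Python sketch below is only an illustration: the helper name `kill_task_spec` is hypothetical, and it assumes that `id` and `context` may be left out, as they can be for the compaction task.

```python
import json


def kill_task_spec(data_source, interval, task_id=None, context=None):
    """Build a kill task spec; optional fields are omitted when not given."""
    spec = {"type": "kill", "dataSource": data_source, "interval": interval}
    if task_id is not None:
        spec["id"] = task_id
    if context is not None:
        spec["context"] = context
    return spec


# Every unused segment of "wikipedia" in this interval would be deleted.
spec = kill_task_spec("wikipedia", "2017-01-01/2017-02-01")
print(json.dumps(spec))
```
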

@ -1,279 +0,0 @@
---
layout: doc_page
title: "Apache Druid (incubating) Firehoses"
---
# Apache Druid (incubating) Firehoses
Firehoses are used in [native batch ingestion tasks](../ingestion/native_tasks.html) and stream push tasks automatically created by [Tranquility](../ingestion/stream-push.html).
They are pluggable, and thus the configuration schema can and will vary based on the `type` of the Firehose.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | Specifies the type of Firehose. Each value will have its own configuration schema. Firehoses packaged with Druid are described below. | yes |
## Additional Firehoses
There are several Firehoses readily available in Druid. Some are meant for examples, and others can be used directly in a production environment.
For additional Firehoses, please see our [extensions list](../development/extensions.html).
### LocalFirehose
This Firehose can be used to read data from files on local disk. It is mainly intended for proof-of-concept testing, and works with `string` typed parsers.
This Firehose is _splittable_ and can be used by [native parallel index tasks](./native_tasks.html#parallel-index-task).
Since each split represents a file in this Firehose, each worker task of `index_parallel` will read a file.
A sample local Firehose spec is shown below:
```json
{
"type": "local",
"filter" : "*.csv",
"baseDir": "/data/directory"
}
```
|property|description|required?|
|--------|-----------|---------|
|type|This should be "local".|yes|
|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) for more information.|yes|
|baseDir|Directory to search recursively for files to be ingested. |yes|
### HttpFirehose
This Firehose can be used to read the data from remote sites via HTTP, and works with `string` typed parsers.
This Firehose is _splittable_ and can be used by [native parallel index tasks](./native_tasks.html#parallel-index-task).
Since each split represents a file in this Firehose, each worker task of `index_parallel` will read a file.
A sample HTTP Firehose spec is shown below:
```json
{
"type": "http",
"uris": ["http://example.com/uri1", "http://example2.com/uri2"]
}
```
The following configurations can optionally be used if the URIs specified in the spec require a Basic Authentication Header.
Omitting these fields from your spec will result in HTTP requests with no Basic Authentication Header.
|property|description|default|
|--------|-----------|-------|
|httpAuthenticationUsername|Username to use for authentication with specified URIs|None|
|httpAuthenticationPassword|PasswordProvider to use with specified URIs|None|
Example with authentication fields using the DefaultPassword provider (this requires the password to be in the ingestion spec):
```json
{
"type": "http",
"uris": ["http://example.com/uri1", "http://example2.com/uri2"],
"httpAuthenticationUsername": "username",
"httpAuthenticationPassword": "password123"
}
```
You can also use the other existing Druid PasswordProviders. Here is an example using the EnvironmentVariablePasswordProvider:
```json
{
"type": "http",
"uris": ["http://example.com/uri1", "http://example2.com/uri2"],
"httpAuthenticationUsername": "username",
"httpAuthenticationPassword": {
"type": "environment",
"variable": "HTTP_FIREHOSE_PW"
}
}
```
The below configurations can optionally be used for tuning the Firehose performance.
|property|description|default|
|--------|-----------|-------|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|
|prefetchTriggerBytes|Threshold to trigger prefetching HTTP objects.|maxFetchCapacityBytes / 2|
|fetchTimeout|Timeout for fetching an HTTP object.|60000|
|maxFetchRetry|Maximum retries for fetching an HTTP object.|3|
### IngestSegmentFirehose
This Firehose can be used to read data from existing Druid segments, potentially using a new schema and changing the name, dimensions, metrics, rollup, etc. of the segment.
This Firehose is _splittable_ and can be used by [native parallel index tasks](./native_tasks.html#parallel-index-task).
This Firehose will accept any type of parser, but will only utilize the list of dimensions and the timestamp specification.
A sample ingest Firehose spec is shown below:
```json
{
"type": "ingestSegment",
"dataSource": "wikipedia",
"interval": "2013-01-01/2013-01-02"
}
```
|property|description|required?|
|--------|-----------|---------|
|type|This should be "ingestSegment".|yes|
|dataSource|A String defining the data source to fetch rows from, very similar to a table in a relational database|yes|
|interval|A String representing the ISO-8601 interval. This defines the time range to fetch the data over.|yes|
|dimensions|The list of dimensions to select. If left empty, no dimensions are returned. If left null or not defined, all dimensions are returned. |no|
|metrics|The list of metrics to select. If left empty, no metrics are returned. If left null or not defined, all metrics are selected.|no|
|filter| See [Filters](../querying/filters.html)|no|
|maxInputSegmentBytesPerTask|When used with the native parallel index task, the maximum number of bytes of input segments to process in a single task. If a single segment is larger than this number, it will be processed by itself in a single task (input segments are never split across tasks). Defaults to 150MB.|no|
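
To make the `maxInputSegmentBytesPerTask` rule concrete, the Python sketch below packs whole segments into tasks greedily, in input order. This is an assumption for illustration only: the text above states that segments are never split and that an oversized segment gets its own task, but it does not describe the actual grouping algorithm. The function name `group_segments` is hypothetical.

```python
def group_segments(segment_sizes, max_bytes_per_task):
    """Greedily pack whole segments into tasks without exceeding the limit.

    `segment_sizes` is a list of (segment_name, size_in_bytes) pairs.
    A segment larger than the limit ends up in a task of its own.
    """
    tasks, current, current_bytes = [], [], 0
    for name, size in segment_sizes:
        # Start a new task if adding this segment would exceed the limit.
        if current and current_bytes + size > max_bytes_per_task:
            tasks.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        tasks.append(current)
    return tasks


sizes = [("seg-a", 60), ("seg-b", 60), ("seg-c", 400), ("seg-d", 60)]
print(group_segments(sizes, 150))
```

Here `seg-c` is larger than the limit, so it is processed by itself, while the smaller segments are grouped wherever they fit.
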
### SqlFirehose
This Firehose can be used to ingest events residing in an RDBMS. The database connection information is provided as part of the ingestion spec.
For each query, the results are fetched locally and indexed.
If there are multiple queries from which data needs to be indexed, queries are prefetched in the background, up to `maxFetchCapacityBytes` bytes.
This Firehose will accept any type of parser, but will only utilize the list of dimensions and the timestamp specification. See the extension documentation for more detailed ingestion examples.
Requires one of the following extensions:
* [MySQL Metadata Store](../development/extensions-core/mysql.html).
* [PostgreSQL Metadata Store](../development/extensions-core/postgresql.html).
```json
{
"type": "sql",
"database": {
"type": "mysql",
"connectorConfig": {
"connectURI": "jdbc:mysql://host:port/schema",
"user": "user",
"password": "password"
}
},
"sqls": ["SELECT * FROM table1", "SELECT * FROM table2"]
}
```
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be "sql".||Yes|
|database|Specifies the database connection details.||Yes|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|No|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|No|
|prefetchTriggerBytes|Threshold to trigger prefetching SQL result objects.|maxFetchCapacityBytes / 2|No|
|fetchTimeout|Timeout for fetching the result set.|60000|No|
|foldCase|Toggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.|false|No|
|sqls|List of SQL queries where each SQL query would retrieve the data to be indexed.||Yes|
#### Database
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|The type of database to query. Valid values are `mysql` and `postgresql`.||Yes|
|connectorConfig|Specify the database connection properties via `connectURI`, `user` and `password`||Yes|
### InlineFirehose
This Firehose can be used to read the data inlined in its own spec.
It can be used for demos or for quickly testing out parsing and schema, and works with `string` typed parsers.
A sample inline Firehose spec is shown below:
```json
{
"type": "inline",
"data": "0,values,formatted\n1,as,CSV"
}
```
|property|description|required?|
|--------|-----------|---------|
|type|This should be "inline".|yes|
|data|Inlined data to ingest.|yes|
### CombiningFirehose
This Firehose can be used to combine and merge data from a list of different Firehoses.
```json
{
"type": "combining",
"delegates": [ { firehose1 }, { firehose2 }, ... ]
}
```
|property|description|required?|
|--------|-----------|---------|
|type|This should be "combining"|yes|
|delegates|List of Firehoses to combine data from|yes|
### Streaming Firehoses
The EventReceiverFirehose is used in tasks automatically generated by
[Tranquility stream push](../ingestion/stream-push.html). These Firehoses are not suitable for batch ingestion.
#### EventReceiverFirehose
This Firehose can be used to ingest events using an HTTP endpoint, and works with `string` typed parsers.
```json
{
"type": "receiver",
"serviceName": "eventReceiverServiceName",
"bufferSize": 10000
}
```
When using this Firehose, events can be sent by submitting a POST request to the HTTP endpoint:
`http://<peonHost>:<port>/druid/worker/v1/chat/<eventReceiverServiceName>/push-events/`
|property|description|required?|
|--------|-----------|---------|
|type|This should be "receiver"|yes|
|serviceName|Name used to announce the event receiver service endpoint|yes|
|maxIdleTime|A Firehose is automatically shut down after not receiving any events for this period of time, in milliseconds. If not specified, a Firehose is never shut down due to being idle. Zero and negative values have the same effect.|no|
|bufferSize|Size of buffer used by Firehose to store events|no, default is 100000|
Shut down time for EventReceiverFirehose can be specified by submitting a POST request to
`http://<peonHost>:<port>/druid/worker/v1/chat/<eventReceiverServiceName>/shutdown?shutoffTime=<shutoffTime>`
If `shutoffTime` is not specified, the Firehose shuts off immediately.
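
The following Python sketch shows how a push request for the endpoint above could be prepared with the standard library. It only builds the request object and does not send it. The host, port, and event contents are placeholders, and sending the events as a JSON array with a `Content-Type: application/json` header is an assumption, since the expected body format is not specified here.

```python
import json
import urllib.request


def push_events_request(peon_host, port, service_name, events):
    """Build (but do not send) a POST request for the push-events endpoint."""
    url = (
        f"http://{peon_host}:{port}/druid/worker/v1/chat/"
        f"{service_name}/push-events/"
    )
    # Assumption: events are posted as a JSON array in the request body.
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


events = [{"timestamp": "2015-08-25T01:00:00.000Z", "dim1": "qwerty"}]
req = push_events_request("localhost", 8100, "eventReceiverServiceName", events)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually submit the events to a running task.
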
#### TimedShutoffFirehose
This can be used to start a Firehose that will shut down at a specified time.
An example is shown below:
```json
{
"type": "timed",
"shutoffTime": "2015-08-25T01:26:05.119Z",
"delegate": {
"type": "receiver",
"serviceName": "eventReceiverServiceName",
"bufferSize": 100000
}
}
```
|property|description|required?|
|--------|-----------|---------|
|type|This should be "timed"|yes|
|shutoffTime|Time at which the Firehose should shut down, in ISO8601 format|yes|
|delegate|Firehose to use|yes|
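
As an illustration, the Python sketch below wraps a delegate Firehose spec in a `timed` spec whose `shutoffTime` is computed relative to a start time. The helper names are hypothetical. The timestamp layout follows the example above (UTC, millisecond precision).

```python
from datetime import datetime, timedelta, timezone


def iso8601_millis(moment):
    """Format an aware datetime as UTC ISO8601 with millisecond precision."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def timed_firehose(delegate, start, run_for):
    """Wrap `delegate` so that it shuts down `run_for` after `start`."""
    return {
        "type": "timed",
        "shutoffTime": iso8601_millis(start + run_for),
        "delegate": delegate,
    }


receiver = {
    "type": "receiver",
    "serviceName": "eventReceiverServiceName",
    "bufferSize": 100000,
}
start = datetime(2015, 8, 25, 0, 26, 5, 119000, tzinfo=timezone.utc)
spec = timed_firehose(receiver, start, timedelta(hours=1))
print(spec["shutoffTime"])
```
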

@ -1,180 +0,0 @@
---
layout: doc_page
title: "JSON Flatten Spec"
---
# JSON Flatten Spec
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| useFieldDiscovery | Boolean | If true, interpret all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns. | no (default == true) |
| fields | JSON Object array | Specifies the fields of interest and how they are accessed | no (default == []) |
Defining the JSON Flatten Spec allows nested JSON fields to be flattened during ingestion time. Only parseSpecs of types "json" or ["avro"](../development/extensions-core/avro.html) support flattening.
'fields' is a list of JSON Objects, describing the field names and how the fields are accessed:
## JSON Field Spec
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | Type of the field, "root", "path" or "jq". | yes |
| name | String | This string will be used as the column name when the data has been ingested. | yes |
| expr | String | Defines an expression for accessing the field within the JSON object, using [JsonPath](https://github.com/jayway/JsonPath) notation for type "path", and [jackson-jq](https://github.com/eiiches/jackson-jq) for type "jq". This field is only used for type "path" and "jq", otherwise ignored. | only for type "path" or "jq" |
Suppose the event JSON has the following form:
```json
{
"timestamp": "2015-09-12T12:10:53.155Z",
"dim1": "qwerty",
"dim2": "asdf",
"dim3": "zxcv",
"ignore_me": "ignore this",
"metrica": 9999,
"foo": {"bar": "abc"},
"foo.bar": "def",
"nestmet": {"val": 42},
"hello": [1.0, 2.0, 3.0, 4.0, 5.0],
"mixarray": [1.0, 2.0, 3.0, 4.0, {"last": 5}],
"world": [{"hey": "there"}, {"tree": "apple"}],
"thing": {"food": ["sandwich", "pizza"]}
}
```
The column "metrica" is a Long metric column, "hello" is an array of Double metrics, and "nestmet.val" is a nested Long metric. All other columns are dimensions.
To flatten this JSON, the parseSpec could be defined as follows:
```json
"parseSpec": {
"format": "json",
"flattenSpec": {
"useFieldDiscovery": true,
"fields": [
{
"type": "root",
"name": "dim1"
},
"dim2",
{
"type": "path",
"name": "foo.bar",
"expr": "$.foo.bar"
},
{
"type": "root",
"name": "foo.bar"
},
{
"type": "path",
"name": "path-metric",
"expr": "$.nestmet.val"
},
{
"type": "path",
"name": "hello-0",
"expr": "$.hello[0]"
},
{
"type": "path",
"name": "hello-4",
"expr": "$.hello[4]"
},
{
"type": "path",
"name": "world-hey",
"expr": "$.world[0].hey"
},
{
"type": "path",
"name": "worldtree",
"expr": "$.world[1].tree"
},
{
"type": "path",
"name": "first-food",
"expr": "$.thing.food[0]"
},
{
"type": "path",
"name": "second-food",
"expr": "$.thing.food[1]"
},
{
"type": "jq",
"name": "first-food-by-jq",
"expr": ".thing.food[1]"
},
{
"type": "jq",
"name": "hello-total",
"expr": ".hello | sum"
}
]
},
"dimensionsSpec" : {
"dimensions" : [],
"dimensionsExclusions": ["ignore_me"]
},
"timestampSpec" : {
"format" : "auto",
"column" : "timestamp"
}
}
```
Fields "dim3", "ignore_me", and "metrica" will be automatically discovered because 'useFieldDiscovery' is true, so they have been omitted from the field spec list.
"ignore_me" will be automatically discovered but excluded as specified by dimensionsExclusions.
Aggregators should use the metric column names as defined in the flattenSpec. Using the example above:
```json
"metricsSpec" : [
{
"type" : "longSum",
"name" : "path-metric-sum",
"fieldName" : "path-metric"
},
{
"type" : "doubleSum",
"name" : "hello-0-sum",
"fieldName" : "hello-0"
},
{
"type" : "longSum",
"name" : "metrica-sum",
"fieldName" : "metrica"
}
]
```
Note that:
* For convenience, when defining a root-level field, it is possible to define only the field name, as a string, shown with "dim2" above.
* Enabling 'useFieldDiscovery' will only autodetect fields at the root level with a single value (not a map or list), as well as fields referring to a list of single values. In the example above, "dim1", "dim2", "dim3", "ignore_me", "metrica", and "foo.bar" (at the root) would be automatically detected as columns. The "hello" field is a list of Doubles and will be autodiscovered, but note that the example ingests the individual list members as separate fields. The "world" field must be explicitly defined because its value is a map. The "mixarray" field, while similar to "hello", must also be explicitly defined because its last value is a map.
* Duplicate field definitions are not allowed; an exception will be thrown.
* If auto field discovery is enabled, any discovered field with the same name as one already defined in the field specs will be skipped and not added twice.
* The JSON input must be a JSON object at the root, not an array. e.g., {"valid": "true"} and {"valid":[1,2,3]} are supported but [{"invalid": "true"}] and [1,2,3] are not.
* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful for testing the path expressions.
* jackson-jq supports a subset of [./jq](https://stedolan.github.io/jq/) syntax. Please refer to the jackson-jq documentation.
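
The discovery and path rules above can be approximated in a few lines of Python. This is a simplified sketch, not Druid's implementation. `discover_root_fields` and `resolve_path` are hypothetical helpers, and the path resolver handles only plain `$.name` and `[index]` steps rather than full JsonPath. It also does not treat the timestamp column specially.

```python
import re

SCALARS = (str, int, float, bool, type(None))


def discover_root_fields(event):
    """Return root-level keys whose values are scalars or flat lists of scalars."""
    found = []
    for key, value in event.items():
        if isinstance(value, SCALARS):
            found.append(key)
        elif isinstance(value, list) and all(isinstance(v, SCALARS) for v in value):
            found.append(key)
    return found


# Matches either a `.name` step or an `[index]` step.
STEP = re.compile(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]")


def resolve_path(event, expr):
    """Resolve a simple `$.a.b[0]` style expression; return None if missing."""
    if not expr.startswith("$"):
        raise ValueError("expression must start with '$'")
    rest, pos, current = expr[1:], 0, event
    while pos < len(rest):
        match = STEP.match(rest, pos)
        if match is None:
            raise ValueError(f"unsupported expression: {expr}")
        name, index = match.groups()
        if name is not None:
            if not isinstance(current, dict) or name not in current:
                return None
            current = current[name]
        else:
            if not isinstance(current, list) or int(index) >= len(current):
                return None
            current = current[int(index)]
        pos = match.end()
    return current


event = {
    "timestamp": "2015-09-12T12:10:53.155Z",
    "dim1": "qwerty",
    "foo": {"bar": "abc"},
    "foo.bar": "def",
    "hello": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mixarray": [1.0, 2.0, 3.0, 4.0, {"last": 5}],
    "world": [{"hey": "there"}, {"tree": "apple"}],
}

print(discover_root_fields(event))
print(resolve_path(event, "$.foo.bar"), resolve_path(event, "$.world[1].tree"))
```

Note how the root-level key `"foo.bar"` and the nested path `$.foo.bar` refer to different values, which is why the example parseSpec defines both a "root" and a "path" field for them.
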

@ -1,43 +0,0 @@
---
layout: doc_page
title: "Hadoop-based Batch Ingestion VS Native Batch Ingestion"
---
# Comparison of Batch Ingestion Methods
Apache Druid (incubating) supports three types of batch ingestion: Apache Hadoop-based
batch ingestion, native parallel batch ingestion, and native local batch
ingestion. The table below shows which features are supported by each
ingestion method.
| |Hadoop-based ingestion|Native parallel ingestion|Native local ingestion|
|---|----------------------|-------------------------|----------------------|
| Parallel indexing | Always parallel | Parallel if firehose is splittable <br/> & maxNumConcurrentSubTasks > 1 in tuningConfig | Always sequential |
| Supported indexing modes | Overwriting mode | Both appending and overwriting modes | Both appending and overwriting modes |
| External dependency | Hadoop (it internally submits Hadoop jobs) | No dependency | No dependency |
| Supported [rollup modes](./index.html#roll-up-modes) | Perfect rollup | Both perfect and best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | [Both Hash-based and range partitioning](./hadoop.html#partitioning-specification) | Hash-based partitioning (when `forceGuaranteedRollup` = true) | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented [firehoses](./firehose.html) | All implemented [firehoses](./firehose.html) |
| Supported file formats | All implemented Hadoop InputFormats | Currently text file formats (CSV, TSV, JSON) by default. Additional formats can be added through a [custom extension](../development/modules.html) implementing [`FiniteFirehoseFactory`](https://github.com/apache/incubator-druid/blob/master/core/src/main/java/org/apache/druid/data/input/FiniteFirehoseFactory.java) | Currently text file formats (CSV, TSV, JSON) by default. Additional formats can be added through a [custom extension](../development/modules.html) implementing [`FiniteFirehoseFactory`](https://github.com/apache/incubator-druid/blob/master/core/src/main/java/org/apache/druid/data/input/FiniteFirehoseFactory.java) |
| Saving parse exceptions in ingestion report | Currently not supported | Currently not supported | Supported |
| Custom segment version | Supported, but this is NOT recommended | N/A | N/A |

@ -1,306 +0,0 @@
---
layout: doc_page
title: "Ingestion"
---
# Ingestion
## Overview
<a name="datasources" />
### Datasources and segments
Apache Druid (incubating) data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
["segments"](../design/segments.html). Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:
<img src="../../img/druid-timeline.png" width="800" />
A datasource may have anywhere from just a few segments, up to hundreds of thousands and even millions of segments. Each
segment starts life off being created on a MiddleManager, and at that point, is mutable and uncommitted. The segment
building process includes the following steps, designed to produce a data file that is compact and supports fast
queries:
- Conversion to columnar format
- Indexing with bitmap indexes
- Compression using various algorithms
- Dictionary encoding with id storage minimization for String columns
- Bitmap compression for bitmap indexes
- Type-aware compression for all columns
Periodically, segments are published (committed). At this point, they are written to deep storage, become immutable, and
move from MiddleManagers to the Historical processes. An entry about the segment is also written to the metadata store.
This entry is a self-describing bit of metadata about the segment, including things like the schema of the segment, its
size, and its location on deep storage. These entries are what the Coordinator uses to know what data *should* be
available on the cluster.
For details on the segment file format, please see [segment files](../design/segments.html).
For details on modeling your data in Druid, see [schema design](schema-design.html).
#### Segment identifiers
Segments all have a four-part identifier with the following components:
- Datasource name.
- Time interval (for the time chunk containing the segment; this corresponds to the `segmentGranularity` specified
at ingestion time).
- Version number (generally an ISO8601 timestamp corresponding to when the segment set was first started).
- Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous).
For example, this is the identifier for a segment in datasource `clarity-cloud0`, time chunk
`2018-05-21T16:00:00.000Z/2018-05-21T17:00:00.000Z`, version `2018-05-21T15:56:09.909Z`, and partition number 1:
```
clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z_1
```
Segments with partition number 0 (the first partition in a chunk) omit the partition number, like the following
example, which is a segment in the same time chunk as the previous one, but with partition number 0 instead of 1:
```
clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z
```
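
The naming rule can be written down as a small Python function. This is a sketch based only on the two examples above, and the function name `segment_id` is hypothetical.

```python
def segment_id(data_source, interval_start, interval_end, version, partition_num=0):
    """Join the identifier parts with underscores; partition 0 is omitted."""
    parts = [data_source, interval_start, interval_end, version]
    if partition_num != 0:
        parts.append(str(partition_num))
    return "_".join(parts)


print(segment_id(
    "clarity-cloud0",
    "2018-05-21T16:00:00.000Z",
    "2018-05-21T17:00:00.000Z",
    "2018-05-21T15:56:09.909Z",
    partition_num=1,
))
```
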
#### Segment versioning
You may be wondering what the "version number" described in the previous section is for. Or, you might not be, in which
case good for you and you can skip this section!
It's there to support batch-mode overwriting. In Druid, if all you ever do is append data, then there will be just a
single version for each time chunk. But when you overwrite data, what happens behind the scenes is that a new set of
segments is created with the same datasource, same time interval, but a higher version number. This is a signal to the
rest of the Druid system that the older version should be removed from the cluster, and the new version should replace
it.
The switch appears to happen instantaneously to a user, because Druid handles this by first loading the new data (but
not allowing it to be queried), and then, as soon as the new data is all loaded, switching all new queries to use those
new segments. Then it drops the old segments a few minutes later.
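
The overwrite rule can be sketched in Python as "for each time chunk, only the highest version is visible". This is a simplification: it assumes versions are ISO8601 timestamps in one consistent format, so that string comparison matches chronological order, and it ignores partially overlapping intervals. The function name `visible_segments` is hypothetical.

```python
def visible_segments(segments):
    """Keep, per (datasource, interval), only segments with the highest version."""
    newest = {}
    for seg in segments:
        key = (seg["dataSource"], seg["interval"])
        if key not in newest or seg["version"] > newest[key]:
            newest[key] = seg["version"]
    return [
        seg for seg in segments
        if seg["version"] == newest[(seg["dataSource"], seg["interval"])]
    ]


chunk = "2018-05-21T16:00:00.000Z/2018-05-21T17:00:00.000Z"
segments = [
    {"dataSource": "wiki", "interval": chunk, "version": "2018-05-21T15:56:09.909Z", "partition": 0},
    {"dataSource": "wiki", "interval": chunk, "version": "2018-06-01T00:00:00.000Z", "partition": 0},
    {"dataSource": "wiki", "interval": chunk, "version": "2018-06-01T00:00:00.000Z", "partition": 1},
]
print([(s["version"], s["partition"]) for s in visible_segments(segments)])
```
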
#### Segment states
Segments can be either _available_ or _unavailable_, which refers to whether or not they are currently served by some
Druid server process. They can also be _published_ or _unpublished_, which refers to whether or not they have been
written to deep storage and the metadata store. And published segments can be either _used_ or _unused_, which refers to
whether or not Druid considers them active segments that should be served.
Putting these together, there are five basic states that a segment can be in:
- **Published, available, and used:** These segments are published in deep storage and the metadata store, and they are
served by Historical processes. They are the majority of active data in a Druid cluster (they include everything except
in-flight realtime data).
- **Published, available, and unused:** These segments are being served by Historicals, but won't be for very long. They
may be segments that have recently been overwritten (see [Segment versioning](#segment-versioning)) or dropped for
other reasons (like drop rules, or being dropped manually).
- **Published, unavailable, and used:** These segments are published in deep storage and the metadata store, and
_should_ be served, but are not actually being served. If segments stay in this state for more than a few minutes, it's
usually because something is wrong. Some of the more common causes include: failure of a large number of Historicals,
Historicals being out of capacity to download more segments, and some issue with coordination that prevents the
Coordinator from telling Historicals to load new segments.
- **Published, unavailable, and unused:** These segments are published in deep storage and the metadata store, but
are inactive (because they have been overwritten or dropped). They lie dormant, and can potentially be resurrected
by manual action if needed (in particular: setting the "used" flag to true).
- **Unpublished and available:** This is the state that segments are in while they are being built by Druid ingestion
tasks. This includes all "realtime" data that has not been handed off to Historicals yet. Segments in this state may or
may not be replicated. If all replicas are lost, then the segment must be rebuilt from scratch. This may or may not be
possible. (It is possible with Kafka, and happens automatically; it is possible with S3/HDFS by restarting the job; and
it is _not_ possible with Tranquility, so in that case, data will be lost.)
The sixth state in this matrix, "unpublished and unavailable," isn't possible. If a segment isn't published and isn't
being served then does it really exist?
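The five states above can be sketched as a small classifier over the three flags. This is only an illustration: `SegmentFlags` and `describe_state` are hypothetical names, not part of Druid.

```python
from typing import NamedTuple


class SegmentFlags(NamedTuple):
    published: bool  # a record exists in the metadata store
    available: bool  # some Druid process is currently serving the segment
    used: bool       # the "used" flag; only meaningful for published segments


def describe_state(flags: SegmentFlags) -> str:
    """Map the three flags onto the five possible segment states."""
    if not flags.published:
        if not flags.available:
            # The sixth combination: nothing is stored and nothing is served.
            raise ValueError("unpublished and unavailable is not a possible state")
        return "unpublished and available"
    availability = "available" if flags.available else "unavailable"
    usage = "used" if flags.used else "unused"
    return f"published, {availability}, and {usage}"


# A segment that should be served but is not yet loaded by any Historical.
print(describe_state(SegmentFlags(published=True, available=False, used=True)))
```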
#### Indexing and handoff
_Indexing_ is the mechanism by which new segments are created, and _handoff_ is the mechanism by which they are published
and begin being served by Historical processes. The mechanism works like this on the indexing side:
1. An _indexing task_ starts running and building a new segment. It must determine the identifier of the segment before
it starts building it. For a task that is appending (like a Kafka task, or an index task in append mode) this will be
done by calling an "allocate" API on the Overlord to potentially add a new partition to an existing set of segments. For
a task that is overwriting (like a Hadoop task, or an index task _not_ in append mode) this is done by locking an
interval and creating a new version number and new set of segments.
2. If the indexing task is a realtime task (like a Kafka task) then the segment is immediately queryable at this point.
It's available, but unpublished.
3. When the indexing task has finished reading data for the segment, it pushes it to deep storage and then publishes it
by writing a record into the metadata store.
4. If the indexing task is a realtime task, at this point it waits for a Historical process to load the segment. If the
indexing task is not a realtime task, it exits immediately.
And like this on the Coordinator / Historical side:
1. The Coordinator polls the metadata store periodically (by default, every 1 minute) for newly published segments.
2. When the Coordinator finds a segment that is published and used, but unavailable, it chooses a Historical process
to load that segment and instructs that Historical to do so.
3. The Historical loads the segment and begins serving it.
4. At this point, if the indexing task was waiting for handoff, it will exit.
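The Coordinator side of handoff can be sketched as a single polling pass. This is a simplified model under stated assumptions: the `Historical` class, the `coordinator_pass` function, and the "most free slots" selection policy are all hypothetical, and the real Coordinator's placement logic is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Historical:
    name: str
    capacity: int                      # hypothetical: max number of segments
    loaded: set = field(default_factory=set)


def coordinator_pass(published, historicals):
    """One polling pass: assign every published, used, unavailable segment.

    `published` maps segment id -> "used" flag, standing in for the metadata store.
    """
    served = set()
    for historical in historicals:
        served |= historical.loaded
    assignments = []
    for segment_id, used in sorted(published.items()):
        if not used or segment_id in served:
            continue
        candidates = [h for h in historicals if len(h.loaded) < h.capacity]
        if not candidates:
            break  # out of capacity: the segment stays unavailable
        # Assumed policy: pick the Historical with the most free slots.
        target = max(candidates, key=lambda h: h.capacity - len(h.loaded))
        target.loaded.add(segment_id)
        assignments.append((segment_id, target.name))
    return assignments


historicals = [Historical("h1", capacity=2), Historical("h2", capacity=1)]
published = {"seg_a": True, "seg_b": False, "seg_c": True}
print(coordinator_pass(published, historicals))
```

Running a second pass over the same inputs assigns nothing, because every used segment is already being served.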
## Ingestion methods
In most ingestion methods, the work of reading source data and building segments is done by Druid
MiddleManager processes. One exception is Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce
job on YARN (although MiddleManager processes are still involved in starting and monitoring the Hadoop jobs).
Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.html), they will be loaded by Druid Historical processes. Some Druid
ingestion methods additionally support _real-time queries_, meaning you can query in-flight data on MiddleManager processes
before it is finished being converted and written to deep storage. In general, a small amount of data will be in-flight
on MiddleManager processes relative to the larger amount of historical data being served from Historical processes.
See the [Design](../design/index.html) page for more details on how Druid stores and manages your data.
The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
the best one for your situation.
|Method|How it works|Can append and overwrite?|Can handle late data?|Exactly-once ingestion?|Real-time queries?|
|------|------------|-------------------------|---------------------|-----------------------|------------------|
|[Native batch](native_tasks.html)|Druid loads data directly from S3, HTTP, NFS, or other networked storage.|Append or overwrite|Yes|Yes|No|
|[Hadoop](hadoop.html)|Druid launches Hadoop MapReduce jobs to load data files.|Overwrite|Yes|Yes|No|
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.html)|Druid reads directly from Kafka.|Append only|Yes|Yes|Yes|
|[Tranquility](stream-push.html)|You use Tranquility, a client-side library, to push individual records into Druid.|Append only|No - late data is dropped|No - may drop or duplicate data|Yes|
## Partitioning
Druid is a distributed data store, and it partitions your data in order to process it in parallel. Druid
[datasources](../design/index.html) are always partitioned first by time based on the
[segmentGranularity](../ingestion/index.html#granularityspec) parameter of your ingestion spec. Each of these time partitions is called
a _time chunk_, and each time chunk contains one or more [segments](../design/segments.html). The segments within a
particular time chunk may be partitioned further using options that vary based on the ingestion method you have chosen.
* With [Hadoop](hadoop.html) you can do hash- or range-based partitioning on one or more columns.
* With [Native batch](native_tasks.html) you can partition on a hash of dimension columns. This is useful when
rollup is enabled, since it maximizes your space savings.
* With [Kafka indexing](../development/extensions-core/kafka-ingestion.html), partitioning is based on Kafka
partitions, and is not configurable through Druid. You can configure it on the Kafka side by using the partitioning
functionality of the Kafka producer.
* With [Tranquility](stream-push.html), partitioning is done by default on a hash of all dimension columns in order
to maximize rollup. You can also provide a custom Partitioner class; see the
[Tranquility documentation](https://github.com/druid-io/tranquility/blob/master/docs/overview.md#partitioning-and-replication)
for details.
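As a rough illustration of the two levels of partitioning described above, the sketch below floors a timestamp to its time chunk and then picks a secondary partition from a hash of the dimension values. The function names, the two supported granularities, and the use of SHA-256 are assumptions made for this example; Druid's own hash function and shard specs differ.

```python
import hashlib
from datetime import datetime, timezone


def time_chunk_start(ts: datetime, segment_granularity: str) -> datetime:
    """Floor a timestamp to the start of its time chunk."""
    if segment_granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if segment_granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity in this sketch: {segment_granularity}")


def hash_partition(dimension_values, num_partitions: int) -> int:
    """Pick a partition from a stable hash of the dimension values."""
    key = "\x01".join(dimension_values).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


event_time = datetime(2018, 1, 1, 1, 1, 35, tzinfo=timezone.utc)
chunk = time_chunk_start(event_time, "day")
partition = hash_partition(["1.1.1.1", "2.2.2.2"], num_partitions=3)
print(chunk.isoformat(), partition)
```

Rows with identical dimension values always land in the same partition of a time chunk, which is why hashing on dimensions helps rollup.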
All Druid datasources are partitioned by time. Each data ingestion method must acquire a write lock on a particular
time range when loading data, so no two methods can operate on the same time range of the same datasource at the same
time. However, two data ingestion methods _can_ operate on different time ranges of the same datasource at the same
time. For example, you can do a batch backfill from Hadoop while also doing a real-time load from Kafka, so long as
the backfill data and the real-time data do not need to be written to the same time partitions. (If they do, the
real-time load will take priority.)
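Whether two ingestion jobs would contend for the same lock comes down to an interval overlap check. The sketch below assumes half-open intervals (start inclusive, end exclusive); `intervals_overlap` is a hypothetical helper, not a Druid API.

```python
from datetime import datetime


def intervals_overlap(a, b) -> bool:
    """Return True if two half-open (start, end) intervals share any instant."""
    a_start, a_end = a
    b_start, b_end = b
    return a_start < b_end and b_start < a_end


backfill = (datetime(2018, 1, 1), datetime(2018, 2, 1))   # batch backfill of January
realtime = (datetime(2018, 2, 1), datetime(2018, 2, 2))   # real-time load for 1 February

# Adjacent intervals do not overlap, so both loads can run at the same time.
print(intervals_overlap(backfill, realtime))
```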
For tips on how partitioning can affect performance and storage footprint, see the
[schema design](schema-design.html#partitioning) page.
## Rollup
Druid is able to summarize raw data at ingestion time using a process we refer to as "roll-up".
Roll-up is a first-level aggregation operation over a selected set of "dimensions", where a set of "metrics" are aggregated.
Suppose we have the following raw data, representing total packet/byte counts in particular seconds for traffic between a source and destination. The `srcIP` and `dstIP` fields are dimensions, while `packets` and `bytes` are metrics.
```
timestamp srcIP dstIP packets bytes
2018-01-01T01:01:35Z 1.1.1.1 2.2.2.2 100 1000
2018-01-01T01:01:51Z 1.1.1.1 2.2.2.2 200 2000
2018-01-01T01:01:59Z 1.1.1.1 2.2.2.2 300 3000
2018-01-01T01:02:14Z 1.1.1.1 2.2.2.2 400 4000
2018-01-01T01:02:29Z 1.1.1.1 2.2.2.2 500 5000
2018-01-01T01:03:29Z 1.1.1.1 2.2.2.2 600 6000
2018-01-02T21:33:14Z 7.7.7.7 8.8.8.8 100 1000
2018-01-02T21:33:45Z 7.7.7.7 8.8.8.8 200 2000
2018-01-02T21:35:45Z 7.7.7.7 8.8.8.8 300 3000
```
If we ingest this data into Druid with a `queryGranularity` of `minute` (which will floor timestamps to minutes), the roll-up operation is equivalent to the following pseudocode:
```
GROUP BY TRUNCATE(timestamp, MINUTE), srcIP, dstIP :: SUM(packets), SUM(bytes)
```
After the data above is aggregated during roll-up, the following rows will be ingested:
```
timestamp srcIP dstIP packets bytes
2018-01-01T01:01:00Z 1.1.1.1 2.2.2.2 600 6000
2018-01-01T01:02:00Z 1.1.1.1 2.2.2.2 900 9000
2018-01-01T01:03:00Z 1.1.1.1 2.2.2.2 600 6000
2018-01-02T21:33:00Z 7.7.7.7 8.8.8.8 300 3000
2018-01-02T21:35:00Z 7.7.7.7 8.8.8.8 300 3000
```
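The same roll-up can be reproduced outside Druid with a few lines of Python. This is only a sketch of the aggregation logic for the sample data above; `roll_up_by_minute` is a hypothetical helper and says nothing about how Druid implements roll-up internally.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

RAW_ROWS = [
    ("2018-01-01T01:01:35Z", "1.1.1.1", "2.2.2.2", 100, 1000),
    ("2018-01-01T01:01:51Z", "1.1.1.1", "2.2.2.2", 200, 2000),
    ("2018-01-01T01:01:59Z", "1.1.1.1", "2.2.2.2", 300, 3000),
    ("2018-01-01T01:02:14Z", "1.1.1.1", "2.2.2.2", 400, 4000),
    ("2018-01-01T01:02:29Z", "1.1.1.1", "2.2.2.2", 500, 5000),
    ("2018-01-01T01:03:29Z", "1.1.1.1", "2.2.2.2", 600, 6000),
    ("2018-01-02T21:33:14Z", "7.7.7.7", "8.8.8.8", 100, 1000),
    ("2018-01-02T21:33:45Z", "7.7.7.7", "8.8.8.8", 200, 2000),
    ("2018-01-02T21:35:45Z", "7.7.7.7", "8.8.8.8", 300, 3000),
]


def roll_up_by_minute(rows):
    """Group by (minute-floored timestamp, srcIP, dstIP) and sum both metrics."""
    totals = defaultdict(lambda: [0, 0])
    for timestamp, src_ip, dst_ip, packets, byte_count in rows:
        floored = datetime.strptime(timestamp, TS_FORMAT).replace(second=0)
        key = (floored.strftime(TS_FORMAT), src_ip, dst_ip)
        totals[key][0] += packets
        totals[key][1] += byte_count
    return [key + tuple(sums) for key, sums in sorted(totals.items())]


for row in roll_up_by_minute(RAW_ROWS):
    print(*row)
```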
The roll-up granularity is the finest granularity at which you will be able to explore the data, because event timestamps are floored to it.
Hence, Druid ingestion specs define this granularity as the `queryGranularity` of the data. The finest supported `queryGranularity` is millisecond.
The following links may be helpful in further understanding dimensions and metrics:
* [https://en.wikipedia.org/wiki/Dimension_(data_warehouse)](https://en.wikipedia.org/wiki/Dimension_(data_warehouse))
* [https://en.wikipedia.org/wiki/Measure_(data_warehouse)](https://en.wikipedia.org/wiki/Measure_(data_warehouse))
For tips on how to use rollup in your Druid schema designs, see the [schema design](schema-design.html#rollup) page.
### Roll-up modes
Druid supports two roll-up modes: _perfect roll-up_ and _best-effort roll-up_. In perfect roll-up mode, Druid guarantees that input data are perfectly aggregated at ingestion time. In best-effort roll-up mode, input data might not be perfectly aggregated, so multiple segments can hold rows that would belong to the same segment under perfect roll-up, because they have the same dimension values and their timestamps fall into the same interval.
Perfect roll-up mode requires an additional preprocessing step to determine intervals and shardSpecs before actual data ingestion if they are not specified in the ingestion spec. This preprocessing step usually scans the entire input data, which might increase the ingestion time. The [Hadoop indexing task](../ingestion/hadoop.html) always runs in perfect roll-up mode.
In contrast, best-effort roll-up mode doesn't require any preprocessing step, but the size of the ingested data might be larger than with perfect roll-up. All types of [streaming indexing (e.g., Kafka indexing service)](../ingestion/stream-ingestion.html) run in this mode.
Finally, the [native index task](../ingestion/native_tasks.html) supports both modes, and you can choose whichever fits your application.
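A small sketch may help show what best-effort roll-up costs. It assumes two hypothetical segments that each hold a partially aggregated row for the same timestamp and dimension values; `query_packets` is an illustrative helper that sums across segments, as a query over an additive metric would.

```python
from collections import Counter

ROW_KEY = ("2018-01-01T01:01:00Z", "1.1.1.1", "2.2.2.2")

# Best-effort roll-up: the same key ended up in two segments.
best_effort_segments = [{ROW_KEY: 300}, {ROW_KEY: 300}]
# Perfect roll-up: the same input data collapses into one row.
perfect_segments = [{ROW_KEY: 600}]


def query_packets(segments):
    """Sum the packets metric per key across all segments."""
    totals = Counter()
    for segment in segments:
        totals.update(segment)  # Counter.update adds values for matching keys
    return dict(totals)


def stored_rows(segments):
    return sum(len(segment) for segment in segments)


print(query_packets(best_effort_segments), stored_rows(best_effort_segments))
print(query_packets(perfect_segments), stored_rows(perfect_segments))
```

In this sketch both layouts return the same total, but the best-effort layout stores two rows where perfect roll-up stores one, which is the storage overhead described above.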
## Data maintenance
### Inserts and overwrites
Druid can insert new data into an existing datasource by appending new segments to existing segment sets. It can also add new data by merging an existing set of segments with new data and overwriting the original set.
Druid does not support single-record updates by primary key.
Updates are described further at [update existing data](../ingestion/update-existing-data.html).
### Compaction
Compaction is a type of overwrite operation that reads an existing set of segments, combines them into a new set of fewer but larger segments, and overwrites the original set with the new compacted set, without changing the data that is stored.
For performance reasons, it is sometimes beneficial to compact a set of segments into fewer but larger segments, because there is some per-segment processing and memory overhead in both the ingestion and querying paths.
For compaction documentation, please see [tasks](../ingestion/tasks.html).
### Retention and tiering
Druid supports retention rules, which are used to define intervals of time where data should be preserved, and intervals where data should be discarded.
Druid also supports separating Historical processes into tiers, and the retention rules can be configured to assign data for specific intervals to specific tiers.
These features are useful for performance/cost management; a common use case is separating Historical processes into a "hot" tier and a "cold" tier.
For more information, please see [Load rules](../operations/rule-configuration.html).
### Deletes
Druid supports permanent deletion of segments that are in an "unused" state (see the [Segment states](#segment-states) section above).
The Kill Task deletes unused segments within a specified interval from metadata storage and deep storage.
For more information, please see [Kill Task](../ingestion/tasks.html#kill-task).
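A kill task is submitted as a JSON task spec. The sketch below only builds such a payload; the field names (`type`, `dataSource`, `interval`) are assumed from the Kill Task documentation linked above and should be checked against it for your Druid version. The datasource name and interval are hypothetical, and submitting the task to the Overlord is not shown.

```python
import json


def build_kill_task(data_source: str, interval: str) -> dict:
    """Build a kill task spec for unused segments in the given interval."""
    return {
        "type": "kill",              # assumed task type name
        "dataSource": data_source,   # hypothetical datasource
        "interval": interval,        # ISO 8601 interval, start/end
    }


payload = json.dumps(build_kill_task("example_datasource", "2018-01-01/2018-02-01"), indent=2)
print(payload)
```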
