# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# markdown-spellcheck spelling configuration file
# Format: lines beginning with # are comments.
# The global dictionary is at the start; file overrides come afterwards.
# One word per line. To define a file override, use ' - filename',
# where filename is relative to this configuration file.
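# For illustration, a file override section could look like the commented-out
# sketch below; the path and word are hypothetical placeholders rather than
# real entries, and they are commented out so they do not affect the active
# dictionary:
#  - ../docs/example/some-doc.md
#  someDocSpecificWord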
1M
100MiB
32-bit
4MiB
500MiB
64-bit
ACL
ACLs
APIs
apache.org
AvroStorage
ARN
ASC
arcsine
arccosine
arctangent
autokill
AWS
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
AWS_CONTAINER_CREDENTIALS_FULL_URI
Actian
Authorizer
Avatica
Avro
Azul
AzureDNSZone
BCP
Base64
Base64-encoded
ByteBuffer
bottlenecked
cartesian
concat
CIDR
CORS
CNF
CPUs
CSVs
CentralizedDatasourceSchema
Ceph
circledR
CloudWatch
ColumnDescriptor
Corretto
CLI
CUME_DIST
DDL
DENSE_RANK
DML
DNS
DRUIDVERSION
DataSketches
DateTime
DateType
DeltaLakeInputSource
dimensionsSpec
DimensionSpec
DimensionSpecs
Dockerfile
Docusaurus
DogStatsD
DOCTYPE
Double.NEGATIVE_INFINITY
Double.NEGATIVE_INFINITY.
Double.POSITIVE_INFINITY
Double.POSITIVE_INFINITY.
downsampled
downsamples
downsampling
Dropwizard
dropwizard
druid-deltalake-extensions
DruidInputSource
DruidSQL
DynamicConfigProvider
EC2
EC2ContainerCredentialsProviderWrapper
ECS
EMR
EMRFS
ETL
Elasticsearch
Enums
FIRST_VALUE
FlattenSpec
Float.NEGATIVE_INFINITY
Float.NEGATIVE_INFINITY.
Float.POSITIVE_INFINITY
Float.POSITIVE_INFINITY.
ForwardedRequestCustomizer
GC
GPG
GSSAPI
GUIs
GroupBy
Guice
HDFS
HLL
HashSet
Homebrew
html
HyperLogLog
IAM
IANA
IcebergFilter
IcebergInputSource
IETF
IoT
IP
IPv4
IPv6
IS_AGGREGATOR
IS_BROADCAST
IS_JOINABLE
IS0
ISO-8601
ISO8601
IndexSpec
IndexTask
InfluxDB
InputFormat
InputSource
InputSources
Integer.MAX_VALUE
IntelliJ
ioConfig
Istio
JBOD
JDBC
JDK
JDK7
JDK8
JKS
jks
JMX
JRE
JS
JSON
JsonPath
JSONPath
JSSE
JVM
JVMs
JWT
Joda
JsonProperty
Jupyter
JupyterLab
KMS
Kerberized
Kerberos
KeyStores
Kinesis
Kubernetes
Lakehouse
LDAPS
LRU
LZ4
LZO
LimitSpec
Long.MAX_VALUE
Long.MAX_VALUE.
Long.MIN_VALUE
Long.MIN_VALUE.
Lucene
MapBD
MapDB
MariaDB
MiddleManager
MiddleManagers
Montréal
MSQ
Murmur3
MVCC
MV_TO_ARRAY
NFS
OCF
OIDC
OLAP
OOMs
OpenJDK
OpenLDAP
OpenTSDB
OutputStream
ParAccel
ParseSpec
ParseSpecs
Protobuf
protobuf
pull-deps
RDBMS
RDDs
RDS
ROUTINE_CATALOG
ROUTINE_NAME
ROUTINE_SCHEMA
ROUTINE_TYPE
Rackspace
Redis
S3
SAS
SDK
SIGAR
SPNEGO
Splunk
SqlInputSource
SQLServer
SSD
SSDs
SSL
Samza
SqlParameter
SslContextFactory
StatsD
SYSTEM_TABLE
TaskRunner
TabItem
TCP
TGT
TLS
tls
TopN
TopNs
tryParseNumbers
UI
UIs
UPSERT
URI
URIs
uris
UTF-16
UTF-8
UTF8
XMLs
ZK
ZSTD
accessor
ad-hoc
aggregator
aggregators
ambari
analytics
arrayElement
arrayContainsElement
assumeNewlineDelimited
assumeRoleArn
assumeRoleExternalId
async
authorizer
authorizers
autocomplete
autodiscovery
autoscaler
autoscaling
averager
averagers
backend
backfills
backfilled
backpressure
base64
big-endian
bigint
blkio
blobstore
Boolean
boolean
breakpoint
broadcasted
bucketSize
checksums
classpath
clickstream
clientConfig
codebase
codec
colocated
colocation
colocating
compactable
compactionTask
complexMetricCompression
config
configs
consumerProperties
cron
csv
customizable
dataset
datasets
datasketches
datasource
datasources
dbcp
deepstore
denormalization
denormalize
denormalized
deprioritization
deprioritizes
dequeued
deserialization
deserialize
deserialized
deserializes
downtimes
druid
druidkubernetes-extensions
druid.table
e.g.
encodings
endian
endpointConfig
enum
expectedType
expr
failover
failovers
featureSpec
findColumnsFromHeader
filenames
filesystem
filterColumn
filterValue
firefox
firehose
firehoses
fromPigAvroStorage
frontends
granularities
granularitySpec
gzip
gzipped
hadoop
hasher
hashcode
hashtable
high-QPS
historicals
hostname
hostnames
http
https
idempotency
i.e.
influxdb
influencer
influencers
ingestions
ingestionSpec
injective
inlined
inSubQueryThreshold
interruptible
isAllowList
jackson-jq
javadoc
javascript
joinable
jsonCompression
json_keys
json_object
json_paths
json_query
json_query_array
json_value
json_merge
karlkfi
kbps
kerberos
keystore
keytool
keytab
kubernetes
kubexit
k8s
laning
lifecycle
localhost
log4j
log4j2
log4j2.xml
lookback
lookups
mapreduce
masse
maxBytes
maxNumericInFilters
maxNumFiles
maxNumSegments
max_map_count
memcached
mergeable
mergeability
metadata
metastores
millis
microbatch
microbatches
misconfiguration
misconfigured
mostAvailableSize
multitenancy
multitenant
MVDs
mysql
namespace
namespaced
namespaces
natively
netflow
nondescriptive
nonfinalized
non-null
non-nullable
noop
NTILE
numerics
numShards
parameterize
objectGlob
parameterized
parse_json
parseable
partitioner
partitionFunction
partitionsSpec
pathParts
PERCENT_RANK
performant
plaintext
pluggable
podSpec
postgres
postgresql
pre-aggregate
pre-aggregated
pre-aggregates
pre-aggregating
pre-aggregation
pre-computation
pre-compute
pre-computing
preconfigured
pre-existing
pre-filtered
pre-filtering
pre-generated
pre-made
pre-processing
preemptible
prefetch
prefetched
prefetching
precached
prepend
prepended
prepending
prepends
prepopulated
preprocessing
priori
procs
processFromRaw
programmatically
proto
proxied
proxyConfig
python2
python3
Python2
Python3
QPS
quantile
quantiles
queryable
quickstart
realtime
rebalance
redis
regexes
reimported
reindex
reindexing
reingest
reingesting
reingestion
repo
requireSSL
rollup
rollups
ROW_NUMBER
rsync
runtime
schemas
schemaless
searchable
secondaryPartitionPruning
seekable
seekable-stream
servlet
setProcessingThreadNames
sigterm
simple-client-sslcontext
sharded
sharding
skipHeaderRows
Smoosh
smoosh
smooshed
snapshotting
splittable
ssl
sslmode
start_time
stdout
storages
stringDictionaryEncoding
stringified
sub-conditions
subarray
subnet
subqueries
subquery
subsecond
substring
substrings
subtask
subtasks
supervisorTaskId
SVG
symlink
syntaxes
systemFields
tablePath
tiering
timeseries
Timeseries
timestamp
timestamps
to_json_string
tradeoffs
transformSpec
try_parse_json
tsv
ulimit
unannounce
unannouncements
unary
unassign
uncomment
underutilization
unintuitive
unioned
unmergeable
unmerged
UNNEST
unnest
unnested
unnesting
unnests
unparseable
unparsed
unsetting
untrusted
useFilterCNF
useJqSyntax
useJsonNodeReader
useSSL
upsert
uptime
uris
urls
useFieldDiscovery
v1
v2
vCPUs
validator
varchar
vectorizable
vectorize
vectorizeVirtualColumns
versioned
versioning
virtualColumns
w.r.t.
walkthrough
whitelist
whitelisted
whitespace
wildcard
wildcards
xml
XOR
znode
znodes
APPROX_COUNT_DISTINCT
APPROX_QUANTILE
ARRAY_AGG
ARRAY_TO_MV
BIGINT
CATALOG_NAME
CHARACTER_MAXIMUM_LENGTH
CHARACTER_OCTET_LENGTH
CHARACTER_SET_NAME
COLLATION_NAME
COLUMN_DEFAULT
COLUMN_NAME
Concats
DATA_TYPE
DATETIME_PRECISION
DEFAULT_CHARACTER_SET_CATALOG
DEFAULT_CHARACTER_SET_NAME
DEFAULT_CHARACTER_SET_SCHEMA
ISODOW
ISOYEAR
IS_NULLABLE
JDBC_TYPE
MIDDLE_MANAGER
MILLIS_TO_TIMESTAMP
NULLable
NUMERIC_PRECISION
NUMERIC_PRECISION_RADIX
NUMERIC_SCALE
ORDINAL_POSITION
PNG
POSIX
P1M
P1Y
PT1M
PT5M
SCHEMA_NAME
SCHEMA_OWNER
SERVER_SEGMENTS
SMALLINT
SQL_PATH
STRING_AGG
SYSTEM_TABLE
TABLE_CATALOG
TABLE_NAME
TABLE_SCHEMA
TABLE_TYPE
TIME_PARSE
TIME_SHIFT
TINYINT
VARCHAR
avg_num_rows
avg_size
created_time
current_size
detailed_state
druid.server.maxSize
druid.server.tier
druid.sql.planner.maxSemiJoinRowsInMemory
druid.sql.planner.sqlTimeZone
druid.sql.planner.useApproximateCountDistinct
druid.sql.planner.useApproximateTopN
druid.sql.planner.useGroupingSetForExactDistinct
druid.sql.planner.useNativeQueryExplain
error_msg
exprs
group_id
interval_expr
is_active
is_available
is_leader
is_overshadowed
is_published
is_realtime
java.sql.Types
last_compaction_state
max_size
num_replicas
num_rows
num_segments
partition_num
plaintext_port
queue_insertion_time
replication_factor
runner_status
segment_id
server_type
shard_spec
sqlTimeZone
sql-msq-task
supervisor_id
sys
sys.segments
task_id
timestamp_expr
tls_port
total_size
useApproximateCountDistinct
useGroupingSetForExactDistinct
useApproximateTopN
wikipedia
your-table
enableTimeBoundaryPlanning
TimeBoundary
druid.query.default.context.enableTimeBoundaryPlanning
IEC
# MSQ general
SegmentGenerator
granularity_string
QueryKit
# MSQ report fields
taskId
multiStageQuery.taskId
multiStageQuery.payload.status
multiStageQuery.payload.status.status
multiStageQuery.payload.status.startTime
multiStageQuery.payload.status.durationMs
multiStageQuery.payload.status.pendingTasks
multiStageQuery.payload.status.runningTasks
multiStageQuery.payload.status.errorReport
multiStageQuery.payload.status.errorReport.taskId
multiStageQuery.payload.status.errorReport.host
multiStageQuery.payload.status.errorReport.stageNumber
multiStageQuery.payload.status.errorReport.error
multiStageQuery.payload.status.errorReport.error.errorCode
multiStageQuery.payload.status.errorReport.error.errorMessage
multiStageQuery.payload.status.errorReport.exceptionStackTrace
multiStageQuery.payload.stages stages
multiStageQuery.payload.stages[].stageNumber
definition.id
definition.input
definition.broadcast
definition.processor
definition.signature
stageNumber
startTime
multiStageQuery.payload.stages
READING_INPUT
POST_READING
RESULTS_COMPLETE
workerCount
partitionCount
startCount
# MSQ errors and limits
BroadcastTablesTooLarge
CannotParseExternalData
ColumnNameRestricted
ColumnTypeNotSupported
DurableStorageConfiguration
InsertCannotAllocateSegment
InsertCannotBeEmpty
InsertCannotReplaceExistingSegment
InsertLockPreempted
InsertTimeNull
CURRENT_TIMESTAMP
InsertTimeOutOfBounds
UnknownError
TaskStartTimeout
OutOfMemoryError
SegmentGenerator
maxFrameSize
InvalidNullByte
QueryNotSupported
RowTooLarge
TooManyBuckets
TooManyInputFiles
TooManyPartitions
TooManyColumns
TooManyWarnings
TooManyWorkers
NotEnoughMemory
WorkerFailed
WorkerRpcFailed
TIMED_OUT
# MSQ context parameters
maxNumTasks
taskAssignment
finalizeAggregations
indexSpec
rowsInMemory
segmentSortOrder
rowsPerSegment
durableShuffleStorage
composedIntermediateSuperSorterStorageEnabled
intermediateSuperSorterStorageMaxLocalBytes
ResourceLimitExceededException
# Aggregations
groupByEnableMultiValueUnnesting
APPROX_COUNT_DISTINCT_DS_HLL
APPROX_COUNT_DISTINCT_DS_THETA
APPROX_QUANTILE_DS
DS_QUANTILES_SKETCH
APPROX_QUANTILE_FIXED_BUCKETS
# Operators
pivoted
UNPIVOT
unpivoted
# File specific overrides
100x
_common
appender
appenders
druid-hdfs-storage
druid-s3-extensions
druid.sql.planner.maxNumericInFilters
Minio
multi-server
BasicDataSource
LeaderLatch
2.x
28.x
3.0.x
3.5.x
3.4.x
3.5.x.
AllowAll
AuthenticationResult
AuthorizationLoadingLookupTest
booleans
EOF
IE11
InsufficientResourceException
HttpClient
JsonConfigurator
KIP-297
allowAll
authenticatorChain
defaultUser
inputSegmentSizeBytes
skipOffsetFromLatest
brokerService
c3.2xlarge
defaultManualBrokerService
maxPriority
minPriority
NUMBER_FEATURES
NUMBER_OF_CONTRIBUTORS
PreparedStatement
pre-upgrade
QueryCapacityExceededException
QueryTimeoutException
QueryUnsupportedException
ResultSet
runtime.properties
SqlParseException
timeBoundary
ValidationException
0x0
0x9
2GB
300mb-700mb
Bieber
IndexTask-based
Ke
datasource_intervalStart_intervalEnd_version_partitionNum
partitionNum
v9
3.x
8u92
DskipTests
Papache-release
Pdist
Dweb.console.skip
yaml
Phadoop3
dist-hadoop3
hadoop3
2.x.x
3.x.x
ambari-metrics
metricName
trustStore
fetchTimeout
gz
maxCacheCapacityBytes
maxFetchCapacityBytes
maxFetchRetry
prefetchTriggerBytes
shardSpecs
sharedAccessStorageToken
cloudfiles
rackspace-cloudfiles-uk
rackspace-cloudfiles-us
gz
shardSpecs
maxCacheCapacityBytes
maxFetchCapacityBytes
fetchTimeout
maxFetchRetry
distinctCount
groupBy
maxIntermediateRows
numValuesPerPass
queryGranularity
segmentGranularity
topN
visitor_id
cpu
web_requests
_
druid_
druid_cache_total
druid_hits
druid_query
historical001
HadoopTuningConfig
TuningConfig
base-dataSource
baseDataSource
baseDataSource-hashCode
classpathPrefix
derivativeDataSource
druid.extensions.hadoopDependenciesDir
hadoopDependencyCoordinates
maxTaskCount
metricsSpec
queryType
tuningConfig
arcsinh
fieldName
momentSketchMerge
momentsketch
10-minutes
MeanNoNulls
P1D
cycleSize
doubleMax
doubleAny
doubleFirst
doubleLast
doubleMean
doubleMeanNoNulls
doubleMin
doubleSum
druid.generic.useDefaultValueForNull
druid.generic.ignoreNullsForStringCardinality
limitSpec
longMax
longAny
longFirst
longLast
longMean
longMeanNoNulls
longMin
longSum
movingAverage
postAggregations
postAveragers
pull-deps
defaultMetrics.json
namespacePrefix
src
loadList
pull-deps
PT2S
com.microsoft.sqlserver.jdbc.SQLServerDriver
sqljdbc
convertRange
HTTPServer
conversionFactor
prometheus
Pushgateway
flushPeriod
postAggregator
postAggregators
quantileFromTDigestSketch
quantilesFromTDigestSketch
tDigestSketch
HadoopDruidIndexer
LzoThriftBlock
SequenceFile
classname
hadoop-lzo
inputFormat
inputSpec
ioConfig
parseSpec
thriftClass
thriftJar
timeMax
timeMin
Alibaba
Aliyun
aliyun-oss-extensions
AccessKey
accessKey
aliyun-oss
json
Oshi
OSS
oss
secretKey
url
approxHistogram
approxHistogramFold
fixedBucketsHistogram
bucketNum
lowerLimit
numBuckets
upperLimit
AVRO-1124
Avro-1124
SchemaRepo
avro
avroBytesDecoder
protoBytesDecoder
flattenSpec
jq
org.apache.druid.extensions
schemaRepository
schema_inline
subjectAndIdConverter
url
BloomKFilter
bitset
outputStream
HLLSketchBuild
HLLSketchMerge
lgK
log2
tgtHllType
CDF
DoublesSketch
maxStreamLength
PMF
quantilesDoublesSketch
toString
isInputThetaSketch
thetaSketch
user_id
ArrayOfDoublesSketch
arrayOfDoublesSketch
metricColumns
nominalEntries
numberOfValues
INFORMATION_SCHEMA
MyBasicAuthenticator
MyBasicAuthorizer
authenticatorName
authorizerName
druid_system
pollingPeriod
roleName
LDAP
ldap
MyBasicMetadataAuthenticator
MyBasicLDAPAuthenticator
MyBasicMetadataAuthorizer
MyBasicLDAPAuthorizer
credentialsValidator
sAMAccountName
objectClass
initialAdminRole
adminGroupMapping
groupMappingName
8KiB
HttpComponents
MyKerberosAuthenticator
RFC-4559
SPNego
_HOST
cacheFactory
concurrencyLevel
dataFetcher
expireAfterAccess
expireAfterWrite
initialCapacity
loadingCacheSpec
maxEntriesSize
maxStoreSize
maximumSize
onHeapPolling
pollPeriod
reverseLoadingCacheSpec
OAuth
Okta
OpenID
pac4j
Env
POD_NAME
POD_NAMESPACE
ConfigMap
PT17S
GCS
gcs-connector
hdfs
Aotearoa
Czechia
KTable
LookupExtractorFactory
Zeelund
zookeeper.connect
0.11.x.
00Z
2016-01-01T11
2016-01-01T12
2016-01-01T14
CONNECTING_TO_STREAM
CREATING_TASKS
DISCOVERING_INITIAL_TASKS
KafkaSupervisorIOConfig
KafkaSupervisorTuningConfig
LOST_CONTACT_WITH_STREAM
OffsetOutOfRangeException
P2147483647D
PT10M
PT10S
PT1H
PT30M
PT30S
PT5S
PT80S
SASL
SegmentWriteOutMediumFactory
UNABLE_TO_CONNECT_TO_STREAM
UNHEALTHY_SUPERVISOR
UNHEALTHY_TASKS
dimensionCompression
earlyMessageRejectionPeriod
indexSpec
intermediateHandoffPeriod
longEncoding
maxBytesInMemory
maxPendingPersists
maxRowsInMemory
maxRowsPerSegment
maxSavedParseExceptions
maxTotalRows
metricCompression
numKafkaPartitions
taskCount
taskDuration
9.2dist
KinesisSupervisorIOConfig
KinesisSupervisorTuningConfig
RabbitMQ
Resharding
resharding
LZ4LZFuncompressedLZ4LZ4LZFuncompressednoneLZ4autolongsautolongslongstypeconcisetyperoaringtypestreamendpointreplicastaskCounttaskCount
deaggregate
druid-kinesis-indexing-service
maxRecordsPerPoll
maxBytesPerPoll
maxRecordsPerPollrecordsPerFetchfetchDelayMillisreplicasfetchDelayMillisrecordsPerFetchfetchDelayMillismaxRecordsPerPollamazon-kinesis-client1
numKinesisShards
numProcessors
q.size
repartitionTransitionDuration
replicastaskCounttaskCount
resetOffsets
resetuseEarliestSequenceNumberPOST
resumePOST
statusrecentErrorsdruid.supervisor.maxStoredExceptionEventsstatedetailedStatestatedetailedStatestatestatePENDINGRUNNINGSUSPENDEDSTOPPINGUNHEALTHY_SUPERVISORUNHEALTHY_TASKSdetailedStatestatedruid.supervisor.unhealthinessThresholddruid.supervisor.taskUnhealthinessThresholdtaskDurationtaskCountreplicasdetailedStatedetailedStateRUNNINGPOST
supervisorPOST
supervisorfetchThreadsfetchDelayMillisrecordsPerFetchmaxRecordsPerPollpoll
suspendPOST
taskCounttaskDurationreplicas
taskCounttaskDurationtaskDurationPOST
taskDurationstartDelayperioduseEarliestSequenceNumbercompletionTimeouttaskDurationlateMessageRejectionPeriodPT1HearlyMessageRejectionPeriodPT1HPT1HrecordsPerFetchfetchDelayMillisawsAssumedRoleArnawsExternalIddeaggregateGET
terminatePOST
terminatedruid.worker.capacitytaskDurationcompletionTimeoutreplicastaskCountreplicas
PT2M
kinesis.us
amazonaws.com
PT6H
GetRecords
KCL
signalled
ProvisionedThroughputExceededException
Deaggregation
baz
customJson
lookupParseSpec
namespaceParseSpec
simpleJson
dimensionSpec
flattenSpec
binaryAsString
replaceMissingValueWith
sslFactory
sslMode
Proto
metrics.desc
metrics.desc.
metrics.proto.
metrics_pb
protoMessageType
timeAndDims
tmp
SigV4
jvm.config
kms
s3
s3a
s3n
uris
KeyManager
SSLContext
TrustManager
GenericUDAFVariance
Golub
J.L.
LeVeque
Numer
chunk1
chunk2
stddev
t1
t2
variance1
variance2
varianceFold
variance_pop
variance_sample
Berry_statbook
Berry_statbook_chpt6.pdf
S.E.
engineering.com
jcb0773
n1
n2
p1
p2
pvalue2tailedZtest
sqrt
successCount1
successCount2
www.isixsigma.com
www.paypal
www.ucs.louisiana.edu
zscore
zscore2sample
ztests
DistinctCount
artifactId
com.example
common.runtime.properties
druid-aws-rds-extensions
druid-cassandra-storage
druid-compressed-bigdecimal
druid-distinctcount
druid-ec2-extensions
druid-kafka-extraction-namespace
druid-kafka-indexing-service
druid-opentsdb-emitter
druid-protobuf-extensions
druid-tdigestsketch
druid.apache.org
groupId
jvm-global
kafka-emitter
org.apache.druid.extensions.contrib.
pull-deps
sqlserver-metadata-storage
statsd-emitter
coords
dimName
maxCoords
Mb
minCoords
Metaspace
dev
AggregatorFactory
ArchiveTask
ComplexMetrics
DataSegmentArchiver
DataSegmentKiller
DataSegmentMover
URIDataPuller
DataSegmentPusher
DruidModule
ExtractionFns
HdfsStorageDruidModule
JacksonInject
MapBinder
MoveTask
ObjectMapper
PasswordProvider
PostAggregators
QueryRunnerFactory
segmentmetadataquery
SegmentMetadataQuery
SegmentMetadataQueryQueryToolChest
loadSpec
multibind
pom.xml
0.6.x
0.7.x
0.7.x.
TimeAndDims
catalogProperties
catalogUri
column2
column_1
column_n
com.opencsv
ctrl
descriptorString
headerFormat
headerLabelPrefix
icebergFilter
icebergCatalog
jsonLowercase
kafka
KafkaStringHeaderFormat
kafka.header.
kafka.key
kafka.timestamp
kafka.topic
keyColumnName
keyFormat
listDelimiter
lowerOpen
timestamp
timestampColumnName
timestampSpec
upperOpen
urls
valueFormat
1GB
IOConfig
compactionTask
compactionTasks
numShards
IngestSegment
maxSizes
snapshotTime
windowPeriod
2012-01-01T00
2012-01-03T00
2012-01-05T00
2012-01-07T00
500MB
CombineTextInputFormat
HadoopIndexTask
InputFormat
InputSplit
JobHistory
a.example.com
assumeGrouped
awaitSegmentAvailabilityTimeoutMillis
cleanupOnFailure
combineText
connectURI
dataGranularity
datetime
f.example.com
filePattern
forceExtendableShardSpecs
ignoreInvalidRows
ignoreWhenNoSegments
indexSpecForIntermediatePersists
index_hadoop
inputPath
inputSpecs
interval1
interval2
jobProperties
leaveIntermediate
logParseExceptions
mapred.map.tasks
mapreduce.job.maps
maxParseExceptions
maxPartitionSize
maxSplitSize
metadataUpdateSpec
numBackgroundPersistThreads
overwriteFiles
partitionDimension
partitionDimensions
partitionSpec
pathFormat
segmentOutputPath
segmentTable
shardSpec
single_dim
tableName
targetPartitionSize
targetRowsPerSegment
useCombiner
useExplicitVersion
useNewAggs
useYarnRMJobStatusFallback
workingPath
z.example.com
150MB
DataSchema
DefaultPassword
EnvironmentVariablePasswordProvider
IOConfig
PartitionsSpec
PasswordProviders
SegmentsSplitHintSpec
SplitHintSpec
accessKeyId
appendToExisting
baseDir
chatHandlerNumRetries
chatHandlerTimeout
cityName
connectorConfig
countryName
dataSchema
dropExisting
foldCase
forceGuaranteedRollup
httpAuthenticationPassword
httpAuthenticationUsername
ingestSegment
InputSource
DruidInputSource
maxColumnsToMerge
maxInputSegmentBytesPerTask
maxNumConcurrentSubTasks
maxNumSegmentsToMerge
maxRetry
pushTimeout
reportParseExceptions
secretAccessKey
segmentWriteOutMediumFactory
sql
sqls
splitHintSpec
taskStatusCheckPeriodMs
timeChunk
totalNumMergeTasks
prefetchTriggerBytes
awaitSegmentAvailabilityTimeoutMillis
baseDir
httpAuthenticationUsername
DefaultPassword
PasswordProviders
EnvironmentVariablePasswordProvider
ingestSegment
maxInputSegmentBytesPerTask
150MB
foldCase
sqls
connectorConfig
httpAuthenticationPassword
accessKeyId
secretAccessKey
accessKeyId
httpAuthenticationPassword
countryName
appendToExisting
dropExisting
timeChunk
PartitionsSpec
forceGuaranteedRollup
reportParseExceptions
pushTimeout
segmentWriteOutMediumFactory
product_category
product_id
product_name
BUILD_SEGMENTS
DETERMINE_PARTITIONS
forceTimeChunkLock
taskLockTimeout
warehouseSource
warehousePath
index.md
DOUBLE_ARRAY
DOY
DateTimeFormat
LONG_ARRAY
Los_Angeles
P3M
PT12H
STRING_ARRAY
String.format
acos
args
arr1
arr2
array_append
array_concat
ARRAY_CONCAT
array_set_add
array_set_add_all
array_contains
array_length
array_offset
array_offset_of
array_ordinal
array_ordinal_of
array_overlap
array_prepend
array_slice
array_to_string
scalar_in_array
asin
atan
atan2
bitwise
bitwiseAnd
bitwiseComplement
bitwiseConvertDoubleToLongBits
bitwiseConvertLongBitsToDouble
bitwiseOr
bitwiseShiftLeft
bitwiseShiftRight
bitwiseXor
bloom_filter_test
cartesian_fold
cartesian_map
case_searched
case_simple
cbrt
concat
copysign
expm1
expr
expr1
expr2
expr3
expr4
fromIndex
getExponent
hypot
ipv4_match
ipv4_parse
ipv4_stringify
ipv6_match
# IPv6 Address Example Sections
75e9
efa4
29c6
85f6
232c
isnull
java.lang.Math
java.lang.String
JNA
log10
log1p
lpad
ltrim
nextUp
nextafter
notnull
nvl
parse_long
regexp_extract
regexp_like
regexp_replace
contains_string
icontains_string
result1
result2
rint
rpad
rtrim
safe_divide
scalb
signum
str1
str2
string_to_array
stringAny
stringFirst
stringLast
Strlen
strlen
strpos
timestamp_ceil
timestamp_extract
timestamp_floor
timestamp_format
timestamp_parse
timestamp_shift
todegrees
toradians
ulp
unix_timestamp
value1
value2
valueOf
IEC
human_readable_binary_byte_format
human_readable_decimal_byte_format
human_readable_decimal_format
RADStack
00.000Z
2015-09-12T03
2015-09-12T05
2016-06-27_2016-06-28
Param
SupervisorSpec
dropRule
druid.query.segmentMetadata.defaultHistory
isointerval
json
loadRule
maxTime
minTime
numCandidates
param
segmentId1
segmentId2
taskId
taskid
un
100MiB
128MiB
15ms
2.5MiB
24GiB
256MiB
30GiB-60GiB
4GiB
5MB
64KiB
8GiB
G1GC
GroupBys
QoS-type
DumpSegment
SegmentMetadata
__time
bitmapSerdeFactory
columnName
index.zip
time-iso8601
hadoopStorageDirectory
0.14.x
G1
Temurin
0.14.x
1s
Bufferpool
Filesystem
JVMMonitor
jvmVersion
QueryCountStatsMonitor
Sys
SysMonitor
TaskCountStatsMonitor
TaskSlotCountStatsMonitor
WorkerTaskCountStatsMonitor
workerVersion
bufferCapacity
bufferpoolName
cms
cpuName
cpuTime
druid.server.http.numThreads
druid.server.http.queueSize
fsDevName
fsDirName
fsOptions
fsSysTypeName
fsTypeName
g1
gcGen
gcName
handoffed
hasFilters
memKind
nativeQueryIds
netAddress
netHwaddr
netName
noticeType
numComplexMetrics
numDimensions
numMetrics
poolKind
poolName
remoteAddress
segmentAvailabilityConfirmed
serviceName
taskActionType
taskIngestionMode
taskStatus
taskType
threadPoolNumBusyThreads.
threadPoolNumIdleThreads
threadPoolNumTotalThreads.
CDH
Classloader
assembly.sbt
build.sbt
classloader
druid_build
mapred-default
mapred-site
sbt
scala-2
org.apache.hadoop
proxy.com.
remoteRepository
JBOD
druid.processing.buffer.sizeBytes.
druid.processing.numMergeBuffers
druid.processing.numThreads
tmpfs
broadcastByInterval
broadcastByPeriod
broadcastForever
colocatedDataSources
dropBeforeByPeriod
dropByInterval
dropByPeriod
dropForever
loadByInterval
loadByPeriod
loadForever
700MB
128GiB
16GiB
256GiB
4GiB
512GiB
64GiB
Nano-Quickstart
i3
i3.16xlarge
i3.2xlarge
i3.4xlarge
i3.8xlarge
CN
subjectAltNames
HyperUnique
hyperUnique
longSum
groupBys
dataSourceMetadata
ExtractionDimensionSpec
SimpleDateFormat
bar_1
dimensionSpecs
isWhitelist
joda
nullHandling
product_1
product_3
registeredLookup
timeFormat
tz
v3
weekyears
___bar
caseSensitive
extractionFn
insensitive_contains
last_name
lowerStrict
upperStrict
1970-01-01T00
P2W
PT0.750S
PT1H30M
TimeseriesQuery
D1
D2
D3
druid.query.groupBy.defaultStrategy
druid.query.groupBy.maxSelectorDictionarySize
druid.query.groupBy.maxMergingDictionarySize
druid.query.groupBy.maxOnDiskStorage
druid.query.groupBy.maxResults.
groupByStrategy
maxOnDiskStorage
maxResults
orderby
orderbys
outputName
pushdown
row1
subtotalsSpec
tradeoff
HavingSpec
HavingSpecs
dimSelector
equalTo
greaterThan
lessThan
DefaultDimensionSpec
druid-hll
isInputHyperUnique
pre-join
DefaultLimitSpec
OrderByColumnSpec
OrderByColumnSpecs
dimensionOrder
60_000
kafka-extraction-namespace
mins
tierName
row2
row3
row4
t3
t4
t5
groupByEnableMultiValueUnnesting
unnesting
500ms
tenant_id
fieldAccess
finalizingFieldAccess
hyperUniqueCardinality
brokerService
bySegment
doubleSum
druid.broker.cache.populateCache
druid.broker.cache.populateResultLevelCache
druid.broker.cache.useCache
druid.broker.cache.useResultLevelCache
druid.historical.cache.populateCache
druid.historical.cache.useCache
enableParallelMerge
enableJoinLeftTableScanDirect
enableJoinFilterPushDown
enableJoinFilterRewrite
enableRewriteJoinToFilter
enableJoinFilterRewriteValueColumnFilters
floatFirst
floatLast
floatSum
joinFilterRewriteMaxSize
maxQueuedBytes
maxScatterGatherBytes
minTopNThreshold
parallelMergeInitialYieldRows
parallelMergeParallelism
parallelMergeSmallBatchRows
populateCache
populateResultLevelCache
queryId
row-matchers
serializeDateTimeAsLong
serializeDateTimeAsLongInner
skipEmptyBuckets
useCache
useResultLevelCache
vectorSize
enableJoinLeftTableScanDirect
enableJoinFilterPushDown
enableJoinFilterRewrite
enableJoinFilterRewriteValueColumnFilters
joinFilterRewriteMaxSize
7KiB
DatasourceMetadata
TimeBoundary
errorClass
errorMessage
x-jackson-smile
batchSize
compactedList
druid.query.scan.legacy
druid.query.scan.maxRowsQueuedForOrdering
druid.query.scan.maxSegmentPartitionsOrderedInMemory
maxRowsQueuedForOrdering
maxSegmentPartitionsOrderedInMemory
resultFormat
valueVector
SearchQuerySpec
cursorOnly
druid.query.search.searchStrategy
queryableIndexSegment
searchDimensions
searchStrategy
useIndexes
ContainsSearchQuerySpec
FragmentSearchQuerySpec
InsensitiveContainsSearchQuerySpec
RegexSearchQuerySpec
analysisType
analysisTypes
aggregatorMergeStrategy
lenientAggregatorMerge
minmax
segmentMetadata
toInclude
PagingSpec
fromNext
pagingSpec
BoundFilter
GroupByQuery
SearchQuery
TopNMetricSpec
compareTo
file12
file2
_x_
fieldName1
fieldName2
DimensionTopNMetricSpec
metricSpec
previousStop
GroupByQuery
top500
outputType
1.9TB
16CPU
WebUpd8
m5.2xlarge
metadata.storage.
256GiB
128GiB
PATH_TO_DRUID
namenode
segmentID
segmentIds
dstIP
dstPort
srcIP
srcPort
common_runtime_properties
druid.extensions.directory
druid.extensions.loadList
druid.hadoop.security.kerberos.keytab
druid.hadoop.security.kerberos.principal
druid.indexer.logs.directory
druid.indexer.logs.type
druid.storage.storageDirectory
druid.storage.type
hdfs.headless.keytab
indexing_log
keytabs
dsql
2015-09-12T12
clickstreams
uid
_k_
Bridgerton
Hellmar
bear-111
10KiB
2GiB
512KiB
1GiB
KiB
GiB
00.000Z
100ms
10ms
1GB
1_000_000
2012-01-01T00
2GB
524288000L
5MiB
8u60
Autoscaler
APPROX_COUNT_DISTINCT_BUILTIN
AvaticaConnectionBalancer
File.getFreeSpace
File.getTotalSpace
ForkJoinPool
GCE
HadoopIndexTasks
HttpEmitter
HttpPostEmitter
InetAddress.getLocalHost
IOConfig
JRE8u60
KeyManager
L1
L2
ListManagedInstances
LoadSpec
LoggingEmitter
Los_Angeles
MDC
NoopServiceEmitter
NUMA
ONLY_EVENTS
P1D
P1W
PT-1S
PT0.050S
PT10M
PT10S
PT15M
PT1800S
PT1M
PT1S
PT24H
PT300S
PT30S
PT3600S
PT5M
PT5S
PT60S
PT90M
Param
Runtime.maxMemory
SSLContext
SegmentMetadata
SegmentWriteOutMediumFactory
ServiceEmitter
System.getProperty
TLSv1.2
TrustManager
TuningConfig
_N_
_default
_default_tier
addr
affinityConfig
allowAll
ANDed
array_mod
autoscale
autoscalers
batch_index_task
cgroup
classloader
com.metamx
common.runtime.properties
cpuacct
dataSourceName
datetime
defaultHistory
doubleMax
doubleMin
doubleSum
druid.enableTlsPort
druid.indexer.autoscale.workerVersion
druid.service
druid.storage.disableAcl
druid_audit
druid_config
druid_dataSource
druid_pendingSegments
druid_rules
druid_segments
druid_supervisors
druid_tasklocks
druid_tasklogs
druid_tasks
DruidQueryRel
durationToRetain
ec2
equalDistribution
extractionFn
filename
file.encoding
fillCapacity
first_location
floatMax
floatAny
floatMin
floatSum
freeSpacePercent
gce
gce-extensions
getCanonicalHostName
groupBy
hdfs
httpRemote
indexTask
info_dir
inlining
java.class.path
java.io.tmpdir
javaOpts
javaOptsArray
leastBytesUsed
loadList
loadqueuepeon
loadspec
localStorage
maxHeaderSize
maxQueuedBytes
maxSize
middlemanager
minTimeMs
minmax
mins
nullable
orderby
orderbys
org.apache.druid
org.apache.druid.jetty.RequestLog
org.apache.hadoop
OSHI
OshiSysMonitor
overlord.html
pendingSegments
pre-flight
preloaded
queryType
remoteTaskRunnerConfig
rendezvousHash
replicants
resultsets
roundRobin
runtime.properties
runtime.properties.
s3
s3a
s3n
slf4j
sql
sqlQuery
successfulSending
[S]igar
taskBlackListCleanupPeriod
tasklogs
timeBoundary
timestampSpec
tmp
tmpfs
truststore
tuningConfig
unioning
useIndexes
user.timezone
v0.12.0
versionReplacementString
workerId
yyyy-MM-dd
taskType
index_kafka
c1
c2
ds1
equalDistributionWithCategorySpec
fillCapacityWithCategorySpec
WorkerCategorySpec
workerCategorySpec
CategoryConfig
logsearch
2000-01-01T01
DateTimeFormat
JsonPath
autodetect
createBitmapIndex
dimensionExclusions
expr
jackson-jq
missingValue
skipBytesInMemoryOverheadCheck
spatialDimensions
radiusUnit
euclidean
kilometers
useFieldDiscovery
4CPU
cityName
countryIsoCode
countryName
isAnonymous
isMinor
isNew
isRobot
isUnpatrolled
metroCode
regionIsoCode
regionName
4GiB
512GiB
json
metastore
UserGroupInformation
CVE-2019-17571
CVE-2019-12399
CVE-2018-17196
bin.tar.gz
0s
1T
3G
1_000
1_000_000
1_000_000_000
1_000_000_000_000
1_000_000_000_000_000
Giga
Tera
Peta
KiB
MiB
GiB
TiB
PiB
protobuf
Golang
multiValueHandling
_n_
KLL
KllFloatsSketch
KllDoublesSketch
PMF
CDF
maxStreamLength
toString
100TB
compressedBigDecimal
limitSpec
metricsSpec
postAggregations
SaleAmount
IngestionSpec
doubleSum
ANY_VALUE
APPROX_COUNT_DISTINCT_DS_HLL
APPROX_COUNT_DISTINCT_DS_THETA
APPROX_QUANTILE_DS
APPROX_QUANTILE_FIXED_BUCKETS
ARRAY_CONCAT_AGG
BIT_AND
BIT_OR
BIT_XOR
BITWISE_AND
BITWISE_COMPLEMENT
BITWISE_CONVERT_DOUBLE_TO_LONG_BITS
BITWISE_CONVERT_LONG_BITS_TO_DOUBLE
BITWISE_OR
BITWISE_SHIFT_LEFT
BITWISE_SHIFT_RIGHT
BITWISE_XOR
BLOOM_FILTER
BTRIM
CHAR_LENGTH
CHARACTER_LENGTH
CURRENT_DATE
CURRENT_TIMESTAMP
DATE_TRUNC
DECODE_BASE64_COMPLEX
DECODE_BASE64_UTF8
DS_CDF
DS_GET_QUANTILE
DS_GET_QUANTILES
DS_HISTOGRAM
DS_HLL
DS_QUANTILE_SUMMARY
DS_QUANTILES_SKETCH
DS_RANK
DS_THETA
DS_TUPLE_DOUBLES
DS_TUPLE_DOUBLES_INTERSECT
DS_TUPLE_DOUBLES_METRICS_SUM_ESTIMATE
DS_TUPLE_DOUBLES_NOT
DS_TUPLE_DOUBLES_UNION
EARLIEST_BY
_e_
HLL_SKETCH_ESTIMATE
HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS
HLL_SKETCH_TO_STRING
HLL_SKETCH_UNION
LATEST_BY
base-10
MV_APPEND
MV_CONCAT
MV_CONTAINS
MV_FILTER_NONE
MV_FILTER_ONLY
MV_LENGTH
MV_OFFSET
MV_OFFSET_OF
MV_ORDINAL
MV_ORDINAL_OF
MV_OVERLAP
MV_PREPEND
MV_SLICE
MV_TO_STRING
NULLIF
_n_th
STDDEV_POP
STDDEV_SAMP
STRING_FORMAT
STRING_TO_MV
SUBSTR
TDIGEST_GENERATE_SKETCH
TDIGEST_QUANTILE
TEXTCAT
THETA_SKETCH_ESTIMATE
THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS
THETA_SKETCH_INTERSECT
THETA_SKETCH_NOT
THETA_SKETCH_UNION
TIME_CEIL
TIME_EXTRACT
TIME_FLOOR
TIME_FORMAT
TIME_IN_INTERVAL
TIMESTAMP_TO_MILLIS
TIMESTAMPADD
TIMESTAMPDIFF
TRUNC
VAR_POP
VAR_SAMP
KTable
Aotearoa
Czechia
Zeelund
nano
MacOS
RHEL
psutil
pathlib
kttm_simple
dist_country
# Extensions
druid-avro-extensions
druid-azure-extensions
druid-basic-security
druid-bloom-filter
druid-datasketches
druid-google-extensions
druid-hdfs-storage
druid-histogram
druid-kafka-extraction-name
druid-kafka-indexing-service
druid-kinesis-indexing-service
druid-kerberos
druid-lookups-cached-global
druid-lookups-cached-single
druid-multi-stage-query
druid-orc-extensions
druid-parquet-extensions
druid-protobuf-extensions
druid-ranger-security
druid-s3-extensions
druid-ec2-extensions
druid-aws-rds-extensions
druid-stats
mysql-metadata-storage
postgresql-metadata-storage
simple-client-sslcontext
druid-pac4j
druid-kubernetes-extensions
aliyun-oss-extensions
ambari-metrics-emitter
druid-cassandra-storage
druid-cloudfiles-extensions
druid-compressed-bigdecimal
druid-distinctcount
druid-redis-cache
druid-time-min-max
sqlserver-metadata-storage
graphite-emitter|Graphite metrics emitter
statsd-emitter|StatsD metrics emitter
kafka-emitter|Kafka metrics emitter
druid-thrift-extensions
druid-opentsdb-emitter
materialized-view-selection
materialized-view-maintenance
druid-moving-average-query
druid-influxdb-emitter
druid-momentsketch
druid-tdigestsketch
gce-extensions
prometheus-emitter
kubernetes-overlord-extensions
UCS
ISO646-US
completeTasks
runningTasks
waitingTasks
pendingTasks
shutdownAllTasks
supervisorId
suspendAll
resumeAll
terminateAll
selfDiscovered
loadstatus
isLeader
taskslots
sqlQueryId
useAzureCredentialsChain
DefaultAzureCredential
LAST_VALUE
markUnused
markUsed
segmentId
aggregateMultipleValues
appRegistrationClientId
appRegistrationClientSecret
tenantId
relativeError
ddSketch
DDSketch
druid-ddsketch
numBins
- ../docs/development/extensions-contrib/spectator-histogram.md
SpectatorHistogram
PercentileBuckets
spectatorHistogram
spectatorHistogramTimer
spectatorHistogramDistribution
percentileSpectatorHistogram
percentilesSpectatorHistogram
- ../docs/development/extensions-contrib/ddsketch-quantiles.md
quantilesFromDDSketch
quantileFromDDSketch
collapsingLowestDense
snapshotVersion
- ../docs/development/extensions-core/catalog.md
ColumnSpec
PropertyKeyName
PropertyValueType
tableSpec
TableSpec