diff --git a/DataIngestion/schemadesign.md b/DataIngestion/schemadesign.md index 3934dd5..98ddcc3 100644 --- a/DataIngestion/schemadesign.md +++ b/DataIngestion/schemadesign.md @@ -102,9 +102,9 @@ Druid可以在接收数据时将其汇总,以最小化需要存储的原始数 Druid schema必须始终包含一个主时间戳, 主时间戳用于对数据进行 [分区和排序](ingestion.md#分区),因此它应该是您最常筛选的时间戳。Druid能够快速识别和检索与主时间戳列的时间范围相对应的数据。 -如果数据有多个时间戳,则可以将其他时间戳作为辅助时间戳摄取。最好的方法是将它们作为 [毫秒格式的Long类型维度](ingestion.md#dimensionsspec) 摄取。如有必要,可以使用 [`transformSpec`](ingestion.md#transformspec) 和 `timestamp_parse` 等 [表达式](../Misc/expression.md) 将它们转换成这种格式,后者返回毫秒时间戳。 +如果数据有多个时间戳,则可以将其他时间戳作为辅助时间戳摄取。最好的方法是将它们作为 [毫秒格式的Long类型维度](ingestion.md#dimensionsspec) 摄取。如有必要,可以使用 [`transformSpec`](ingestion.md#transformspec) 和 `timestamp_parse` 等 [表达式](../misc/expression.md) 将它们转换成这种格式,后者返回毫秒时间戳。 -在查询时,可以使用诸如 `MILLIS_TO_TIMESTAMP`、`TIME_FLOOR` 等 [SQL时间函数](../Querying/druidsql.md) 查询辅助时间戳。如果您使用的是原生Druid查询,那么可以使用 [表达式](../Misc/expression.md)。 +在查询时,可以使用诸如 `MILLIS_TO_TIMESTAMP`、`TIME_FLOOR` 等 [SQL时间函数](../Querying/druidsql.md) 查询辅助时间戳。如果您使用的是原生Druid查询,那么可以使用 [表达式](../misc/expression.md)。 #### 嵌套维度 diff --git a/Misc/expression.md b/Misc/expression.md deleted file mode 100644 index 1eef3b8..0000000 --- a/Misc/expression.md +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/Misc/index.md b/Misc/index.md deleted file mode 100644 index 2a71d64..0000000 --- a/Misc/index.md +++ /dev/null @@ -1 +0,0 @@ -## 其他 \ No newline at end of file diff --git a/Misc/learning.md b/Misc/learning.md deleted file mode 100644 index fdf4547..0000000 --- a/Misc/learning.md +++ /dev/null @@ -1,43 +0,0 @@ -## Druid入门学习类文章 - -1. [十分钟了解Apache Druid](https://www.cnblogs.com/WeaRang/p/12421873.html) - - Apache Druid是一个集时间序列数据库、数据仓库和全文检索系统特点于一体的分析性数据平台。本文将带你简单了解Druid的特性,使用场景,技术特点和架构。这将有助于你选型数据存储方案,深入了解Druid存储,深入了解时间序列存储等。 - - [原文链接](https://www.cnblogs.com/WeaRang/p/12421873.html) - -2. 
[勾叔谈大数据:大厂做法:Apache Druid在电商领域的实践应用](https://www.bilibili.com/read/cv8594505) - - Apache Druid虽然尚未在各个企业绝对普及,但是在互联网大厂是得到了较多应用的,毕竟它出道时间不长,还算作是新技术呢,而对于新技术,互联网一线大厂往往是践行者。 - - [原文链接](https://www.bilibili.com/read/cv8594505) - -3. [Kylin、Druid、ClickHouse核心技术对比](https://zhuanlan.zhihu.com/p/267311457) - - Druid索引结构使用自定义的数据结构,整体上它是一种列式存储结构,每个列独立一个逻辑文件(实际上是一个物理文件,在物理文件内部标记了每个列的start和offset) - - [原文链接](https://zhuanlan.zhihu.com/p/267311457) - -4. [适用于大数据的开源OLAP系统的比较:ClickHouse,Druid和Pinot](https://www.cnblogs.com/029zz010buct/p/12674287.html) - - ClickHouse,Druid和Pinot在效率和性能优化上具有大​​约相同的“极限”。没有“魔术药”可以使这些系统中的任何一个都比其他系统快得多。在当前状态下,这些系统在某些基准测试中的性能有很大不同,这一事实并不会让您感到困惑。 - - [原文链接](https://www.cnblogs.com/029zz010buct/p/12674287.html) - -5. [有人说下kudu,kylin,druid,clickhouse的区别,使用场景么?](https://www.zhihu.com/question/303991599) - - Kylin 和 ClickHouse 都能通过 SQL 的方式在 PB 数据量级下,亚秒级(绝多数查询 5s内返回)返回 OLAP(在线分析查询) 查询结果 - - [原文链接](https://www.zhihu.com/question/303991599) - -6. [OLAP演进实战,Druid对比ClickHouse输在哪里?](https://www.manongdao.com/article-2427509.html) - - 本文介绍eBay广告数据平台的基本情况,并对比分析了ClickHouse与Druid的使用特点。基于ClickHouse表现出的良好性能和扩展能力,本文介绍了如何将eBay广告系统从Druid迁移至ClickHouse,希望能为同业人员带来一定的启发。 - - [原文链接](https://www.manongdao.com/article-2427509.html) - -7. [clickhouse和druid实时分析性能总结](https://www.pianshen.com/article/26311113725/) - - clickhouse 是俄罗斯的“百度”Yandex公司在2016年开源的,一款针对大数据实时分析的高性能分布式数据库,与之对应的有hadoop生态hive,Vertica和百度出品的palo。 - - [原文链接](https://www.pianshen.com/article/26311113725/) \ No newline at end of file diff --git a/Misc/optimized.md b/Misc/optimized.md deleted file mode 100644 index 8569cbd..0000000 --- a/Misc/optimized.md +++ /dev/null @@ -1,36 +0,0 @@ - -## 各个大厂对Druid的优化与实践类文章合集 - -1. 
[快手 Druid 精确去重的设计和实现](https://www.infoq.cn/article/YdPlYzWCCQ5sPR_iKtVz) - - 快手的业务特点包括超大数据规模、毫秒级查询时延、高数据实时性要求、高并发查询、高稳定性以及较高的 Schema 灵活性要求;因此快手选择 Druid 平台作为底层架构。由于 Druid 原生不支持数据精确去重功能,而快手业务中会涉及到例如计费等场景,有精确去重的需求。因此,本文重点讲述如何在 Druid 平台中实现精确去重。另一方面,Druid 对外的接口是 json 形式 ( Druid 0.9 版本之后逐步支持 SQL ) ,对 SQL 并不友好,本文最后部分会简述 Druid 平台与 MySQL 交互方面做的一些改进。 - - [原文链接](https://www.infoq.cn/article/YdPlYzWCCQ5sPR_iKtVz) - -2. [基于ApacheDruid的实时分析平台在爱奇艺的实践](https://www.sohu.com/a/398880575_315839) - - 爱奇艺大数据服务团队评估了市面上主流的OLAP引擎,最终选择Apache Druid时序数据库来满足业务的实时分析需求。本文将介绍Druid在爱奇艺的实践情况、优化经验以及平台化建设的一些思考 - - [原文链接](https://www.sohu.com/a/398880575_315839) - -3. [熵简技术谈 | 实时OLAP引擎之Apache Druid:架构、原理和应用实践](https://zhuanlan.zhihu.com/p/178572172) - - 本文以实时 OLAP 引擎的优秀代表 Druid 为研究对象,详细介绍 Druid 的架构思想和核心特性。在此基础上,我们介绍了熵简科技在数据智能分析场景下,针对私有化部署与实时响应优化的实践经验。 - - [原文链接](https://zhuanlan.zhihu.com/p/178572172) - -4. [Apache Druid性能测评-云栖社区-阿里云](https://developer.aliyun.com/article/712725) - - [原文链接](https://developer.aliyun.com/article/712725) - -5. [Druid在有赞的实践](https://www.cnblogs.com/oldtrafford/p/10301581.html) - - 有赞作为一家 SaaS 公司,有很多的业务的场景和非常大量的实时数据和离线数据。在没有是使用 Druid 之前,一些 OLAP 场景的场景分析,开发的同学都是使用 SparkStreaming 或者 Storm 做的。用这类方案会除了需要写实时任务之外,还需要为了查询精心设计存储。带来问题是:开发的周期长;初期的存储设计很难满足需求的迭代发展;不可扩展。 - - [原文链接](https://www.cnblogs.com/oldtrafford/p/10301581.html) - -6. [Druid 在小米公司的技术实践](https://zhuanlan.zhihu.com/p/25593670) - - Druid 作为一款开源的实时大数据分析软件,自诞生以来,凭借着自己优秀的特质,逐渐在技术圈收获了越来越多的知名度与口碑,并陆续成为了很多技术团队解决方案中的关键一环,从而真正在很多公司的技术栈中赢得了一席之地。 - [原文链接](https://zhuanlan.zhihu.com/p/25593670) - diff --git a/Misc/sourcecode.md b/Misc/sourcecode.md deleted file mode 100644 index a55b928..0000000 --- a/Misc/sourcecode.md +++ /dev/null @@ -1,29 +0,0 @@ -## Druid源码解析类文章合集 - -1. 
[Apache Druid源码导读--Google Guice DI框架](https://blog.csdn.net/yueguanghaidao/article/details/102531570) - - 在大数据应用组件中,有两款OLAP引擎应用广泛,一款是偏离线处理的Kylin,另一个是偏实时的Druid。Kylin是一款国人开源的优秀离线OLAP引擎,基本上是Hadoop领域离线OLAP事实标准,在离线报表,指标分析领域应用广泛。而Apache Druid则在实时OLAP领域独领风骚,优异的性能、高可用、易扩展。 - - [原文链接]((https://blog.csdn.net/yueguanghaidao/article/details/102531570)) - -2. [Apache Druid源码解析的一个合集](https://blog.csdn.net/mytobaby00/category_7561069.html) - - [原文链接](https://blog.csdn.net/mytobaby00/category_7561069.html) - - * [Druid中的Extension在启动时是如何加载的](https://blog.csdn.net/mytobaby00/article/details/79857681) - * [Druid解析之管理用的接口大全](https://blog.csdn.net/mytobaby00/article/details/80088795) - * [Druid原理分析之内存池管理](https://blog.csdn.net/mytobaby00/article/details/80071101) - * [Druid源码解析之Segment](Druid源码解析之Segment) - * [Druid源码解析之Column](https://blog.csdn.net/mytobaby00/article/details/80056826) - * [Druid源码解析之HDFS存储](https://blog.csdn.net/mytobaby00/article/details/80045662) - * [Druid源码解析之Coordinator](https://blog.csdn.net/mytobaby00/article/details/80041970) - * [让Druid实现事件设备数留存数的精准计算](https://blog.csdn.net/mytobaby00/article/details/79804685) - * [在Druid中定制自己的扩展【Extension】](https://blog.csdn.net/mytobaby00/article/details/79803605) - * [Druid原理分析之“批”任务数据流转过程](https://blog.csdn.net/mytobaby00/article/details/79802776) - * [Druid原理分析之“流”任务数据流转过程](https://blog.csdn.net/mytobaby00/article/details/79801614) - * [Druid原理分析之Segment的存储结构](https://blog.csdn.net/mytobaby00/article/details/79801425) - * [Druid索引与查询原理简析](https://blog.csdn.net/mytobaby00/article/details/79800553) - * [Druid中的负载均衡策略分析](https://blog.csdn.net/mytobaby00/article/details/79860836) - * [Druid中的Kafka Indexing Service源码分析](https://blog.csdn.net/mytobaby00/article/details/79858403) - * [Druid源码分析之Query -- Sequence与Yielder](https://blog.csdn.net/mytobaby00/article/details/80103230) - * [Druid原理分析之Segment的存储结构](https://blog.csdn.net/mytobaby00/article/details/79801425) \ No newline at end of file diff --git a/Misc/usercase.md 
b/Misc/usercase.md deleted file mode 100644 index c7dbba9..0000000 --- a/Misc/usercase.md +++ /dev/null @@ -1 +0,0 @@ -## 使用场景 \ No newline at end of file diff --git a/Querying/datasource.md b/Querying/datasource.md index b52e222..0e6091d 100644 --- a/Querying/datasource.md +++ b/Querying/datasource.md @@ -258,7 +258,7 @@ SQL中的join格式如下: | `left` | 左侧数据源。 类型必须是 `table`, `join`, `lookup`, `query`或者`inline`。将另一个join作为左数据源放置允许您任意连接多个数据源。| | `right` | 右侧数据源。 类型必须是 `lookup`, `query` 或者 `inline`。注意:这一点比Druid SQL更加严格 | | `rightPrefix` | 字符串前缀,将应用于右侧数据源中的所有列,以防止它们与左侧数据源中的列发生冲突。可以是任何字符串,只要它不是空的并且不是 `__time` 字符串的前缀。左侧以 `rightPrefix` 开头的任何列都将被隐藏。您需要提供一个前缀,它不会从左侧隐藏任何重要的列。 | -| `condition` | [表达式](../Misc/expression.md),该表达式必须是相等的,其中一侧是左侧的表达式,另一侧是对右侧的简单列引用。注意,这比Druid SQL所要求的更严格:这里,右边的引用必须是一个简单的列引用;在SQL中,它可以是一个表达式。 | +| `condition` | [表达式](../misc/expression.md),该表达式必须是相等的,其中一侧是左侧的表达式,另一侧是对右侧的简单列引用。注意,这比Druid SQL所要求的更严格:这里,右边的引用必须是一个简单的列引用;在SQL中,它可以是一个表达式。 | | `joinType` | `INNER` 或者 `LEFT` | **Join的性能** diff --git a/README.md b/README.md index 8bd044a..b51248c 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ Druid 具有非常强大的 UI 界面,能够让用户进行 即席查询(Ad- ### 云原生、流原生的分析型数据库 Druid专为需要快速数据查询与摄入的工作流程而设计,在即时数据可见性、即席查询、运营分析以及高并发等方面表现非常出色。 -在实际中[众多场景](Misc/usercase.md)下数据仓库解决方案中,可以考虑将Druid当做一种开源的替代解决方案。 +在实际中 [众多场景](misc/index.md) 下数据仓库解决方案中,可以考虑将Druid当做一种开源的替代解决方案。 ### 可轻松与现有的数据管道进行集成 Druid原生支持从[Kafka](http://kafka.apache.org/)、[Amazon Kinesis](https://aws.amazon.com/cn/kinesis/)等消息总线中流式的消费数据,也同时支持从[HDFS](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html)、[Amazon S3](https://aws.amazon.com/cn/s3/)等存储服务中批量的加载数据文件。 @@ -41,7 +41,7 @@ Druid原生支持从[Kafka](http://kafka.apache.org/)、[Amazon Kinesis](https:/ 
Druid创新地在架构设计上吸收和结合了[数据仓库](https://en.wikipedia.org/wiki/Data_warehouse)、[时序数据库](https://en.wikipedia.org/wiki/Time_series_database)以及[检索系统](https://en.wikipedia.org/wiki/Search_engine_(computing))的优势,在已经完成的[基准测试](https://imply.io/post/performance-benchmark-druid-presto-hive)中展现出来的性能远远超过数据摄入与查询的传统解决方案。 ### 解锁了一种新型的工作流程 -Druid为点击流、APM、供应链、网络监测、市场营销以及其他事件驱动类型的数据分析解锁了一种[新型的查询与工作流程](Misc/usercase.md),它专为实时和历史数据高效快速的即席查询而设计。 +Druid为点击流、APM、供应链、网络监测、市场营销以及其他事件驱动类型的数据分析解锁了一种[新型的查询与工作流程](misc/usercase.md),它专为实时和历史数据高效快速的即席查询而设计。 ### 可部署在AWS/GCP/Azure,混合云,Kubernetes, 以及裸机上 无论在云上还是本地,Druid可以轻松的部署在商用硬件上的任何*NIX环境。部署Druid也是非常简单的,包括集群的扩容或者下线都也同样很简单。 diff --git a/SUMMARY.md b/SUMMARY.md index 5544149..44c6ff3 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -5,9 +5,9 @@ * [技术类](book/tech.md) * [优质技术文章合集]() - * [分析调研类](Misc/learning.md) - * [源码解读类](Misc/sourcecode.md) - * [优化实践类](Misc/optimized.md) + * [分析调研类](misc/learning.md) + * [源码解读类](misc/sourcecode.md) + * [优化实践类](misc/optimized.md) * [整体介绍]() * [Druid概述](README.md) @@ -110,7 +110,7 @@ * [开发指南](Development/index.md) * [其他相关]() - * [其他相关](Misc/index.md) + * [其他相关](misc/index.md) * [站内链接合集]() * [产品架构师进阶](book/1.md) diff --git a/misc/index.md b/misc/index.md new file mode 100644 index 0000000..307735d --- /dev/null +++ b/misc/index.md @@ -0,0 +1,87 @@ +## Druid 资源快速导航 + +## 文章合集 +* [Apache Druid源码导读--Google Guice DI框架](https://blog.csdn.net/yueguanghaidao/article/details/102531570) + 在大数据应用组件中,有两款OLAP引擎应用广泛,一款是偏离线处理的Kylin,另一个是偏实时的Druid。Kylin是一款国人开源的优秀离线OLAP引擎,基本上是Hadoop领域离线OLAP事实标准,在离线报表,指标分析领域应用广泛。而Apache Druid则在实时OLAP领域独领风骚,优异的性能、高可用、易扩展。 + +* [Apache Druid源码解析的一个合集](https://blog.csdn.net/mytobaby00/category_7561069.html) + +* [Druid中的Extension在启动时是如何加载的](https://blog.csdn.net/mytobaby00/article/details/79857681) + +* [Druid解析之管理用的接口大全](https://blog.csdn.net/mytobaby00/article/details/80088795) + +* [Druid原理分析之内存池管理](https://blog.csdn.net/mytobaby00/article/details/80071101) + +* [Druid源码解析之Segment](Druid源码解析之Segment) + +* 
[Druid源码解析之Column](https://blog.csdn.net/mytobaby00/article/details/80056826) + +* [Druid源码解析之HDFS存储](https://blog.csdn.net/mytobaby00/article/details/80045662) + +* [Druid源码解析之Coordinator](https://blog.csdn.net/mytobaby00/article/details/80041970) + +* [让Druid实现事件设备数留存数的精准计算](https://blog.csdn.net/mytobaby00/article/details/79804685) + +* [在Druid中定制自己的扩展【Extension】](https://blog.csdn.net/mytobaby00/article/details/79803605) + +* [Druid原理分析之“批”任务数据流转过程](https://blog.csdn.net/mytobaby00/article/details/79802776) + +* [Druid原理分析之“流”任务数据流转过程](https://blog.csdn.net/mytobaby00/article/details/79801614) + +* [Druid原理分析之Segment的存储结构](https://blog.csdn.net/mytobaby00/article/details/79801425) + +* [Druid索引与查询原理简析](https://blog.csdn.net/mytobaby00/article/details/79800553) + +* [Druid中的负载均衡策略分析](https://blog.csdn.net/mytobaby00/article/details/79860836) + +* [Druid中的Kafka Indexing Service源码分析](https://blog.csdn.net/mytobaby00/article/details/79858403) + +* [Druid源码分析之Query -- Sequence与Yielder](https://blog.csdn.net/mytobaby00/article/details/80103230) + +* [Druid原理分析之Segment的存储结构](https://blog.csdn.net/mytobaby00/article/details/79801425) + + + +## 各个大厂对Druid的优化与实践类文章合集 + +* [快手 Druid 精确去重的设计和实现](https://www.infoq.cn/article/YdPlYzWCCQ5sPR_iKtVz) + 快手的业务特点包括超大数据规模、毫秒级查询时延、高数据实时性要求、高并发查询、高稳定性以及较高的 Schema 灵活性要求;因此快手选择 Druid 平台作为底层架构。由于 Druid 原生不支持数据精确去重功能,而快手业务中会涉及到例如计费等场景,有精确去重的需求。因此,本文重点讲述如何在 Druid 平台中实现精确去重。另一方面,Druid 对外的接口是 json 形式 ( Druid 0.9 版本之后逐步支持 SQL ) ,对 SQL 并不友好,本文最后部分会简述 Druid 平台与 MySQL 交互方面做的一些改进。 + +* [基于ApacheDruid的实时分析平台在爱奇艺的实践](https://www.sohu.com/a/398880575_315839) + 爱奇艺大数据服务团队评估了市面上主流的OLAP引擎,最终选择Apache Druid时序数据库来满足业务的实时分析需求。本文将介绍Druid在爱奇艺的实践情况、优化经验以及平台化建设的一些思考 + +* [熵简技术谈 | 实时OLAP引擎之Apache Druid:架构、原理和应用实践](https://zhuanlan.zhihu.com/p/178572172) + 本文以实时 OLAP 引擎的优秀代表 Druid 为研究对象,详细介绍 Druid 的架构思想和核心特性。在此基础上,我们介绍了熵简科技在数据智能分析场景下,针对私有化部署与实时响应优化的实践经验。 + +* [Apache Druid性能测评-云栖社区-阿里云](https://developer.aliyun.com/article/712725) + +* 
[Druid在有赞的实践](https://www.cnblogs.com/oldtrafford/p/10301581.html) + 有赞作为一家 SaaS 公司,有很多的业务的场景和非常大量的实时数据和离线数据。在没有是使用 Druid 之前,一些 OLAP 场景的场景分析,开发的同学都是使用 SparkStreaming 或者 Storm 做的。用这类方案会除了需要写实时任务之外,还需要为了查询精心设计存储。带来问题是:开发的周期长;初期的存储设计很难满足需求的迭代发展;不可扩展。 + +* [Druid 在小米公司的技术实践](https://zhuanlan.zhihu.com/p/25593670) + Druid 作为一款开源的实时大数据分析软件,自诞生以来,凭借着自己优秀的特质,逐渐在技术圈收获了越来越多的知名度与口碑,并陆续成为了很多技术团队解决方案中的关键一环,从而真正在很多公司的技术栈中赢得了一席之地。 + + +## Druid入门学习类文章 + +* [十分钟了解Apache Druid](https://www.cnblogs.com/WeaRang/p/12421873.html) + Apache Druid是一个集时间序列数据库、数据仓库和全文检索系统特点于一体的分析性数据平台。本文将带你简单了解Druid的特性,使用场景,技术特点和架构。这将有助于你选型数据存储方案,深入了解Druid存储,深入了解时间序列存储等。 + + +* [勾叔谈大数据:大厂做法:Apache Druid在电商领域的实践应用](https://www.bilibili.com/read/cv8594505) + Apache Druid虽然尚未在各个企业绝对普及,但是在互联网大厂是得到了较多应用的,毕竟它出道时间不长,还算作是新技术呢,而对于新技术,互联网一线大厂往往是践行者。 + +* [Kylin、Druid、ClickHouse核心技术对比](https://zhuanlan.zhihu.com/p/267311457) + Druid索引结构使用自定义的数据结构,整体上它是一种列式存储结构,每个列独立一个逻辑文件(实际上是一个物理文件,在物理文件内部标记了每个列的start和offset) + +* [适用于大数据的开源OLAP系统的比较:ClickHouse,Druid和Pinot](https://www.cnblogs.com/029zz010buct/p/12674287.html) + ClickHouse,Druid和Pinot在效率和性能优化上具有大​​约相同的“极限”。没有“魔术药”可以使这些系统中的任何一个都比其他系统快得多。在当前状态下,这些系统在某些基准测试中的性能有很大不同,这一事实并不会让您感到困惑。 + +* [有人说下kudu,kylin,druid,clickhouse的区别,使用场景么?](https://www.zhihu.com/question/303991599) + Kylin 和 ClickHouse 都能通过 SQL 的方式在 PB 数据量级下,亚秒级(绝多数查询 5s内返回)返回 OLAP(在线分析查询) 查询结果 + +* [OLAP演进实战,Druid对比ClickHouse输在哪里?](https://www.manongdao.com/article-2427509.html) + 本文介绍eBay广告数据平台的基本情况,并对比分析了ClickHouse与Druid的使用特点。基于ClickHouse表现出的良好性能和扩展能力,本文介绍了如何将eBay广告系统从Druid迁移至ClickHouse,希望能为同业人员带来一定的启发。 + +* [clickhouse和druid实时分析性能总结](https://www.pianshen.com/article/26311113725/) + clickhouse 是俄罗斯的“百度”Yandex公司在2016年开源的,一款针对大数据实时分析的高性能分布式数据库,与之对应的有hadoop生态hive,Vertica和百度出品的palo。 \ No newline at end of file diff --git a/misc/math-expr.md b/misc/math-expr.md new file mode 100644 index 0000000..c82f92d --- /dev/null +++ b/misc/math-expr.md @@ -0,0 +1,260 @@ +--- +id: math-expr +title: "Expressions" +--- + + + +> 
Apache Druid supports two query languages: [native queries](../querying/querying.md) and [Druid SQL](../querying/sql.md). +> This document describes the native language. For information about functions available in SQL, refer to the +> [SQL documentation](../querying/sql.md#scalar-functions). + +Expressions are used in various places in the native query language, including +[virtual columns](../querying/virtual-columns.md) and [join conditions](../querying/datasource.md#join). They are +also generated by most [Druid SQL functions](../querying/sql.md#scalar-functions) during the +[query translation](../querying/sql.md#query-translation) process. + +This expression language supports the following operators (listed in decreasing order of precedence). + +|Operators|Description| +|---------|-----------| +|!, -|Unary NOT and Minus| +|^|Binary power op| +|*, /, %|Binary multiplicative| +|+, -|Binary additive| +|<, <=, >, >=, ==, !=|Binary Comparison| +|&&, \|\||Binary Logical AND, OR| + +Long, double, and string data types are supported. If a number contains a dot, it is interpreted as a double, otherwise it is interpreted as a long. That means, always add a '.' to your number if you want it interpreted as a double value. String literals should be quoted by single quotation marks. + +Additionally, the expression language supports long, double, and string arrays. Array literals are created by wrapping square brackets around a list of scalar literal values delimited by a comma or space character. All values in an array literal must be the same type, however null values are accepted. Typed empty arrays may be defined by prefixing with their type in angle brackets: `<STRING>[]`, `<DOUBLE>[]`, or `<LONG>[]`. + +Expressions can contain variables. Variable names may contain letters, digits, '\_' and '$'. Variable names must not begin with a digit. To use other special characters in a variable name, quote the name with double quotation marks. 
+ +For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string type, it's the evaluation result of 'Boolean.valueOf(string)'. + +[Multi-value string dimensions](../querying/multi-value-dimensions.md) are supported and may be treated as either scalar or array typed values. When treated as a scalar type, an expression will automatically be transformed to apply the scalar operation across all values of the multi-valued type, to mimic Druid's native behavior. Values that result in arrays will be coerced back into the native Druid string type for aggregation. Druid aggregations on multi-value string dimensions operate on the individual values, _not_ the 'array', behaving similarly to the `UNNEST` operator available in many SQL dialects. However, by using the `array_to_string` function, aggregations may be done on a stringified version of the complete array, allowing the complete row to be preserved. Using `string_to_array` in an expression post-aggregator allows transforming the stringified dimension back into the true native array type. + + +The following built-in functions are available. + +## General functions + +|name|description| +|----|-----------| +|cast|cast(expr,'LONG' or 'DOUBLE' or 'STRING' or 'LONG_ARRAY', or 'DOUBLE_ARRAY' or 'STRING_ARRAY') returns expr with the specified type. An exception can be thrown. Scalar types may be cast to array types and will take the form of a single element list (null will still be null). 
| +|if|if(predicate,then,else) returns 'then' if 'predicate' evaluates to a positive number, otherwise it returns 'else' | +|nvl|nvl(expr,expr-for-null) returns 'expr-for-null' if 'expr' is null (or empty string for string type) | +|like|like(expr, pattern[, escape]) is equivalent to SQL `expr LIKE pattern`| +|case_searched|case_searched(expr1, result1, \[\[expr2, result2, ...\], else-result\])| +|case_simple|case_simple(expr, value1, result1, \[\[value2, result2, ...\], else-result\])| +|bloom_filter_test|bloom_filter_test(expr, filter) tests the value of 'expr' against 'filter', a bloom filter serialized as a base64 string. See [bloom filter extension](../development/extensions-core/bloom-filter.md) documentation for additional details.| + +## String functions + +|name|description| +|----|-----------| +|concat|concat(expr, expr...) concatenate a list of strings| +|format|format(pattern[, args...]) returns a string formatted in the manner of Java's [String.format](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#format-java.lang.String-java.lang.Object...-).| +|like|like(expr, pattern[, escape]) is equivalent to SQL `expr LIKE pattern`| +|lookup|lookup(expr, lookup-name) looks up expr in a registered [query-time lookup](../querying/lookups.md)| +|parse_long|parse_long(string[, radix]) parses a string as a long with the given radix, or 10 (decimal) if a radix is not provided.| +|regexp_extract|regexp_extract(expr, pattern[, index]) applies a regular expression pattern and extracts a capture group index, or null if there is no match. If index is unspecified or zero, returns the substring that matched the pattern. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.| +|regexp_like|regexp_like(expr, pattern) returns whether `expr` matches regular expression `pattern`. 
The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. | +|contains_string|contains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-sensitive.| +|icontains_string|icontains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-insensitive.| +|replace|replace(expr, pattern, replacement) replaces pattern with replacement| +|substring|substring(expr, index, length) behaves like java.lang.String's substring| +|right|right(expr, length) returns the rightmost length characters from a string| +|left|left(expr, length) returns the leftmost length characters from a string| +|strlen|strlen(expr) returns the length of a string in UTF-16 code units| +|strpos|strpos(haystack, needle[, fromIndex]) returns the position of the needle within the haystack, with indexes starting from 0. The search will begin at fromIndex, or 0 if fromIndex is not specified. If the needle is not found then the function returns -1.| +|trim|trim(expr[, chars]) removes leading and trailing characters from `expr` if they are present in `chars`. `chars` defaults to ' ' (space) if not provided.| +|ltrim|ltrim(expr[, chars]) removes leading characters from `expr` if they are present in `chars`. `chars` defaults to ' ' (space) if not provided.| +|rtrim|rtrim(expr[, chars]) removes trailing characters from `expr` if they are present in `chars`. `chars` defaults to ' ' (space) if not provided.| +|lower|lower(expr) converts a string to lowercase| +|upper|upper(expr) converts a string to uppercase| +|reverse|reverse(expr) reverses a string| +|repeat|repeat(expr, N) repeats a string N times| +|lpad|lpad(expr, length, chars) returns a string of `length` from `expr` left-padded with `chars`. If `length` is shorter than the length of `expr`, the result is `expr` which is truncated to `length`. 
The result will be null if either `expr` or `chars` is null. If `chars` is an empty string, no padding is added, however `expr` may be trimmed if necessary.| +|rpad|rpad(expr, length, chars) returns a string of `length` from `expr` right-padded with `chars`. If `length` is shorter than the length of `expr`, the result is `expr` which is truncated to `length`. The result will be null if either `expr` or `chars` is null. If `chars` is an empty string, no padding is added, however `expr` may be trimmed if necessary.| + +## Time functions + +|name|description| +|----|-----------| +|timestamp|timestamp(expr[,format-string]) parses string expr into a date, then returns milliseconds since the Java epoch. Without 'format-string', expr is treated as ISO datetime format | +|unix_timestamp|same as the 'timestamp' function but returns seconds instead | +|timestamp_ceil|timestamp_ceil(expr, period, \[origin, \[timezone\]\]) rounds up a timestamp, returning it as a new timestamp. Period can be any ISO8601 period, like P3M (quarters) or PT12H (half-days). The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".| +|timestamp_floor|timestamp_floor(expr, period, \[origin, \[timezone\]\]) rounds down a timestamp, returning it as a new timestamp. Period can be any ISO8601 period, like P3M (quarters) or PT12H (half-days). The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".| +|timestamp_shift|timestamp_shift(expr, period, step, \[timezone\]) shifts a timestamp by a period (step times), returning it as a new timestamp. Period can be any ISO8601 period. Step may be negative. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".| +|timestamp_extract|timestamp_extract(expr, unit, \[timezone\]) extracts a time part from expr, returning it as a number. 
Unit can be EPOCH (number of seconds since 1970-01-01 00:00:00 UTC), SECOND, MINUTE, HOUR, DAY (day of month), DOW (day of week), DOY (day of year), WEEK (week of [week year](https://en.wikipedia.org/wiki/ISO_week_date)), MONTH (1 through 12), QUARTER (1 through 4), or YEAR. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00"| +|timestamp_parse|timestamp_parse(string expr, \[pattern, [timezone\]\]) parses a string into a timestamp using a given [Joda DateTimeFormat pattern](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat). If the pattern is not provided, this parses time strings in either ISO8601 or SQL format. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00", and will be used as the time zone for strings that do not include a time zone offset. Pattern and time zone must be literals. Strings that cannot be parsed as timestamps will be returned as nulls.| +|timestamp_format|timestamp_format(expr, \[pattern, \[timezone\]\]) formats a timestamp as a string with a given [Joda DateTimeFormat pattern](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat), or ISO8601 if the pattern is not provided. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00". Pattern and time zone must be literals.| + +## Math functions + +See javadoc of java.lang.Math for detailed explanation for each function. + +|name|description| +|----|-----------| +|abs|abs(x) returns the absolute value of x| +|acos|acos(x) returns the arc cosine of x| +|asin|asin(x) returns the arc sine of x| +|atan|atan(x) returns the arc tangent of x| +|bitwiseAnd|bitwiseAnd(x,y) returns the result of x & y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles| +|bitwiseComplement|bitwiseComplement(x) returns the result of ~x. 
Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles| +|bitwiseConvertDoubleToLongBits|bitwiseConvertDoubleToLongBits(x) converts the bits of an IEEE 754 floating-point double value to a long. If the input is not a double, it is implicitly cast to a double prior to conversion| +|bitwiseConvertLongBitsToDouble|bitwiseConvertLongBitsToDouble(x) converts a long to the IEEE 754 floating-point double specified by the bits stored in the long. If the input is not a long, it is implicitly cast to a long prior to conversion| +|bitwiseOr|bitwiseOr(x,y) returns the result of x [PIPE] y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles| +|bitwiseShiftLeft|bitwiseShiftLeft(x,y) returns the result of x << y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles| +|bitwiseShiftRight|bitwiseShiftRight(x,y) returns the result of x >> y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles| +|bitwiseXor|bitwiseXor(x,y) returns the result of x ^ y. 
Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles| +|atan2|atan2(y, x) returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta)| +|cbrt|cbrt(x) returns the cube root of x| +|ceil|ceil(x) returns the smallest (closest to negative infinity) double value that is greater than or equal to x and is equal to a mathematical integer| +|copysign|copysign(x, y) returns the first floating-point argument with the sign of the second floating-point argument| +|cos|cos(x) returns the trigonometric cosine of x| +|cosh|cosh(x) returns the hyperbolic cosine of x| +|cot|cot(x) returns the trigonometric cotangent of an angle x| +|div|div(x,y) is integer division of x by y| +|exp|exp(x) returns Euler's number raised to the power of x| +|expm1|expm1(x) returns e^x-1| +|floor|floor(x) returns the largest (closest to positive infinity) double value that is less than or equal to x and is equal to a mathematical integer| +|getExponent|getExponent(x) returns the unbiased exponent used in the representation of x| +|hypot|hypot(x, y) returns sqrt(x^2+y^2) without intermediate overflow or underflow| +|log|log(x) returns the natural logarithm of x| +|log10|log10(x) returns the base 10 logarithm of x| +|log1p|log1p(x) returns the natural logarithm of x + 1| +|max|max(x, y) returns the greater of two values| +|min|min(x, y) returns the smaller of two values| +|nextafter|nextafter(x, y) returns the floating-point number adjacent to x in the direction of y| +|nextUp|nextUp(x) returns the floating-point value adjacent to x in the direction of positive infinity| +|pi|pi returns the constant value of π | +|pow|pow(x, y) returns the value of x raised to the power of y| +|remainder|remainder(x, y) returns the remainder operation on two arguments as prescribed by the IEEE 754 standard| +|rint|rint(x) returns the value that is closest in value to x and is equal 
to a mathematical integer| +|round|round(x, y) returns the value of the x rounded to the y decimal places. While x can be an integer or floating-point number, y must be an integer. The type of the return value is specified by that of x. y defaults to 0 if omitted. When y is negative, x is rounded on the left side of the y decimal points. If x is `NaN`, x returns 0. If x is infinity, x will be converted to the nearest finite double. | +|scalb|scalb(d, sf) returns d * 2^sf rounded as if performed by a single correctly rounded floating-point multiply to a member of the double value set| +|signum|signum(x) returns the signum function of the argument x| +|sin|sin(x) returns the trigonometric sine of an angle x| +|sinh|sinh(x) returns the hyperbolic sine of x| +|sqrt|sqrt(x) returns the correctly rounded positive square root of x| +|tan|tan(x) returns the trigonometric tangent of an angle x| +|tanh|tanh(x) returns the hyperbolic tangent of x| +|todegrees|todegrees(x) converts an angle measured in radians to an approximately equivalent angle measured in degrees| +|toradians|toradians(x) converts an angle measured in degrees to an approximately equivalent angle measured in radians| +|ulp|ulp(x) returns the size of an ulp of the argument x| + + +## Array functions + +| function | description | +| --- | --- | +| array(expr1,expr ...) 
| constructs an array from the expression arguments, using the type of the first argument as the output array type | +| array_length(arr) | returns the length of the array expression | +| array_offset(arr,long) | returns the array element at the 0 based index supplied, or null for an out of range index| +| array_ordinal(arr,long) | returns the array element at the 1 based index supplied, or null for an out of range index | +| array_contains(arr,expr) | returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0 | +| array_overlap(arr1,arr2) | returns 1 if arr1 and arr2 have any elements in common, else 0 | +| array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array, or `-1` (`null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. | +| array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array, or `-1` (`null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. | +| array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting array type determined by the type of the array | +| array_append(arr,expr) | appends expr to arr, the resulting array type determined by the type of the first array | +| array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type determined by the type of the first array | +| array_set_add(arr,expr) | adds expr to arr and converts the array to a new array composed of the unique set of elements. 
The resulting array type is determined by the type of the array | +| array_set_add_all(arr1,arr2) | combines the unique set of elements of 2 arrays, the resulting array type is determined by the type of the first array | +| array_slice(arr,start,end) | returns the subarray of arr from the 0 based index start (inclusive) to end (exclusive), or `null` if start is less than 0, greater than the length of arr, or greater than end | +| array_to_string(arr,str) | joins all elements of arr by the delimiter specified by str | +| string_to_array(str1,str2) | splits str1 into an array on the delimiter specified by str2 | + + +## Apply functions +Apply functions allow for special 'lambda' expressions to be defined and applied to array inputs to enable free-form transformations. + +| function | description | +| --- | --- | +| map(lambda,arr) | applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array | +| cartesian_map(lambda,arr1,arr2,...) | applies a transform specified by a multi argument lambda expression to all elements of the Cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same | +| filter(lambda,arr) | filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match | +| fold(lambda,arr) | folds a 2 argument lambda across arr. The first argument of the lambda is the array element and the second is the accumulator, returning a single accumulated value. | +| cartesian_fold(lambda,arr1,arr2,...) | folds a multi argument lambda across the Cartesian product of all input arrays. The first arguments of the lambda are the array elements and the last is the accumulator, returning a single accumulated value.
| +| any(lambda,arr) | returns 1 if any element in the array matches the lambda expression, else 0 | +| all(lambda,arr) | returns 1 if all elements in the array match the lambda expression, else 0 | + + +### Lambda expressions syntax +Lambda expressions are a sort of function definition, where new identifiers can be defined and passed as input to the expression body: +``` +(identifier1 ...) -> expr +``` +e.g. +``` +(x, y) -> x + y +``` +The identifier arguments of a lambda expression correspond to the elements of the array it is being applied to. For example: +``` +map((x) -> x + 1, some_multi_value_column) +``` +will map each element of `some_multi_value_column` to the identifier `x` so that the lambda expression body can be evaluated for each `x`. Under the scoping rules, lambda arguments override identifiers that are defined outside the lambda expression body. Using the same example: + +``` +map((x) -> x + 1, x) +``` +In this case, the `x` when evaluating `x + 1` is the lambda argument, and thus an element of the multi-valued column `x`, rather than the column `x` itself. + +## Reduction functions + +Reduction functions operate on zero or more expressions and return a single expression. If no expressions are passed as +arguments, then the result is `NULL`. The expressions must all be convertible to a common data type, which will be the +type of the result: +* If all arguments are `NULL`, the result is `NULL`. Otherwise, `NULL` arguments are ignored. +* If the arguments comprise a mix of numbers and strings, the arguments are interpreted as strings. +* If all arguments are integer numbers, the arguments are interpreted as longs. +* If all arguments are numbers and at least one argument is a double, the arguments are interpreted as doubles. + +| function | description | +| --- | --- | +| greatest([expr1, ...]) | Evaluates zero or more expressions and returns the maximum value based on comparisons as described above.
| +| least([expr1, ...]) | Evaluates zero or more expressions and returns the minimum value based on comparisons as described above. | + + +## IP address functions + +For the IPv4 address functions, the `address` argument can either be an IPv4 dotted-decimal string (e.g., "192.168.0.1") or an IP address represented as a long (e.g., 3232235521). The `subnet` argument should be a string formatted as an IPv4 address subnet in CIDR notation (e.g., "192.168.0.0/16"). + +| function | description | +| --- | --- | +| ipv4_match(address, subnet) | Returns 1 if the `address` belongs to the `subnet` literal, else 0. If `address` is not a valid IPv4 address, then 0 is returned. This function is more efficient if `address` is a long instead of a string.| +| ipv4_parse(address) | Parses `address` into an IPv4 address stored as a long. If `address` is a long that is a valid IPv4 address, then it is passed through. Returns null if `address` cannot be represented as an IPv4 address. | +| ipv4_stringify(address) | Converts `address` into an IPv4 address dotted-decimal string. If `address` is a string that is a valid IPv4 address, then it is passed through. 
Returns null if `address` cannot be represented as an IPv4 address.| + + +## Vectorization Support +A number of expressions support ['vectorized' query engines](../querying/query-context.md#vectorization-parameters). + +Supported features: +* constants and identifiers are supported for any column type +* `cast` is supported for numeric and string types +* math operators: `+`,`-`,`*`,`/`,`%`,`^` are supported for numeric types +* comparison operators: `=`, `!=`, `>`, `>=`, `<`, `<=` are supported for numeric types +* math functions: `abs`, `acos`, `asin`, `atan`, `cbrt`, `ceil`, `cos`, `cosh`, `cot`, `exp`, `expm1`, `floor`, `getExponent`, `log`, `log10`, `log1p`, `nextUp`, `rint`, `signum`, `sin`, `sinh`, `sqrt`, `tan`, `tanh`, `toDegrees`, `toRadians`, `ulp`, `atan2`, `copySign`, `div`, `hypot`, `max`, `min`, `nextAfter`, `pow`, `remainder`, `scalb` are supported for numeric types +* time functions: `timestamp_floor` (with constant granularity argument) is supported for numeric types +* other: `parse_long` is supported for numeric and string types \ No newline at end of file diff --git a/misc/papers-and-talks.md b/misc/papers-and-talks.md new file mode 100644 index 0000000..5bc23dc --- /dev/null +++ b/misc/papers-and-talks.md @@ -0,0 +1,43 @@ +--- +id: papers-and-talks +title: "Papers" +--- + + + +## Papers + +* [Druid: A Real-time Analytical Data Store](http://static.druid.io/docs/druid.pdf) - Discusses the Druid architecture in detail. + +* [The RADStack: Open Source Lambda Architecture for Interactive Analytics](http://static.druid.io/docs/radstack.pdf) - Discusses how Druid supports real-time and batch workflows. + +## Presentations + +* [Introduction to Druid](https://www.youtube.com/watch?v=hgmxVPx4vVw) - Discusses the motivations behind Druid and the architecture of the system. + +* [Druid: Interactive Queries Meet Real-Time Data](https://www.youtube.com/watch?v=Dlqj34l2upk) - Discusses how real-time ingestion in Druid works and use cases at Netflix.
+ +* [Not Exactly! Fast Queries via Approximation Algorithms](https://www.youtube.com/watch?v=Hpd3f_MLdXo) - Discusses how approximate algorithms work in Druid. + +* [Real-time Analytics with Open Source Technologies](https://www.youtube.com/watch?v=kJMYVpnW_AQ) - Discusses Lambda architectures with Druid. + +* [Stories from the Trenches - The Challenges of Building an Analytics Stack](https://www.youtube.com/watch?v=Sz4w75xRrYM) - Discusses features that were added to scale Druid. + +* [Building Interactive Applications at Scale](https://www.youtube.com/watch?v=bZ3LqG3iHbM) - Discusses building applications on top of Druid. \ No newline at end of file