Finish translating the "What is Druid" content
parent feaad880e1
commit bfae313cf8
102 design/index.md
@@ -3,81 +3,53 @@
## What is Druid

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often
used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs
that need fast aggregations. Druid works best with event-oriented data.
Apache Druid is a real-time analytics database designed for fast querying and analysis ("[OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets.

Common application areas for Druid include:
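Druid is most often used as a database for use cases that require real-time ingestion, high query performance, and highly stable operation.
For example, Druid is commonly used as the data source behind graphical analytics tools, or as the backend for APIs that need fast aggregation under high concurrency.
Druid is also well suited to event-oriented data.
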
- Clickstream analytics (web and mobile analytics)
- Network telemetry analytics (network performance monitoring)
- Server metrics storage
- Supply chain analytics (manufacturing metrics)
- Application performance metrics
- Digital marketing/advertising analytics
- Business intelligence / OLAP
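Systems that commonly use Druid as a data source include:
- Clickstream analytics (web or mobile analytics)
- Network monitoring analytics (network performance monitoring)
- Server metrics storage
- Supply chain analytics (manufacturing metrics)
- Application performance metrics
- Digital advertising analytics
- Business intelligence / OLAP
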
Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems. Some of
Druid's key features are:
Druid's core architecture combines ideas from data warehouses, timeseries databases, and log analysis (logsearch) systems.

1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
needed for a particular query. This gives a huge speed boost to queries that only hit a few columns. In addition, each
column is stored optimized for its particular data type, which supports fast scans and aggregations.
2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can
offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
few seconds.
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
4. **Realtime or batch ingestion.** Druid can ingest data either real-time (ingested data is immediately available for
querying) or in batches.
5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or
remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
Druid servers fail, the system will automatically route around the damage until those servers can be replaced. Druid
is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
stored safely in [deep storage](architecture.html#deep-storage) (typically cloud storage, HDFS, or a shared filesystem).
Your data can be recovered from deep storage even if every single Druid server fails. For more limited failures affecting
just a few Druid servers, replication ensures that queries are still possible while the system recovers.
7. **Indexes for quick filtering.** Druid uses [Roaring](https://roaringbitmap.org/) or
[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create indexes that power fast filtering and
searching across multiple columns.
8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
This means time-based queries will only access the partitions that match the time range of the query. This leads to
significant performance improvements for time-based data.
9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also
offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
summarization partially pre-aggregates your data, and can lead to big cost savings and performance boosts.
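If you are not familiar with the kinds of data and databases mentioned above, we suggest searching for their definitions and the features they provide.

Some of Druid's key features are:
1. **Columnar storage format.** Druid uses column-oriented storage, which means a query only needs to read the specific columns it requires.
This design greatly improves performance for queries that touch only a few columns. In addition, each column is stored in a form optimized for its data type, which supports fast scans and aggregations.
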
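Apache Druid is a real-time analytics database designed for fast query analysis ("[OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often used as a database for use cases that require real-time ingestion, high-performance queries, and highly stable operation. Druid is also commonly used to power the graphical interfaces of analytical applications, or as a highly concurrent backend API that needs fast aggregations. Druid works best with event-oriented data.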
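2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers,
and can ingest millions of records per second, retain trillions of records, and serve queries with latencies from 100 ms to a few seconds.
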
Druid is commonly used in the following scenarios:
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.

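* Clickstream analytics (web and mobile)
* Network monitoring analytics (network performance monitoring)
* Service metrics storage
* Supply chain analytics (manufacturing metrics)
* Application performance metrics analysis
* Digital advertising analytics
* Business intelligence / OLAP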
4. **Realtime or batch ingestion.** Druid can ingest data either in real time (ingested data is immediately available for querying) or in batches.

Druid's core architecture absorbs and combines the strengths of [data warehouses](https://en.wikipedia.org/wiki/Data_warehouse), [timeseries databases](https://en.wikipedia.org/wiki/Time_series_database), and [search systems](https://en.wikipedia.org/wiki/Search_engine_(computing)). Its main features are:
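5. **Self-healing, self-balancing, easy to operate.** As a cluster operator, to scale the cluster out or in you simply add or remove servers, and the cluster rebalances itself automatically in the background without any downtime.
If any Druid server fails, the system automatically routes around the failed node and keeps running without interruption.
Druid is designed to run 24/7, with no need for planned downtime for any reason (for example, configuration changes or software updates).
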
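1. **Columnar storage.** Druid uses column-oriented storage, which means a query only needs to read the specific columns it requires; this greatly improves performance for queries that touch only a few columns. In addition, each column is stored in a form optimized for its data type, which supports fast scans and aggregations.
2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can offer ingest rates of millions of records per second, retention of trillions of records, and query latencies from sub-second to a few seconds.
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
4. **Realtime or batch ingestion.** Druid can ingest data either in real time (ingested data is immediately available for querying) or in batches.
5. **Self-healing, self-balancing, easy to operate.** As a cluster operator, to scale the cluster out or in you simply add or remove services, and the cluster rebalances itself automatically in the background without any downtime. If any Druid server fails, the system automatically routes around the damage. Druid is designed to run 24/7, with no need for planned downtime for any reason, including configuration changes and software updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is stored safely in [deep storage](Design/../chapter-1.md) (typically cloud storage, HDFS, or a shared filesystem). Even if a Druid service fails, your data can be recovered from deep storage. For limited failures affecting only a few Druid services, replicas ensure that queries are still possible while the system recovers.
7. **Indexes for quick filtering.** Druid uses [CONCISE](https://arxiv.org/pdf/1004.0403.pdf) or [Roaring](https://roaringbitmap.org/) compressed bitmap indexes to power fast filtering and searching across multiple columns.
8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields. This means time-based queries only access the partitions that match the query's time range, which greatly improves performance for time-based data.
9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and computation of approximate histograms and quantiles. These algorithms use a bounded amount of memory and are often much faster than exact computations. For situations where accuracy matters more than speed, Druid also offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This summarization partially pre-aggregates your data, and can save significant cost and improve performance.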
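6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, the data is stored safely in [deep storage](architecture.md#deep-storage) (typically cloud storage, HDFS, or a shared filesystem).
Even if a single Druid service fails, your data can be recovered from deep storage. For limited failures affecting only a few Druid services, the stored replicas ensure that queries are still possible while the system recovers.

7. **Indexes for quick filtering.** Druid uses [Roaring](https://roaringbitmap.org/) or
[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create indexes that support fast filtering and searching across multiple columns.

8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
This means time-based queries only access the partitions that match the query's time range, which greatly improves performance when processing time-based data.

9. **Approximate algorithms.** Druid includes algorithms for approximate `count-distinct`, approximate ranking, and computation of approximate histograms and quantiles.
These algorithms use a bounded amount of memory and are often much faster than exact computations. For situations where accuracy matters more than speed, Druid also offers exact count-distinct and exact ranking.

10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This summarization partially pre-aggregates your data, and can save significant cost and improve performance.
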
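The bitmap indexes in item 7 can be pictured with a small sketch. It uses plain Python integers as uncompressed bitsets, whereas Roaring and CONCISE are compressed bitmap formats that are not implemented here. The helper names and the sample `country` and `device` columns are assumptions for illustration only.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose set bits mark matching row ids.

    Hypothetical helper; an uncompressed stand-in for a compressed bitmap index.
    """
    index = {}
    for row_id, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Return the row ids whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Made-up sample rows with two filterable columns.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "CA", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]
country = build_bitmap_index(rows, "country")
device = build_bitmap_index(rows, "device")

# A filter on two columns becomes a bitwise AND of two bitmaps.
hits = matching_rows(country["US"] & device["mobile"])
```

Because each filter reduces to bitwise operations on per-value bitmaps, combining conditions across several columns never has to scan the rows themselves. That is the property the feature description relies on for fast filtering.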
## When should I use Druid