Finish translating the "What is Druid" section

2021-07-22 14:43:17 -04:00 · 2021-07-22 14:43:17 -04:00 · bfae313cf8
parent feaad880e1
commit bfae313cf8
1 changed files with 40 additions and 68 deletions
--- a/design/index.md
+++ b/design/index.md
@@ -3,81 +3,53 @@

 ## What is Druid

-Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics
-("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often
-used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
-As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs
-that need fast aggregations. Druid works best with event-oriented data.
+Apache Druid is a real-time analytics database designed for fast querying and analysis ("[OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets.

-Common application areas for Druid include:
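+Druid is most often used as a database for use cases where real-time ingestion, fast query performance, and high uptime are important.
+For example, Druid is commonly used as the data source behind the GUIs of analytical applications, or as a backend for highly concurrent APIs that need fast aggregations.
+Druid also works best with event-oriented data.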

- Clickstream analytics (web and mobile analytics)
- Network telemetry analytics (network performance monitoring)
- Server metrics storage
- Supply chain analytics (manufacturing metrics)
- Application performance metrics
- Digital marketing/advertising analytics
- Business intelligence / OLAP
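+Systems that commonly use Druid as a data source include:
+- Clickstream analytics (web and mobile analytics)
+- Network telemetry analytics (network performance monitoring)
+- Server metrics storage
+- Supply chain analytics (manufacturing metrics)
+- Application performance metrics
+- Digital marketing/advertising analytics
+- Business intelligence / OLAP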

-Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems. Some of
-Druid's key features are:
+Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems.

-1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
-needed for a particular query.  This gives a huge speed boost to queries that only hit a few columns. In addition, each
-column is stored optimized for its particular data type, which supports fast scans and aggregations.
-2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can
-offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
-few seconds.
-3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either real-time (ingested data is immediately available for
-querying) or in batches.
-5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or
-remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
-Druid servers fail, the system will automatically route around the damage until those servers can be replaced. Druid
-is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
-updates.
-6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
-stored safely in [deep storage](architecture.html#deep-storage) (typically cloud storage, HDFS, or a shared filesystem).
-Your data can be recovered from deep storage even if every single Druid server fails. For more limited failures affecting
-just a few Druid servers, replication ensures that queries are still possible while the system recovers.
-7. **Indexes for quick filtering.** Druid uses [Roaring](https://roaringbitmap.org/) or
-[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create indexes that power fast filtering and
-searching across multiple columns.
-8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
-This means time-based queries will only access the partitions that match the time range of the query. This leads to
-significant performance improvements for time-based data.
-9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
-computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
-substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also
-offers exact count-distinct and exact ranking.
-10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
-summarization partially pre-aggregates your data, and can lead to big costs savings and performance boosts.
+If you are not familiar with these kinds of databases, we suggest looking up their definitions and the features they provide.

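+Some of Druid's key features are:
+1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns needed for a particular query.
+   This gives a large speed boost to queries that touch only a few columns. In addition, each column is stored in a form optimized for its data type, which supports fast scans and aggregations.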

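-Apache Druid is a real-time analytics database designed for fast query analysis ("[OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often used as a database for use cases that need real-time ingestion, high query performance, and high uptime. Druid is also commonly used to power the GUIs of analytical applications, or as a highly concurrent backend API that needs fast aggregations. Druid works best with event-oriented data.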
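+2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers,
+   and can offer ingest rates of millions of records per second, retention of trillions of records, and query latencies of sub-second to a few seconds.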
   
-Druid is commonly used in the following scenarios:
+3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.

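-* Clickstream analytics (web and mobile)
-* Network telemetry analytics (network performance monitoring)
-* Service metrics storage
-* Supply chain analytics (manufacturing metrics)
-* Application performance metrics analytics
-* Digital advertising analytics
-* Business intelligence / OLAP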
+4. **Realtime or batch ingestion.** Druid can ingest data either in real time (ingested data is immediately available for querying) or in batches.
   
-Druid's core architecture absorbs and combines the strengths of [data warehouses](https://en.wikipedia.org/wiki/Data_warehouse), [timeseries databases](https://en.wikipedia.org/wiki/Time_series_database), and [search systems](https://en.wikipedia.org/wiki/Search_engine_(computing)). Its key features are:
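+5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or remove servers and the cluster will rebalance itself automatically, in the background, without any downtime.
+   If any Druid server fails, the system automatically routes around the damaged node and keeps running without interruption.
+   Druid is designed to run 24/7 with no need for planned downtime for any reason, including configuration changes and software updates.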
   
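-1. **Columnar storage.** Druid uses columnar storage, meaning a particular query only needs to read the specific columns it uses, which greatly improves performance for queries that touch only some columns. In addition, each column is stored in a form optimized for its data type, which supports fast scans and aggregations.
-2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can offer ingest rates of millions of records per second, retention of trillions of records, and query latencies of sub-second to a few seconds.
-3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either in real time (ingested data is immediately available for querying) or in batches.
-5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster simply add or remove services, and the cluster will rebalance itself automatically in the background without any downtime. If any Druid server fails, the system automatically routes around the damage. Druid is designed to run 24/7 with no need for planned downtime for any reason, including configuration changes and software updates.
-6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested data, a copy is stored safely in [deep storage](Design/../chapter-1.md) (typically cloud storage, HDFS, or a shared filesystem). Your data can be recovered from deep storage even if a Druid service fails. For limited failures affecting only a few Druid services, replicas ensure that queries are still possible while the system recovers.
-7. **Indexes for quick filtering.** Druid uses [CONCISE](https://arxiv.org/pdf/1004.0403.pdf) or [Roaring](https://roaringbitmap.org/) compressed bitmap indexes to create indexes that support fast filtering and searching across multiple columns.
-8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition on other fields. This means time-based queries only access the partitions that match the query's time range, which greatly improves performance for time-based data.
-9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and approximate histogram and quantile computation. These algorithms use bounded memory and are usually much faster than exact computation. For cases where accuracy matters more than speed, Druid also offers exact count-distinct and exact ranking.
-10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This summarization partially pre-aggregates your data, and can save significant cost and improve performance.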
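+6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is stored safely in [deep storage](architecture.md#deep-storage) (typically cloud storage, HDFS, or a shared filesystem).
+   Your data can be recovered from deep storage even if every single Druid server fails. For more limited failures affecting just a few Druid servers, replication ensures that queries are still possible while the system recovers.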
+   
+7. **Indexes for quick filtering.** Druid uses [Roaring](https://roaringbitmap.org/) or
+[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create indexes that power fast filtering and searching across multiple columns.
+   
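+8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition on other fields.
+   This means time-based queries only access the partitions that match the query's time range, which significantly improves performance for time-based data.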
+   
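+9. **Approximate algorithms.** Druid includes algorithms for approximate `count-distinct`, approximate ranking, and computation of approximate histograms and quantiles.
+   These algorithms offer bounded memory usage and are often substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also offers exact count-distinct and exact ranking.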
+   
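+10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This summarization partially pre-aggregates your data, and can lead to significant cost savings and performance gains.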
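
Note on item 10: the ingest-time summarization it describes can be illustrated with a small sketch. This is not Druid's implementation. The function name `rollup`, the tuple layout `(timestamp, page, bytes)`, and the hourly granularity are assumptions made up for this example. It only shows the idea of collapsing raw events that share a time bucket and dimension values into one pre-aggregated row.

```python
from collections import defaultdict


def rollup(events, granularity_seconds=3600):
    """Pre-aggregate raw events by (time bucket, dimension value).

    `events` is an iterable of hypothetical (epoch_seconds, page, nbytes)
    tuples. Each output row keeps a count and a byte sum per bucket.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, page, nbytes in events:
        # Truncate the timestamp to the start of its bucket.
        bucket = ts - ts % granularity_seconds
        row = totals[(bucket, page)]
        row["count"] += 1
        row["bytes"] += nbytes
    return dict(totals)


# Four raw events collapse into three stored rows.
events = [
    (1626962400, "/home", 120),
    (1626962460, "/home", 80),
    (1626962520, "/docs", 300),
    (1626966000, "/home", 50),
]
rolled = rollup(events)
```

The saving grows with the number of events that share a bucket and dimension values. The trade-off is that individual events can no longer be queried once they are merged.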

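Note on item 7: filtering across multiple columns with bitmap indexes can be sketched as follows. This is a simplified, uncompressed illustration that uses plain Python integers as bitsets. It does not use Roaring or CONCISE compression, and the helper names `build_bitmap_index` and `matching_rows` are hypothetical.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bitset of matching row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[column]
        # Set bit `row_id` in the bitmap for this value.
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Return the row ids whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"country": "US", "device": "mobile"},
    {"country": "CA", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]
country_idx = build_bitmap_index(rows, "country")
device_idx = build_bitmap_index(rows, "device")

# country = 'US' AND device = 'mobile' becomes a single bitwise AND.
hits = country_idx["US"] & device_idx["mobile"]
result = matching_rows(hits)
```

Combining filters becomes a bitwise operation over the bitmaps, so only the rows that survive every filter need to be read.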

 ## When should I use Druid