mirror of https://github.com/apache/druid.git
1) A few minor edits to the paper
parent bf2f2df22c
commit 847790f784
@@ -153,13 +153,13 @@ In many ways, Druid shares similarities with other interactive query systems
 stores such as BigTable \cite{chang2008bigtable}, Dynamo \cite{decandia2007dynamo}, and Cassandra \cite{lakshman2010cassandra}. Unlike
 most traditional data stores, Druid operates mainly on read-only data
 and has limited functionality for writes. The system is highly optimized
-for large-scale transactional data aggregation and arbitrarily deep data exploration. Druid is highly configurable
+for large-scale event data aggregation and arbitrarily deep data exploration. Druid is highly configurable
 and allows users to adjust levels of fault tolerance and
 performance.

 Druid builds on the ideas of other distributed data stores, real-time
 computation engines, and search engine indexing algorithms. In this
-paper, we make the following contributions to academia:
+paper, we make the following contributions.
 \begin{itemize}
 \item We outline Druid’s real-time ingestion and query capabilities
 and explain how we can explore events within milliseconds of their
@@ -204,9 +204,9 @@ a Druid segment, consider the data set shown in Table~\ref{tab:sample_data}.
 A segment is composed of multiple binary files, each representing a
 column of a data set. The data set in Table~\ref{tab:sample_data} consists of 8 distinct
 columns, one of which is the timestamp column. Druid always requires a
-timestamp column because it (currently) only operates with event-based
-data. Segments always represent some time interval and each column
-file contains the specific values for that column over the time
+timestamp column as a method of simplifying data distribution, data retention policies and
+first-level query pruning. Segments always represent some time interval and each column
+contains the specific values for that column over the time
 interval. Since segments always contain data for a time range, it is
 logical that Druid partitions data into smaller chunks based on the
 timestamp value. In other words, segments can be thought of as blocks
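To make the segment layout described in this hunk concrete, here is a minimal sketch in Java. Every name in it (SegmentSketch, partitionByHour, the hourly granularity) is hypothetical rather than Druid's actual code: a segment covers one time interval, holds one value list per column, and events are assigned to segments by truncating their required timestamp.

```java
// Minimal sketch, not Druid's actual classes: a column-oriented, time-partitioned segment.
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SegmentSketch {
    final Instant intervalStart;                                // a segment always covers one time interval
    final Map<String, List<Object>> columns = new HashMap<>();  // one "file" (value list) per column

    SegmentSketch(Instant intervalStart) {
        this.intervalStart = intervalStart;
    }

    void addRow(Map<String, Object> event) {
        // Append each field of the event to its own column.
        for (Map.Entry<String, Object> field : event.entrySet()) {
            columns.computeIfAbsent(field.getKey(), k -> new ArrayList<>()).add(field.getValue());
        }
    }

    // Bucket events into segments by truncating the required timestamp column to the
    // segment granularity (hourly here, chosen arbitrarily for the sketch).
    static Map<Instant, SegmentSketch> partitionByHour(List<Map<String, Object>> events) {
        Map<Instant, SegmentSketch> segments = new TreeMap<>();
        for (Map<String, Object> event : events) {
            Instant bucket = ((Instant) event.get("timestamp")).truncatedTo(ChronoUnit.HOURS);
            segments.computeIfAbsent(bucket, SegmentSketch::new).addRow(event);
        }
        return segments;
    }
}
```

Because every segment carries its interval, a query over a time range only has to consult segments whose intervals overlap that range, which is the first-level query pruning and time-based retention the revised sentence points to.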
@@ -371,11 +371,11 @@ in Figure~\ref{fig:data-ingestion}.
 The purpose of the message bus in Figure~\ref{fig:data-ingestion} is to act as a buffer for
 incoming events. The message bus can maintain offsets indicating the
 position in an event stream that a real-time node has read up to and
-real-time nodes can update these offsets periodically. The message bus also acts as a backup storage for recent events.
+real-time nodes can update these offsets periodically. The message bus also acts as backup storage for recent events.
 Real-time nodes ingest data by reading events from the message bus. The time from event creation to message bus storage to
 event consumption is on the order of hundreds of milliseconds.

-Real-time nodes maintain an in-memory index for all incoming
+Real-time nodes maintain an in-memory index buffer for all incoming
 events. These indexes are incrementally populated as new events appear on the message bus. The indexes are also directly queryable.
 Real-time nodes persist their indexes to disk either periodically or after some maximum row limit is
 reached. After each persist, a real-time node updates the message bus
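The ingest-persist-commit cycle this hunk describes can be summarized in a short sketch. The MessageBus interface and the row limit below are assumptions made up for illustration (Druid's real-time nodes consume from an external bus such as Kafka); the sketch only shows the loop: read an event, add it to the queryable in-memory index, and on reaching the row limit persist the index to disk and report the new offset back to the bus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical message-bus interface for illustration; not a Druid API.
interface MessageBus {
    Map<String, Object> poll();      // next event, or null if none is available
    void commitOffset(long offset);  // record how far this node has read
}

class RealtimeIngestSketch {
    private static final int MAX_ROWS_BEFORE_PERSIST = 500_000; // assumed row limit

    private final MessageBus bus;
    private final List<Map<String, Object>> inMemoryIndex = new ArrayList<>(); // directly queryable
    private long offset = 0;

    RealtimeIngestSketch(MessageBus bus) {
        this.bus = bus;
    }

    void ingestOnce() {
        Map<String, Object> event = bus.poll();
        if (event == null) {
            return;
        }
        inMemoryIndex.add(event);  // incrementally populate the in-memory index
        offset++;
        if (inMemoryIndex.size() >= MAX_ROWS_BEFORE_PERSIST) {
            persistToDisk(inMemoryIndex);  // persist periodically or at the row limit
            bus.commitOffset(offset);      // then update the message bus offsets
            inMemoryIndex.clear();
        }
    }

    private void persistToDisk(List<Map<String, Object>> rows) {
        // Placeholder: a real node would write an immutable, column-oriented segment here.
    }
}
```

Committing the offset only after a persist means a restarted node can re-read the not-yet-persisted tail of the stream from the bus, which is what makes the bus useful as backup storage for recent events.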
@@ -432,7 +432,7 @@ partitions across nodes. Each node announces the real-time segment it
 is serving and each real-time segment has a partition number. Data
 from individual nodes will be merged at the Broker level. To our
 knowledge, the largest production level real-time Druid cluster is
-consuming approximately 2 TB of raw data per hour.
+consuming approximately 500MB/s (150,000 events/s or 2 TB/hour of raw data).

 \subsection{Broker Nodes}
 Broker nodes act as query routers to other queryable nodes such as
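As a quick unit check on the revised figures (only the 500 MB/s, 150,000 events/s, and roughly 2 TB/hour values come from the diff; the rest is plain arithmetic), the three rates are mutually consistent:

```latex
\[
  500\,\mathrm{MB/s} \times 3600\,\mathrm{s/hour} = 1.8\,\mathrm{TB/hour} \approx 2\,\mathrm{TB/hour},
  \qquad
  \frac{500\,\mathrm{MB/s}}{150{,}000\,\mathrm{events/s}} \approx 3.3\,\mathrm{KB\ per\ event}.
\]
```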