mirror of https://github.com/apache/druid.git
Merge pull request #1838 from druid-io/update-paper-again
more edits to radstack paper
commit 092b5b19d2
Binary file not shown.
@@ -676,10 +676,11 @@ To understand a real-world pipeline, let's consider an example from online
 advertising. In online advertising, events are generated by impressions (views)
 of an ad and clicks of an ad. Many advertisers are interested in knowing how
 many impressions of an ad converted into clicks. Impression streams and click
-streams are often recorded as separate streams by ad servers and need to be
-joined. An example of data generated by these two event streams is shown in
-Figure~\ref{fig:imps_clicks}. Every event has a unique id or key that
-identifies the ad served. We use this id as our join key.
+streams are almost always generated as separate streams by ad servers. Recall
+that Druid does not support joins at query time, so the events must be
+joined at processing time. An example of data generated by these two event
+streams is shown in Figure~\ref{fig:imps_clicks}. Every event has a unique
+impression id that identifies the ad served. We use this id as our join key.

 \begin{figure}
 \centering
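As a concrete sketch of the join key described in the new paragraph above, the snippet below models the two event types. It is illustrative only; the class and field names are hypothetical and do not come from the paper:

```java
// Hypothetical shapes for the two ad-serving event streams. The only
// essential property is the shared impression id, which is the join key.
public class JoinKeyExample {
    record ImpressionEvent(String impressionId, long timestamp, String adId) {}
    record ClickEvent(String impressionId, long timestamp) {}

    public static void main(String[] args) {
        var imp = new ImpressionEvent("imp-123", 1420070400000L, "ad-42");
        var click = new ClickEvent("imp-123", 1420070460000L);
        // The events arrive on separate streams; a matching impressionId
        // marks the pair that must be joined during stream processing.
        System.out.println(imp.impressionId().equals(click.impressionId())); // true
    }
}
```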
@@ -692,16 +693,13 @@ located in two different Kafka partitions on two different topics.
 \label{fig:imps_clicks}
 \end{figure}

-Given that Druid does not support joins in queries, we need to do this join at
-the data processing level. Our approach to do streaming joins is to buffer
-events for a configurable period of time. We can leverage any key/value
-database as the buffer, although we prefer one that has relatively high read
-and write throughput. Events are typically buffered for 30 minutes. Once an
-event arrives in the system with a join key that exists in the buffer, we
-perform the first operation in our data pipeline: shuffling. A shuffle
-operation writes events from our impressions and clicks streams to Kafka such
-that the events that need to be joined are written to the same Kafka partition.
-This is shown in Figure~\ref{fig:shuffled}.
+The first stage of the Samza processing pipeline is a shuffle step. Events are
+written to a keyed Kafka topic based on the hash of an event's impression id.
+This ensures that the events that need to be joined are written to the same
+Kafka partition. YARN containers running Samza tasks may read from one or
+more Kafka topics, so it is important that the Samza task performing the join
+actually receives both events that need to be joined. This shuffle stage is
+shown in Figure~\ref{fig:shuffled}.

 \begin{figure}
 \centering
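The shuffle step added above can be sketched with the standard kafka-clients producer API; this is our illustration, not code from the paper, and the topic name and JSON payloads are hypothetical. Because the record key is the impression id, Kafka's default partitioner hashes it, so an impression and its click land in the same partition and are therefore read by the same Samza task:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ShuffleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Re-key both streams by impression id: same key => same partition.
            producer.send(new ProducerRecord<>("shuffled-events", "imp-123",
                "{\"type\":\"impression\",\"impression_id\":\"imp-123\"}"));
            producer.send(new ProducerRecord<>("shuffled-events", "imp-123",
                "{\"type\":\"click\",\"impression_id\":\"imp-123\"}"));
        }
    }
}
```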
@@ -714,9 +712,11 @@ partition.
 \end{figure}

 The next stage in the data pipeline is to actually join the impression and
-click. This is done by creating a new field in the data, called "is\_clicked".
-This field is marked as "true" if a successful join occurs. This is shown in
-Figure~\ref{fig:joined}
+click events. This is done by another Samza task that creates a new event in
+the data with a new field called "is\_clicked". This field is marked as "true"
+if an impression event and a click event with the same impression id are both
+present. The original events are discarded, and the new event is sent further
+downstream. This join stage is shown in Figure~\ref{fig:joined}.

 \begin{figure}
 \centering
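The join stage described above fits Samza's classic low-level StreamTask API backed by a local key/value store. The sketch below is ours, with hypothetical store, topic, and payload formats; presumably an unmatched impression would be flushed with "is_clicked" set to false once the buffering window expires, and that timeout logic is omitted here:

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AdEventJoinTask implements StreamTask, InitableTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "joined-events");
    private KeyValueStore<String, String> impressions;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // Local store that buffers impressions awaiting their clicks.
        impressions = (KeyValueStore<String, String>) context.getStore("impression-buffer");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String impressionId = (String) envelope.getKey();
        String event = (String) envelope.getMessage();

        if (event.contains("\"type\":\"impression\"")) {
            // Buffer the impression until a matching click arrives.
            impressions.put(impressionId, event);
        } else if (impressions.get(impressionId) != null) {
            // Click matched a buffered impression: discard the originals and
            // send a single joined event downstream with is_clicked = true.
            impressions.delete(impressionId);
            String joined = "{\"impression_id\":\"" + impressionId + "\",\"is_clicked\":true}";
            collector.send(new OutgoingMessageEnvelope(OUTPUT, impressionId, joined));
        }
    }
}
```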