Merge pull request #1838 from druid-io/update-paper-again

more edits to radstack paper
This commit is contained in:
Fangjin Yang 2015-10-18 19:53:31 -07:00
commit 092b5b19d2
2 changed files with 17 additions and 17 deletions

@ -676,10 +676,11 @@ To understand a real-world pipeline, let's consider an example from online
advertising. In online advertising, events are generated by impressions (views)
of an ad and clicks of an ad. Many advertisers are interested in knowing how
many impressions of an ad converted into clicks. Impression streams and click
streams are often recorded as separate streams by ad servers and need to be
joined. An example of data generated by these two event streams is shown in
Figure~\ref{fig:imps_clicks}. Every event has a unique id or key that
identifies the ad served. We use this id as our join key.
streams are almost always generated separately by ad servers. Recall
that Druid does not support joins at query time, so the joined events must be
generated at processing time. An example of data generated by these two event
streams is shown in Figure~\ref{fig:imps_clicks}. Every event has a unique
impression id that identifies the ad served. We use this id as our join key.
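As a purely illustrative aside (the field names below are assumptions, not the
actual event schema), the two event types can be pictured as records that share
the impression id used as the join key:

\begin{verbatim}
// Illustrative event shapes only; field names are assumed for this sketch.
class ImpressionEvent {
  String impressionId;  // unique id of the served ad; used as the join key
  long   timestamp;
  String advertiserId;
}

class ClickEvent {
  String impressionId;  // points back to the impression that was clicked
  long   timestamp;
}
\end{verbatim}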
\begin{figure}
\centering
@ -692,16 +693,13 @@ located in two different Kafka partitions on two different topics.
\label{fig:imps_clicks}
\end{figure}
Given that Druid does not support joins in queries, we need to do this join at
the data processing level. Our approach to do streaming joins is to buffer
events for a configurable period of time. We can leverage any key/value
database as the buffer, although we prefer one that has relatively high read
and write throughput. Events are typically buffered for 30 minutes. Once an
event arrives in the system with a join key that exists in the buffer, we
perform the first operation in our data pipeline: shuffling. A shuffle
operation writes events from our impressions and clicks streams to Kafka such
that the events that need to be joined are written to the same Kafka partition.
This is shown in Figure~\ref{fig:shuffled}.
The first stage of the Samza processing pipeline is a shuffle step. Events are
written to a keyed Kafka topic based on the hash of an event's impression id.
This ensures that the events that need to be joined are written to the same
Kafka partition. YARN containers running Samza tasks may read from one or more
Kafka topics, so it is important that the Samza task performing the join
actually receives both of the events that need to be joined. This shuffle stage
is shown in Figure~\ref{fig:shuffled}.
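To make the shuffle concrete, the following is a minimal sketch of such a task
against Samza's StreamTask API; the system name, topic name, and the
"impression\_id" field are assumptions for illustration, not the pipeline's
actual code.

\begin{verbatim}
import java.util.Map;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical shuffle task sketch: re-emits impression and click events to
// a single keyed topic, using the impression id as the Kafka message key so
// that matching events land in the same partition.
public class ShuffleTask implements StreamTask {
  private static final SystemStream SHUFFLED =
      new SystemStream("kafka", "ad-events-shuffled");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    String impressionId = (String) event.get("impression_id");
    // Keying by impression id makes Kafka hash an impression and its click
    // to the same partition of the shuffled topic.
    collector.send(new OutgoingMessageEnvelope(SHUFFLED, impressionId, event));
  }
}
\end{verbatim}

Because Kafka chooses a partition by hashing the message key, keying by
impression id is what guarantees that an impression and its corresponding
click end up in the same partition, and therefore in the same downstream task.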
\begin{figure}
\centering
@ -714,9 +712,11 @@ partition.
\end{figure}
The next stage in the data pipeline is to actually join the impression and
click. This is done by creating a new field in the data, called "is\_clicked".
This field is marked as "true" if a successful join occurs. This is shown in
Figure~\ref{fig:joined}
click events. This is done by another Samza task that creates a new event
with an additional field called "is\_clicked". This field is marked as "true"
if an impression event and a click event with the same impression id are both
present. The original events are discarded, and the new event is sent further
downstream. This join stage is shown in Figure~\ref{fig:joined}.
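A rough sketch of how such a join task might look, again against Samza's
StreamTask API with a local key/value store as the buffer, is shown below; the
store name, topic name, and field names are assumptions. Flushing impressions
whose click never arrives (emitting them with "is\_clicked" set to false after
a timeout, e.g. from a WindowableTask window) is omitted for brevity.

\begin{verbatim}
import java.util.HashMap;
import java.util.Map;
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical join task sketch: the first event to arrive for an impression
// id is buffered in a local key/value store; when its partner arrives, one
// joined event with "is_clicked" set to true is emitted and the buffered
// original is discarded.
public class JoinTask implements StreamTask, InitableTask {
  private static final SystemStream JOINED =
      new SystemStream("kafka", "ad-events-joined");
  private KeyValueStore<String, Map<String, Object>> buffer;

  @Override
  public void init(Config config, TaskContext context) {
    buffer = (KeyValueStore<String, Map<String, Object>>)
        context.getStore("join-buffer");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    String impressionId = (String) event.get("impression_id");
    Map<String, Object> other = buffer.get(impressionId);
    if (other == null) {
      // Partner event has not arrived yet; buffer this one and wait.
      buffer.put(impressionId, event);
      return;
    }
    // Both the impression and the click have now been seen: merge them into
    // a single joined event, mark it as clicked, and drop the buffered copy.
    Map<String, Object> joined = new HashMap<>(other);
    joined.putAll(event);
    joined.put("is_clicked", true);
    buffer.delete(impressionId);
    collector.send(new OutgoingMessageEnvelope(JOINED, impressionId, joined));
  }
}
\end{verbatim}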
\begin{figure}
\centering