diff --git a/publications/radstack/radstack.pdf b/publications/radstack/radstack.pdf
index bee0604e6aa..50e52b3d45e 100644
Binary files a/publications/radstack/radstack.pdf and b/publications/radstack/radstack.pdf differ
diff --git a/publications/radstack/radstack.tex b/publications/radstack/radstack.tex
index 9b904c146c3..655f46b5b12 100644
--- a/publications/radstack/radstack.tex
+++ b/publications/radstack/radstack.tex
@@ -676,10 +676,11 @@
 To understand a real-world pipeline, let's consider an example from online
 advertising. In online advertising, events are generated by impressions (views)
 of an ad and clicks of an ad. Many advertisers are interested in knowing how
 many impressions of an ad converted into clicks. Impression streams and click
-streams are often recorded as separate streams by ad servers and need to be
-joined. An example of data generated by these two event streams is shown in
-Figure~\ref{fig:imps_clicks}. Every event has a unique id or key that
-identifies the ad served. We use this id as our join key.
+streams are almost always generated as separate streams by ad servers. Recall
+that Druid does not support joins at query time, so the joined events must be
+generated at processing time. An example of data generated by these two event
+streams is shown in Figure~\ref{fig:imps_clicks}. Every event has a unique
+impression id that identifies the ad served. We use this id as our join key.
 \begin{figure}
 \centering
@@ -692,16 +693,13 @@
 located in two different Kafka partitions on two different topics.
 \label{fig:imps_clicks}
 \end{figure}
 
-Given that Druid does not support joins in queries, we need to do this join at
-the data processing level. Our approach to do streaming joins is to buffer
-events for a configurable period of time. We can leverage any key/value
-database as the buffer, although we prefer one that has relatively high read
-and write throughput. Events are typically buffered for 30 minutes. Once an
-event arrives in the system with a join key that exists in the buffer, we
-perform the first operation in our data pipeline: shuffling. A shuffle
-operation writes events from our impressions and clicks streams to Kafka such
-that the events that need to be joined are written to the same Kafka partition.
-This is shown in Figure~\ref{fig:shuffled}.
+The first stage of the Samza processing pipeline is a shuffle step. Events are
+written to a keyed Kafka topic based on the hash of an event's impression id.
+This ensures that the events that need to be joined are written to the same
+Kafka partition. YARN containers running Samza tasks may read from one or more
+Kafka partitions, so it is important that the Samza task performing the join
+actually receives both events that need to be joined. This shuffle stage is
+shown in Figure~\ref{fig:shuffled}.
 \begin{figure}
 \centering
@@ -714,9 +712,11 @@
 partition.
 \end{figure}
 
 The next stage in the data pipeline is to actually join the impression and
-click. This is done by creating a new field in the data, called "is\_clicked".
-This field is marked as "true" if a successful join occurs. This is shown in
-Figure~\ref{fig:joined}
+click events. This is done by another Samza task that creates a new event with
+an additional field called "is\_clicked". This field is marked as "true" if an
+impression event and a click event with the same impression id are both
+present. The original events are discarded, and the new event is sent further
+downstream. This join stage is shown in Figure~\ref{fig:joined}.
 \begin{figure}
 \centering
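
A minimal sketch of the shuffle stage described in the patch, using Samza's
low-level StreamTask API. The output topic name "shuffled_events", the
"impression_id" field, and the Map-shaped events are illustrative assumptions,
not names taken from the paper. The key point is that using the impression id
as the message key lets Kafka's default partitioner (hash of key modulo
partition count) place an impression and its click in the same partition.

```java
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ShuffleTask implements StreamTask {
  // Hypothetical output topic; both impressions and clicks are re-keyed here.
  private static final SystemStream SHUFFLED =
      new SystemStream("kafka", "shuffled_events");

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Events from both the impressions and clicks topics arrive here.
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    String impressionId = (String) event.get("impression_id");

    // Sending with the impression id as the message key co-locates the
    // impression and its click in the same partition of the output topic,
    // so a single downstream task sees both.
    collector.send(new OutgoingMessageEnvelope(SHUFFLED, impressionId, event));
  }
}
```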
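A matching sketch of the join stage, assuming a local key/value store
(declared as "impression-buffer" in the job config) buffers impressions until
the corresponding click arrives; the store name, the "type" field, and the
event shape are assumptions for illustration. A production task would also
flush unmatched impressions with is_clicked = false after a timeout (e.g., by
also implementing WindowableTask), which is omitted here for brevity.

```java
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class JoinTask implements StreamTask, InitableTask {
  // Hypothetical topic carrying the joined events downstream (to Druid).
  private static final SystemStream JOINED =
      new SystemStream("kafka", "joined_events");

  private KeyValueStore<String, Map<String, Object>> impressions;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Local store backing the buffer of not-yet-joined impressions.
    impressions = (KeyValueStore<String, Map<String, Object>>)
        context.getStore("impression-buffer");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    String id = (String) event.get("impression_id");

    if ("impression".equals(event.get("type"))) {
      // Buffer the impression until its click arrives (or never does).
      impressions.put(id, event);
      return;
    }

    // A click: if the matching impression is buffered, emit one new event
    // with is_clicked = true and discard the two originals.
    Map<String, Object> impression = impressions.get(id);
    if (impression != null) {
      impression.put("is_clicked", true);
      collector.send(new OutgoingMessageEnvelope(JOINED, id, impression));
      impressions.delete(id);
    }
  }
}
```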