
A Practical Overview of Storm's Trident API


On the 10th of April Pere gave a Trident hackathon at Berlin's Big Data Beers. There was also a parallel Disco hackathon by Dave from Continuum Analytics. It was a lot of fun! The people who came had the chance to learn the basics of Trident in the Storm session while trying it out right away. The hackathon covered the most basic aspects of the API, the philosophy and typical use cases of Storm, and included a simple exercise that manipulated a stream of fake tweets. The project with the session guideline, some runnable examples and the tweet generator can be found on GitHub.


In this post we give an introductory overview of Trident's API, which can be followed along with the help of the aforementioned GitHub project.

Storm (in a nutshell)

In a nutshell, Storm is about real-time processing of data streams. What makes it different is a higher level of abstraction than simple message passing (which allows defining topologies as a DAG), per-process fault-tolerance and guaranteed at-least-once semantics for every message in the system.
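
To make the DAG idea concrete, here is a rough sketch of how a plain (non-Trident) Storm topology is wired together. MySpout, MyCleaningBolt and MyCountingBolt are hypothetical placeholders, not classes from the hackathon project:

// A hypothetical plain Storm topology: one spout feeding two bolt layers, forming a small DAG.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new MySpout(), 2);            // 2 spout instances
builder.setBolt("clean", new MyCleaningBolt(), 4)        // 4 parallel cleaning bolts
       .shuffleGrouping("events");                       // random routing from the spout
builder.setBolt("count", new MyCountingBolt(), 2)        // 2 counting bolts
       .fieldsGrouping("clean", new Fields("user"));     // same "user" always goes to the same instance
StormTopology topology = builder.createTopology();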

A typical use case is the cleaning, pre-processing and pre-aggregation of many concurrent messages (think of logs, clicks, sensor data, …). A typical Big Data real-time processing architecture may be a system where messages are read from a Kafka queue, pre-processed and pre-aggregated by Storm, and persisted into a NoSQL database such as Cassandra, or into Hadoop's HDFS for later deep analysis.

Trident (in a nutshell)

Trident is an interesting abstraction on top of Storm. Besides providing higher-level constructs "à la Cascading", it batches groups of Tuples to 1) make reasoning about processing easier and 2) encourage efficient data persistence, with the help of an API that can even provide exactly-once semantics for some cases.

We need to be aware that storing in-memory state in Bolts is not fault-tolerant: if a node dies, the process will get reassigned, but the state can't be recovered. There is a ticket for that, but meanwhile the wisest way of using Storm is to persist to a reliable database, and for that Trident is particularly useful. If we are dealing with Big Data, we need to batch things so as not to overload our datastore with one update per message. Trident does that for us by batching groups of Tuples and providing an aggregation API.

Getting started: each()

The starting point for our tutorial is the class Skeleton. Here we see the usage of FakeTweetsBatchSpout, which is the Spout that generates a stream of fake, random tweets. We can change the batch size of the Spout via a constructor parameter. The first operation we will see is each(), which allows us to manipulate every Tuple in the batch with either a Filter or a Function. We can implement a Filter that keeps only the tweets from a certain actor:

public static class PerActorTweetsFilter extends BaseFilter {
  String actor;

  public PerActorTweetsFilter(String actor) {
    this.actor = actor;
  }
  @Override
  public boolean isKeep(TridentTuple tuple) {
    // Keep the Tuple only if its first field (the actor) matches the configured actor.
    return tuple.getString(0).equals(actor);
  }
}

We can chain the previous Filter in the topology definition:

topology.newStream("spout", spout)
  .each(new Fields("actor", "text"), new PerActorTweetsFilter("dave"))
  .each(new Fields("actor", "text"), new Utils.PrintFilter());

Observe the input field selector, where we have selected "actor" and "text". The Tuples that reach the each() call may have more fields, but the selector lets us pass only a subset of them to the operation. Therefore the input Tuple to the Filter will be an array with the actor at position 0 and the tweet text at position 1. We also chained another Filter that just prints every Tuple that passes by, so that we can visualize the output. The behavior of this topology is pretty predictable: it will filter out any tweet that was not written by "dave". Now let's see an example Function:

public static class UppercaseFunction extends BaseFunction {
  @Override
  public void execute(TridentTuple tuple, TridentCollector collector) {
    // Emit the string at position 0 of the input Tuple, uppercased.
    collector.emit(new Values(tuple.getString(0).toUpperCase()));
  }
}

This dummy Function just emits the string at position 0 of the Tuple, uppercased. We can chain the Function in the topology, for instance:

topology.newStream("spout", spout)
  .each(new Fields("actor", "text"), new PerActorTweetsFilter("dave"))
  .each(new Fields("text", "actor"), new UppercaseFunction(), new Fields("uppercased_text"))
  .each(new Fields("actor", "text", "uppercased_text"), new Utils.PrintFilter());

There are two things to observe here. The first is the input field selector for the Function: we put the text of the tweet in the first position of the Tuple and the actor in the second, so that the uppercased string will indeed be the text (and not the actor). The second is the output fields declaration, which we always need to provide for every Function call. The output Tuples after the Function call will have the output fields appended (so we can't name an output field after one of the input fields). The code above will uppercase all tweets by "dave" and print both the original text and the uppercased text.

As a side note, every call to each() allows us to do an implicit projection of the Tuples by selecting a subset of their fields (the non-selected fields will still be available in successive calls), but if for some reason we need to project them explicitly we can use the project() API method.
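
For instance, a minimal sketch of an explicit projection (reusing the spout and Utils.PrintFilter from the example project) could look like this; after project(), only the "text" field remains in the stream:

topology.newStream("spout", spout)
  // Keep only the "text" field; all other fields are dropped from the stream.
  .project(new Fields("text"))
  .each(new Fields("text"), new Utils.PrintFilter());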

Making things more interesting: parallelismHint() and partitionBy()

Let’s go back to the simple Filter example. What happens when we define a topology like this?

topology.newStream("spout", spout)
  .each(new Fields("actor", "text"), new PerActorTweetsFilter("dave"))
  .parallelismHint(5)
  .each(new Fields("actor", "text"), new Utils.PrintFilter());

parallelismHint() makes the part of the topology up to the point where we place it execute with the specified degree of parallelism. This is not exactly true, but for now it is a good enough way to understand it; we will refine the definition below. To better visualize this, we can modify PerActorTweetsFilter as follows:

public static class PerActorTweetsFilter extends BaseFilter {

  private int partitionIndex;
  private String actor;

  public PerActorTweetsFilter(String actor) {
    this.actor = actor;
  }
  @Override
  public void prepare(Map conf, TridentOperationContext context) {
    // Remember which partition (parallel instance) this Filter is running in.
    this.partitionIndex = context.getPartitionIndex();
  }
  @Override
  public boolean isKeep(TridentTuple tuple) {
    boolean keep = tuple.getString(0).equals(actor);
    if(keep) {
      System.err.println("I am partition [" + partitionIndex + "] and I have kept a tweet by: " + actor);
    }
    return keep;
  }
}

If we run the topology now we will obtain logs like:

I am partition [4] and I have kept a tweet by: dave
I am partition [3] and I have kept a tweet by: dave
I am partition [0] and I have kept a tweet by: dave
I am partition [2] and I have kept a tweet by: dave
I am partition [1] and I have kept a tweet by: dave

This makes it clear that the Filter is being executed in parallel by 5 different processes. We also have 5 Spouts now (you can grep the logs for the "Open Spout instance" message to check it). What happens if we want only 2 Spouts and 5 filtering processes?

topology.newStream("spout", spout)
  .parallelismHint(2)
  .shuffle()
  .each(new Fields("actor", "text"), new PerActorTweetsFilter("dave"))
  .parallelismHint(5)
  .each(new Fields("actor", "text"), new Utils.PrintFilter());

The shuffle() method is a repartitioning operation. There are others, like partitionBy() or global(). Repartitioning allows us to specify how Tuples should be routed to the next processing layer, and it also lets different layers run with different degrees of parallelism. shuffle() routes Tuples randomly, while partitionBy() routes them based on a consistent hash of the Fields we specify. Now that we have introduced all these concepts, we can refine the previous definition of parallelismHint(): it applies a certain degree of parallelism to all operations before it, back until the previous repartitioning operation.

Let's replace shuffle() with partitionBy(new Fields("actor")). What do you think will happen?
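
For reference, this is a sketch of the previous topology with shuffle() swapped for partitionBy():

topology.newStream("spout", spout)
  .parallelismHint(2)
  // Route Tuples by a consistent hash of the "actor" field instead of randomly.
  .partitionBy(new Fields("actor"))
  .each(new Fields("actor", "text"), new PerActorTweetsFilter("dave"))
  .parallelismHint(5)
  .each(new Fields("actor", "text"), new Utils.PrintFilter());

Running it, we now get logs like: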

I am partition [2] and I have kept a tweet by: dave
I am partition [2] and I have kept a tweet by: dave
I am partition [2] and I have kept a tweet by: dave
I am partition [2] and I have kept a tweet by: dave

By using partitionBy(new Fields("actor")) we are saying that all Tuples from the same actor should go to exactly the same process, so of course now only one of the five partitions is filtering Tuples by "dave".

Aggregation

We said before that Trident is about processing batches of Tuples. A “batch” operation that comes naturally to one’s mind is aggregation. Trident provides primitives for aggregating batches. Let’s see one example:

public static class LocationAggregator extends BaseAggregator<Map<String, Integer>> {

  @Override
  public Map<String, Integer> init(Object batchId, TridentCollector collector) {
    // Called at the start of each batch: begin with an empty map of counts per location.
    return new HashMap<String, Integer>();
  }

  @Override
  public void aggregate(Map<String, Integer> val, TridentTuple tuple, TridentCollector collector) {
    // Called for every Tuple in the batch: increment the count for the Tuple's location.
    String location = tuple.getString(0);
    val.put(location, MapUtils.getInteger(val, location, 0) + 1);
  }

  @Override
  public void complete(Map<String, Integer> val, TridentCollector collector) {
    // Called at the end of the batch: emit the accumulated counts as a single Tuple.
    collector.emit(new Values(val));
  }
}

This Aggregator is very simple: the idea is to process every batch of Tuples and obtain a map of counts by location. Through this example we see the Aggregator interface: Trident calls init() at the beginning of each batch, aggregate() for every Tuple in the batch and complete() at the end of the batch. We can use the collector at any time, but here we chose to use it only at the end, for efficiency. We could use the output of this Aggregator to update a database, for instance.

We can test it by using the aggregate() method. The aggregate() method is a repartitioning operation too: it aggregates all the Tuples of a batch in a single, random process. To minimize the data sent over the network, and if our aggregation logic allows it, we can also use a CombinerAggregator (a sketch of one is shown a bit further below). For now, let's stick to the lower-level Aggregator interface and test it:

topology.newStream("spout", spout)
  .aggregate(new Fields("location"), new LocationAggregator(), new Fields("location_counts"))
  .each(new Fields("location_counts"), new Utils.PrintFilter());

We get results like:

[{USA=3, Spain=1, UK=1}]
[{USA=3, Spain=2}]
[{France=1, USA=4}]
[{USA=4, Spain=1}]
[{USA=5}]

Observe how the sum of all counts is always 5. This is because the default batch size of the spout we are using is 5.
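
As an aside, here is a minimal sketch of what a CombinerAggregator could look like. This one simply counts Tuples (roughly equivalent to the built-in Count used later in this post); its shape lets Trident compute partial results per partition before merging them, which is what saves network traffic:

public static class CountCombiner implements CombinerAggregator<Long> {
  @Override
  public Long init(TridentTuple tuple) {
    // One count per incoming Tuple.
    return 1L;
  }

  @Override
  public Long combine(Long val1, Long val2) {
    // Merge two partial counts (possibly coming from different partitions).
    return val1 + val2;
  }

  @Override
  public Long zero() {
    // Identity value, used when a partition has no Tuples.
    return 0L;
  }
}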

Let’s increase the batch size to 100:

FakeTweetsBatchSpout spout = new FakeTweetsBatchSpout(100);

And let's modify the topology a little bit and try to guess what will happen:

topology.newStream("spout", spout)
  .partitionBy(new Fields("location"))
  .partitionAggregate(new Fields("location"), new LocationAggregator(), new Fields("location_counts"))
  .parallelismHint(3)
  .each(new Fields("location_counts"), new Utils.PrintFilter());

Now we get results like:

[{France=10, Spain=5}]
[{USA=63}]
[{UK=22}]

Indeed, partitionAggregate() is not a repartitioning operation. Instead, it runs an aggregation function on the part of the batch that each partition is managing. We have partitioned by location, we have 3 partitions, and there are only 4 locations in our dataset. Because of all that, the partitioning assigns France and Spain to one partition, USA to another and UK to another.

The previous examples can be a little mind-bending, but this sort of experiment is key to really understanding a tool, so be patient. What comes next is a bit more intuitive.

groupBy()

The following piece of code will appear more intuitive to you:

topology.newStream("spout", spout)
  .groupBy(new Fields("location"))
  .aggregate(new Fields("location"), new Count(), new Fields("count"))
  .each(new Fields("location", "count"), new Utils.PrintFilter());

But what does this do, really?

...
[France, 25]
[UK, 2]
[USA, 25]
[Spain, 44]
[France, 26]
[UK, 3]
...

This code produces per-country counts even with no parallelism at all. Indeed, we used a fairly simple Aggregator: the built-in Count(), much simpler than the one we built before. groupBy() creates a GroupedStream, logically grouped by some Fields. This grouping modifies the behavior of the following aggregate() call: instead of aggregating the whole batch, it aggregates each group independently. So it is as if we had split our Stream into multiple Streams, one per distinct group in the batch (if thinking of it like this helps at all).

Keep in mind, however, that groupBy() is not a repartitioning operation per se. groupBy() followed by aggregate() is, but groupBy() followed by partitionAggregate() is not. As homework, think about this and experiment with it.

Conclusion

Up to here we have covered a few fundamental primitives of Trident. There are several things we didn't talk about, such as the state API. We hope, however, that the few concepts we did show were introduced in a clear way.

You can play with the GitHub project and implement several toy examples:

  • Per-hashtag counts.
  • Last three tweets for every actor.
  • Most used words per actor.
  • Most used words.
  • Trending hashtags in a window of time.

Some of them will involve maintaining some state. The Trident state encapsulation is an option (we might cover it in another post; see the brief sketch below for a taste). Another option is to connect to a database directly from an Aggregator or Function. Yet another option is to maintain state in memory in the processes, keeping in mind that this is not really fault-tolerant and is discouraged for production use.
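
As a very small taste of the state option (just a sketch, not covered further in this post), the per-location count from the groupBy() section could be persisted with persistentAggregate(). MemoryMapState is an in-memory state implementation bundled with Storm and meant for testing:

// A minimal sketch: persist per-location counts through Trident's state API.
// MemoryMapState keeps the state in memory, so it is only suitable for testing.
TridentState locationCounts = topology.newStream("spout", spout)
  .groupBy(new Fields("location"))
  .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));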

