(This is the first post in a series of three presenting Splout SQL 0.2.2's native integration with the main Hadoop processing tools: Cascading, Hive and Pig.)
In this post we'll present an example Big Data use case: analyzing and indexing a large amount of Apache logs from an e-Commerce website and serving them in a low-latency “customer service” web application that needs fine-grained, per-user detail for troubleshooting and for running “loyalty campaigns”. For that we will marry Cascading, an agile, high-level Java framework for Hadoop, with Splout SQL, a high-performance, low-latency partitioned SQL for Hadoop. We'll see how to develop a solution that is fully scalable both in processing and in serving, in barely 200 lines of code. We'll also provide a simple JavaScript frontend that uses Splout SQL's REST API via jQuery.
Cascading
Cascading is a framework that allows Java programmers to easily build complex Hadoop Big Data flows by abstracting away Map/Reduce. We like Cascading because it lets us prototype quickly. There are also commonly used non-Java tools built on top of Cascading, such as Cascalog or Scalding. Through this example you will get to see a few of Cascading's primitives and how to put them together.
Splout SQL
We have already presented Splout SQL and showed an example lambda architecture built with it. Splout SQL serves Big Data with low latency under high load, just like any other fast NoSQL store – but it is SQL. Unlike Dremel-like engines, which are meant for ad-hoc, exploratory offline analysis, Splout SQL is built for the web, providing consistent performance under concurrent load. In addition, as you will see through this example, it makes deploying Hadoop-generated datasets very easy.
The e-Commerce “customer service” webapp
This imaginary e-Commerce website has many users and several product categories – the product category can always be parsed from the product URL. There is a “customer service” department that takes care of troubleshooting, answering user calls and maintaining client loyalty. We need to build the backend for a new web application they will use, with the following requirements:
- For any user, retrieve the exact sequence of events that this user made in the website within a certain timeframe. (This helps in detecting the root cause of an issue the user might have had, and can also be valuable information for the technical department in detecting and fixing new bugs).
- For any user, be able to “visualize” an activity “footprint” overview for performing “loyalty actions” or campaigns. (For instance, knowing the top 5 categories the user interacted with in the past days allows “customer service” to offer discounts or other promotions on categories the user is interested in.)

- Scalable both in processing and serving. The amount of data to be queried by the webapp is as big as the amount of data to be analyzed as input (logs).
- Simple to implement.
- Flexible – we can add / change statistics, change the processing business logic and recompute everything easily.
- Reasonably priced.
The solution
In the solution – which is on github – the Apache logs are parsed and analyzed using Cascading, which produces two output files: one with the raw parsed logs and one with a consolidated “groupBy” (user, category, date). Both output files can then be transformed into SQL tables in a Splout SQL tablespace and queried in real-time by the “customer service” webapp.
Using Cascading for the processing allows us to develop and iterate fast. Using Splout SQL for serving the output allows us to perform flexible SQL queries over the analyzed datasets and scale horizontally without having a complex and expensive system underneath.
We'll explain each component of the solution, part by part.
Log processing
Logs produced by the webapp need to be saved to HDFS so they can be processed by Hadoop. This part is out of the scope of the example – but if the simple approach of uploading partial log files to HDFS is not enough, there are tools that can help, such as Flume, Scribe, Kafka or Storm.
Once in HDFS, the logs can be analyzed quite easily using Cascading. We will just parse them and output them, together with a daily aggregation by user and category. This is all the Java code we need to implement the processing business logic:
public void indexLogs(String inputPath, String outputPathLogs, String outputPathAnalytics) {
  // define what the input file looks like, "offset" is bytes from beginning
  TextLine scheme = new TextLine(new Fields("offset", "line"));
  // create SOURCE tap to read a resource from the local file system, if input is not an URL
  Tap logTap = inputPath.matches("^[^:]+://.*") ? new Hfs(scheme, inputPath) : new Lfs(scheme, inputPath);
  // declare the field names we will parse out of the log file
  Fields apacheFields = new Fields("ip", "user", "time", "method", "category", "page", "code", "size");
  // define the regular expression to parse the log file with
  String apacheRegex = "^([^ ]*) +[^ ]* +([^ ]*) +\\[([^]]*)\\] +\\\"([^ ]*) /([^/]*)/([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
  // declare the groups from the above regex we want to keep. each regex group will be given
  // a field name from 'apacheFields', above, respectively
  int[] allGroups = { 1, 2, 3, 4, 5, 6, 7, 8 };
  // create the parser
  RegexParser parser = new RegexParser(apacheFields, apacheRegex, allGroups);
  // create the input analysis Pipe
  Pipe parsePipe = new Each("logs", new Fields("line"), parser, Fields.RESULTS);
  // parse the date and split it into day + month + year
  parsePipe = new Each(parsePipe, new Fields("time"), new DateParser(new Fields("day", "month", "year"),
      new int[] { Calendar.DAY_OF_MONTH, Calendar.MONTH, Calendar.YEAR }, "dd/MMM/yyyy:HH:mm:ss"), Fields.ALL);
  // aggregate by date, user and category
  Pipe analyzePipe = new GroupBy("analyze", parsePipe, new Fields("day", "month", "year", "user", "category"));
  // count each aggregation
  analyzePipe = new Every(analyzePipe, new Count());
  // create a SINK tap to write to the default filesystem.
  // To use the output in Splout, save it in binary (SequenceFile).
  // In this way integration is both efficient and easy (no need to re-parse the file again).
  Tap remoteLogTap = new Hfs(new SequenceFile(Fields.ALL), outputPathLogs, SinkMode.REPLACE);
  Tap remoteAnalyticsTap = new Hfs(new SequenceFile(Fields.ALL), outputPathAnalytics, SinkMode.REPLACE);
  // set the current job jar
  Properties properties = new Properties();
  AppProps.setApplicationJarClass(properties, LogIndexer.class);
  Map<String, Tap> sinks = new HashMap<String, Tap>();
  sinks.put("logs", remoteLogTap);
  sinks.put("analyze", remoteAnalyticsTap);
  // connect the assembly to the SOURCE and SINK taps
  Flow parsedLogFlow = new HadoopFlowConnector(properties).connect(logTap, sinks, parsePipe, analyzePipe);
  // start execution of the flow (either locally or on a cluster)
  parsedLogFlow.start();
  // block until the flow completes
  parsedLogFlow.complete();
}
The “parsePipe” is responsible for parsing the logs, and the “analyzePipe” performs an aggregation over them. We use both as sink outputs for the flow (sinks). Parsing the logs is easy with the RegexParser function.
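To make the parsing step more tangible, here is a minimal, standalone sketch (plain java.util.regex, outside of Cascading) that applies the same regular expression to a made-up log line; the line format and values are purely illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheRegexCheck {
  public static void main(String[] args) {
    // the same regular expression the RegexParser above is configured with
    String apacheRegex = "^([^ ]*) +[^ ]* +([^ ]*) +\\[([^]]*)\\] +\\\"([^ ]*) /([^/]*)/([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
    // a made-up log line, only for illustration
    String line = "192.168.1.1 - user1 [21/Jul/2012:13:14:17 +0100] \"GET /electronics/camera.html HTTP/1.1\" 200 12345";
    Matcher m = Pattern.compile(apacheRegex).matcher(line);
    if(m.matches()) {
      // groups 1..8 map, in order, to: ip, user, time, method, category, page, code, size
      System.out.println("ip=" + m.group(1) + " user=" + m.group(2) + " time=" + m.group(3));
      System.out.println("method=" + m.group(4) + " category=" + m.group(5) + " page=" + m.group(6));
      System.out.println("code=" + m.group(7) + " size=" + m.group(8));
    }
  }
}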
Note how easy it is to reason about your Big Data flow using the Each, GroupBy and Every operations. This is similar to what we saw with Storm's Trident in the previous post: in a way, Trident is to stream processing what Cascading is to batch processing.
Log indexing and serving
To serve efficiently all the structured data that comes out of Cascading we need a scalable database with fast lookup times and consistent performance. By using Splout SQL we can also use the expressiveness of SQL over the data, so we don't need to pre-compute many things beforehand. We get Google Analytics-like variable timelines for all categories and users for free.
The tradeoff is that, in order to use Splout SQL, we need to partition our dataset. But in this case that is not a problem: since we will provide a per-user panel, we can safely partition by “user”.
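To give an idea of what this buys us, here is a hedged sketch of the kind of per-user SQL the webapp would run. The user id and dates are hypothetical, but the table and column names match the ones declared in the indexing code below, and every query filters on “user” because that is the partition key:

// Requirement 1: the exact sequence of events of one user within a timeframe
String userTimeline =
    "SELECT * FROM logs " +
    "WHERE user = 'user1' AND year = 2013 AND month = 2 AND day BETWEEN 1 AND 15;";

// Requirement 2: the user's activity "footprint" - top 5 categories interacted with.
// (The aggregated column is called "count", so we quote it to avoid clashing with the SQL function.)
String topCategories =
    "SELECT category, SUM(\"count\") AS hits FROM analytics " +
    "WHERE user = 'user1' " +
    "GROUP BY category ORDER BY hits DESC LIMIT 5;";

The following code indexes both Cascading outputs into a Splout SQL tablespace partitioned by “user” and deploys it: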

public void deployToSplout(String outputPathLogs, String outputPathAnalytics, String qNode, int nPartitions)
    throws Exception {

  Configuration conf = getConf();

  // add sqlite native libs to DistributedCache
  if(!FileSystem.getLocal(conf).equals(FileSystem.get(conf))) {
    SploutHadoopConfiguration.addSQLite4JavaNativeLibsToDC(conf);
  }

  // delete tablespace-generated files if they already exist
  // (here we derive the generation path from the analytics output path)
  Path outputToGenerator = new Path(outputPathAnalytics + "-generated");
  FileSystem outputPathFileSystem = outputToGenerator.getFileSystem(conf);
  if(outputPathFileSystem.exists(outputToGenerator)) {
    outputPathFileSystem.delete(outputToGenerator, true);
  }

  TablespaceBuilder builder = new TablespaceBuilder();

  // build a Table instance for each table using the builder
  TableBuilder logsTable = new TableBuilder("logs", conf);
  String[] logsColumns = new String[] { "ip", "user", "time", "method", "category", "page", "code", "size",
      "day", "month", "year" };
  logsTable.addCascadingTable(new Path(outputPathLogs), logsColumns, conf);
  logsTable.partitionBy("user");

  TableBuilder analyticsTable = new TableBuilder("analytics", conf);
  String[] analyticsColumns = new String[] { "day", "month", "year", "user", "category", "count" };
  analyticsTable.addCascadingTable(new Path(outputPathAnalytics), analyticsColumns, conf);
  analyticsTable.partitionBy("user");

  builder.add(logsTable.build());
  builder.add(analyticsTable.build());
  // define number of partitions
  builder.setNPartitions(nPartitions);

  // instantiate and call the TablespaceGenerator with the output of the TablespaceBuilder
  TablespaceGenerator viewGenerator = new TablespaceGenerator(builder.build(), outputToGenerator, this.getClass());
  viewGenerator.generateView(conf, SamplingType.DEFAULT, new DefaultSamplingOptions());

  // finally, deploy the generated files
  StoreDeployerTool deployer = new StoreDeployerTool(qNode, conf);
  List<TablespaceDepSpec> specs = new ArrayList<TablespaceDepSpec>();
  specs.add(new TablespaceDepSpec("cascading_splout_logs_example", outputToGenerator.toString(), 1, null));
  deployer.deploy(specs);
}
Note how we used the addCascadingTable method of TableBuilder to index binary Cascading files directly. This is the most efficient way of exporting the output of a Cascading process, as it avoids writing a text file that would need to be parsed again.
In this example we used the Java API to export to Splout SQL, but we could have used the command-line tools as well. In the Java code we invoke both the generation and the deployment process. To do the same with the command-line tools and no Java at all, we would execute the following commands:
hadoop jar splout-*-hadoop.jar generate -tf file:///`pwd`/cascading-export.json -o out-cascading-export
hadoop jar splout-*-hadoop.jar deploy -root out-cascading-export -ts cascading_splout_logs_example -q http://localhost:4412
Where the “cascading-export.json” tablespace descriptor would look like:
{ "name": "cascading_splout_logs_example", "nPartitions": 2, "partitionedTables": [{ "name": "logs", "partitionFields": "user", "tableInputs": [{ "inputType": "CASCADING", "cascadingColumns": "ip,user,time,method,category,page,code,size,day,month,year", "paths": [ "out-clogs-logs" ] }] },{ "name": "analytics", "partitionFields": "user", "tableInputs": [{ "inputType": "CASCADING", "cascadingColumns": "day,month,year,user,category,count", "paths": [ "out-clogs-analytics" ] }] }] }
Try it!
In this e-Commerce webapp example we have illustrated the tight integration between Splout SQL and well-known Hadoop tools such as Cascading. In the next posts we will show how other tools such as Pig and Hive also integrate seamlessly with Splout SQL. The project and the instructions for trying out this example are on github, so try it!