(This is the first post in a series of three presenting Splout SQL 0.2.2's native integration with the main Hadoop processing tools: Cascading, Hive and Pig.)
In this post we'll present an example Big Data use case: analyzing and indexing a large amount of Apache logs from an e-Commerce website and serving them in a low-latency “customer service” web application that needs fine-grained, per-user detail for troubleshooting and for running “loyalty campaigns”. For that we will marry Cascading, an agile, high-level Java framework for Hadoop, with Splout SQL, a high-performance, low-latency partitioned SQL for Hadoop. We'll see how to develop a solution that is fully scalable both in processing and in serving, in barely 200 lines of code. We'll also provide a simple JavaScript frontend that uses Splout SQL's REST API via jQuery.
Cascading
Cascading is a framework that allows Java programmers to easily build complex Hadoop Big Data flows by abstracting away Map/Reduce. We like Cascading because it lets us prototype quickly. There are also commonly used non-Java tools built on top of Cascading, such as Cascalog or Scalding. Through this example you will get to see a few of Cascading's primitives and how to put them together.
Splout SQL
We have already presented Splout SQL and showed an example lambda architecture built with it. Splout SQL serves Big Data with low latency under high load, just like any other fast NoSQL store – but it is SQL. Unlike Dremel-like engines, which are meant for ad-hoc, exploratory offline analysis, Splout SQL is built for the web, providing consistent performance under concurrent load. In addition, as you will see through this example, it makes deploying Hadoop-generated datasets very easy.
The e-Commerce “customer service” webapp
This imaginary e-Commerce website has many users and several product categories – the product category can always be parsed from the product URL. There is a “customer service” department that takes care of troubleshooting, answering user calls and maintaining client loyalty. We need to build the backend for a new web application they will use, with the following requirements:
- For any user, retrieve the exact sequence of events that this user made in the website within a certain timeframe. (This helps in detecting the root cause of an issue the user might have had, and can also be valuable information for the technical department in detecting and fixing new bugs).
- For any user, be able to “visualize” an activity “footprint” overview for performing “loyalty actions” or campaigns. (For instance, knowing the top 5 categories the user interacted with in the past days allows “customer service” to offer discounts or other promotions on categories the user is interested in.)

- Scalable both in processing and serving. The amount of data to be queried by the webapp is as big as the amount of data to be analyzed as input (logs).
- Simple to implement.
- Flexible – we can add / change statistics, change the processing business logic and recompute everything easily.
- Reasonably priced.
The solution
In the solution – which is on github – the Apache logs are parsed and analyzed using Cascading, which produces two output files: one with the raw parsed logs and one with a consolidated “groupBy” (user, category, date). Both output files can then be transformed into SQL tables in a Splout SQL tablespace and queried in real-time by the “customer service” webapp.
Using Cascading for the processing allows us to develop and iterate fast. Using Splout SQL for serving the output allows us to perform flexible SQL queries over the analyzed datasets and scale horizontally without having a complex and expensive system underneath.
We'll explain each component of the solution, part by part.
Log processing
Logs produced by the webapp need to be saved to HDFS so they can be processed by Hadoop. This part is out of the scope of the example – but if the simple approach of uploading partial log files to HDFS is not enough, there are tools that can help, such as Flume, Scribe, Kafka or Storm.
Once in HDFS, the logs can be analyzed quite easily using Cascading. We will just parse them and output them, together with a daily aggregation by user and category. This is all the Java code we need to implement the processing business logic:
public void indexLogs(String inputPath, String outputPathLogs, String outputPathAnalytics) {
  // define what the input file looks like, "offset" is bytes from beginning
  TextLine scheme = new TextLine(new Fields("offset", "line"));
  // create SOURCE tap to read a resource from the local file system, if input is not an URL
  Tap logTap = inputPath.matches("^[^:]+://.*") ? new Hfs(scheme, inputPath) : new Lfs(scheme, inputPath);
  // declare the field names we will parse out of the log file
  Fields apacheFields = new Fields("ip", "user", "time", "method", "category", "page", "code", "size");
  // define the regular expression to parse the log file with
  String apacheRegex = "^([^ ]*) +[^ ]* +([^ ]*) +\\[([^]]*)\\] +\\\"([^ ]*) /([^/]*)/([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
  // declare the groups from the above regex we want to keep. each regex group will be given
  // a field name from 'apacheFields', above, respectively
  int[] allGroups = { 1, 2, 3, 4, 5, 6, 7, 8 };
  // create the parser
  RegexParser parser = new RegexParser(apacheFields, apacheRegex, allGroups);
  // create the input analysis Pipe
  Pipe parsePipe = new Each("logs", new Fields("line"), parser, Fields.RESULTS);
  // parse the date and split it into day + month + year
  parsePipe = new Each(parsePipe, new Fields("time"), new DateParser(new Fields("day", "month", "year"),
      new int[] { Calendar.DAY_OF_MONTH, Calendar.MONTH, Calendar.YEAR }, "dd/MMM/yyyy:HH:mm:ss"), Fields.ALL);
  // aggregate by date, user and category
  Pipe analyzePipe = new GroupBy("analyze", parsePipe, new Fields("day", "month", "year", "user", "category"));
  // count each aggregation
  analyzePipe = new Every(analyzePipe, new Count());
  // create a SINK tap to write to the default filesystem.
  // To use the output in Splout, save it in binary (SequenceFile).
  // In this way integration is both efficient and easy (no need to re-parse the file again).
  Tap remoteLogTap = new Hfs(new SequenceFile(Fields.ALL), outputPathLogs, SinkMode.REPLACE);
  Tap remoteAnalyticsTap = new Hfs(new SequenceFile(Fields.ALL), outputPathAnalytics, SinkMode.REPLACE);
  // set the current job jar
  Properties properties = new Properties();
  AppProps.setApplicationJarClass(properties, LogIndexer.class);
  Map<String, Tap> sinks = new HashMap<String, Tap>();
  sinks.put("logs", remoteLogTap);
  sinks.put("analyze", remoteAnalyticsTap);
  // connect the assembly to the SOURCE and SINK taps
  Flow parsedLogFlow = new HadoopFlowConnector(properties).connect(logTap, sinks, parsePipe, analyzePipe);
  // start execution of the flow (either locally or on a cluster)
  parsedLogFlow.start();
  // block until the flow completes
  parsedLogFlow.complete();
}
The “parsePipe” is responsible for parsing the logs, and the “analyzePipe” performs an aggregation over them. We use both as sink outputs for the flow (sinks). Parsing the logs is easy with the RegexParser function.
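To make the parsing step more tangible, here is a minimal, standalone sketch (plain java.util.regex, outside of Cascading) that applies the same regular expression to a made-up log line; the line format and values are purely illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheRegexCheck {
  public static void main(String[] args) {
    // the same regular expression the RegexParser above is configured with
    String apacheRegex = "^([^ ]*) +[^ ]* +([^ ]*) +\\[([^]]*)\\] +\\\"([^ ]*) /([^/]*)/([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
    // a made-up log line, only for illustration
    String line = "192.168.1.1 - user1 [21/Jul/2012:13:14:17 +0100] \"GET /electronics/camera.html HTTP/1.1\" 200 12345";
    Matcher m = Pattern.compile(apacheRegex).matcher(line);
    if(m.matches()) {
      // groups 1..8 map, in order, to: ip, user, time, method, category, page, code, size
      System.out.println("ip=" + m.group(1) + " user=" + m.group(2) + " time=" + m.group(3));
      System.out.println("method=" + m.group(4) + " category=" + m.group(5) + " page=" + m.group(6));
      System.out.println("code=" + m.group(7) + " size=" + m.group(8));
    }
  }
}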
Note how easy it is to reason about your Big Data flow using the Each, GroupBy and Every operations. This is similar to what we saw with Storm's Trident in the previous post: in a way, Trident is to stream processing what Cascading is to batch processing.
Log indexing and serving
To serve efficiently all the structured data that comes out of Cascading we need a scalable database with fast lookup times and consistent performance. By using Splout SQL we can also use the expressiveness of SQL over the data, so we don't need to pre-compute many things beforehand. We get Google Analytics-like variable timelines for all categories and users for free.
The tradeoff is that, in order to use Splout SQL, we need to partition our dataset. But in this case that is not a problem: since we will provide a per-user panel, we can safely partition by “user”.
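To give an idea of what this buys us, here is a hedged sketch of the kind of per-user SQL the webapp would run. The user id and dates are hypothetical, but the table and column names match the ones declared in the indexing code below, and every query filters on “user” because that is the partition key:

// Requirement 1: the exact sequence of events of one user within a timeframe
String userTimeline =
    "SELECT * FROM logs " +
    "WHERE user = 'user1' AND year = 2013 AND month = 2 AND day BETWEEN 1 AND 15;";

// Requirement 2: the user's activity "footprint" - top 5 categories interacted with.
// (The aggregated column is called "count", so we quote it to avoid clashing with the SQL function.)
String topCategories =
    "SELECT category, SUM(\"count\") AS hits FROM analytics " +
    "WHERE user = 'user1' " +
    "GROUP BY category ORDER BY hits DESC LIMIT 5;";

The following code indexes both Cascading outputs into a Splout SQL tablespace partitioned by “user” and deploys it: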

public void deployToSplout(String outputPathLogs, String outputPathAnalytics, String qNode, int nPartitions)
    throws Exception {

  Configuration conf = getConf();

  // add sqlite native libs to DistributedCache
  if(!FileSystem.getLocal(conf).equals(FileSystem.get(conf))) {
    SploutHadoopConfiguration.addSQLite4JavaNativeLibsToDC(conf);
  }

  // delete tablespace-generated files if they already exist
  // (here we derive the generation path from the analytics output path)
  Path outputToGenerator = new Path(outputPathAnalytics + "-generated");
  FileSystem outputPathFileSystem = outputToGenerator.getFileSystem(conf);
  if(outputPathFileSystem.exists(outputToGenerator)) {
    outputPathFileSystem.delete(outputToGenerator, true);
  }

  TablespaceBuilder builder = new TablespaceBuilder();

  // build a Table instance for each table using the builder
  TableBuilder logsTable = new TableBuilder("logs", conf);
  String[] logsColumns = new String[] { "ip", "user", "time", "method", "category", "page", "code", "size",
      "day", "month", "year" };
  logsTable.addCascadingTable(new Path(outputPathLogs), logsColumns, conf);
  logsTable.partitionBy("user");

  TableBuilder analyticsTable = new TableBuilder("analytics", conf);
  String[] analyticsColumns = new String[] { "day", "month", "year", "user", "category", "count" };
  analyticsTable.addCascadingTable(new Path(outputPathAnalytics), analyticsColumns, conf);
  analyticsTable.partitionBy("user");

  builder.add(logsTable.build());
  builder.add(analyticsTable.build());
  // define number of partitions
  builder.setNPartitions(nPartitions);

  // instantiate and call the TablespaceGenerator with the output of the TablespaceBuilder
  TablespaceGenerator viewGenerator = new TablespaceGenerator(builder.build(), outputToGenerator, this.getClass());
  viewGenerator.generateView(conf, SamplingType.DEFAULT, new DefaultSamplingOptions());

  // finally, deploy the generated files
  StoreDeployerTool deployer = new StoreDeployerTool(qNode, conf);
  List<TablespaceDepSpec> specs = new ArrayList<TablespaceDepSpec>();
  specs.add(new TablespaceDepSpec("cascading_splout_logs_example", outputToGenerator.toString(), 1, null));
  deployer.deploy(specs);
}
Note how we used the addCascadingTable method of TableBuilder to index binary Cascading files directly. This is the most efficient way of exporting the output of a Cascading process, as it avoids writing a text file that would need to be parsed again.
In this example we used the Java API to export to Splout SQL, but we could have used the command-line tools as well. In the Java code we invoke both the generation and the deployment process. To do the same with the command-line tools and no Java at all, we would execute the following commands:
hadoop jar splout-*-hadoop.jar generate -tf file:///`pwd`/cascading-export.json -o out-cascading-export
hadoop jar splout-*-hadoop.jar deploy -root out-cascading-export -ts cascading_splout_logs_example -q http://localhost:4412
Where the “cascading-export.json” tablespace descriptor would look like:
{ "name": "cascading_splout_logs_example", "nPartitions": 2, "partitionedTables": [{ "name": "logs", "partitionFields": "user", "tableInputs": [{ "inputType": "CASCADING", "cascadingColumns": "ip,user,time,method,category,page,code,size,day,month,year", "paths": [ "out-clogs-logs" ] }] },{ "name": "analytics", "partitionFields": "user", "tableInputs": [{ "inputType": "CASCADING", "cascadingColumns": "day,month,year,user,category,count", "paths": [ "out-clogs-analytics" ] }] }] }
Try it!
In this e-Commerce webapp example we have illustrated the tight integration between Splout SQL and well-known Hadoop tools such as Cascading. In the next posts we will show how other tools such as Pig and Hive also integrate seamlessly with Splout SQL. The project and the instructions for trying out this example are on github, so try it!