(This is the last post in a series of three presenting Splout SQL 0.2.2's native integration with the main Hadoop processing tools: Cascading, Hive and Pig.)
In this post we present an example Big Data use case for a big food retail store. We will summarize each client's purchases using Apache Pig and dump the analysis into Splout SQL so that it can be queried in real time. We will then be able to combine the summarized information with a list of promotional products and suggest discounts on particular products for every client, in real time. This information could easily be used by a coupon printing system to increase client loyalty.
Combining an agile Big Data processing tool such as Pig with a flexible, low-latency SQL querying system such as Splout SQL provides a simple yet effective solution to this problem, which we will simulate throughout this post with almost no effort.
Requirements
In order to follow the steps in this post you should:
- Have Hadoop and Apache Pig installed on your computer. We tested this with Hadoop CDH3 and Pig 0.9.2.
- Have Python installed. The data generation scripts are written in Python.
- Download Pangool’s core JAR, version 0.6.2 or later. The JAR can be downloaded from Maven Central. We will use it in Pig to call a custom StoreFunc.
- Have Splout SQL version 0.2.2 or later on your computer. To use Splout SQL you just need to download its latest distribution, unzip it and start both the QNode and DNode daemons. You can find more information in the “Getting started” section of the official webpage.
Preparing the input data
The input data for this example is twofold:
- A list of products to be promoted. Discounts on these products need to be offered to clients.
- The database of historical payments in the retail store. This is the Big Data part of the system: for a big retailer, this dataset could be enormous. We will generate only a few example payments.
For generating the input datasets we will use the retail_data_generator tool. It contains two Python scripts and one JSON file. The JSON file describes the categories and sub-categories sold in the food retail store, and the Python scripts generate each of the datasets. After decompressing the tool, we can run the following commands:
python gen_prod_list.py > promotional_products.txt
python gen_tickets.py > tickets_sample.txt
(Tip: you can change the parameters of each of the generators by changing the variables inside them. The default values should give you an illustrative example, though.)
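If you just want to see the shape of the generated data without running the tool, the tickets file is essentially tab-separated lines with a date, a client id, a category and a sub-category: the same columns we will load in Pig below. The following minimal Python sketch is not the actual gen_tickets.py script (its categories and parameters are hypothetical), but it produces data in the same format:
import random
import datetime

# Hypothetical stand-in for gen_tickets.py: the real script reads its categories
# and parameters from the JSON file shipped with retail_data_generator.
# It simply emits tab-separated lines: date, client_id, category, sub_category.
CATEGORIES = {
    "dairy": ["milk", "cheese", "yogurt"],
    "bakery": ["bread", "pastry"],
    "drinks": ["water", "juice", "soda"],
}
N_CLIENTS = 100
N_TICKET_LINES = 10000

for _ in range(N_TICKET_LINES):
    date = datetime.date(2013, 1, 1) + datetime.timedelta(days=random.randint(0, 364))
    client_id = random.randint(0, N_CLIENTS - 1)
    category = random.choice(list(CATEGORIES.keys()))
    sub_category = random.choice(CATEGORIES[category])
    print("\t".join([str(date), str(client_id), category, sub_category]))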
We can now upload the generated files to HDFS:
hadoop fs -put promotional_products.txt promotional_products.txt
hadoop fs -put tickets_sample.txt tickets_sample.txt
Summarizing clients’ purchases in Pig
To keep the example simple, we will summarize all-time data for each client so as to obtain the “top categories purchased” by each client. We will obtain both the top categories and the top sub-categories, as this will allow us to better understand each client’s behavior. The Pig Latin code for the summary is:
a = LOAD 'tickets_sample.txt' USING PigStorage('\t') AS (date, client_id, category, sub_category);
b = GROUP a BY (client_id, sub_category);
c = FOREACH b GENERATE FLATTEN(group) AS (client_id, sub_category), COUNT(a) AS count;
d = FOREACH c GENERATE (chararray)client_id, (chararray)sub_category, count;
top_subcategories = ORDER d BY client_id, count DESC;
e = GROUP a BY (client_id, category);
f = FOREACH e GENERATE FLATTEN(group) AS (client_id, category), COUNT(a) AS count;
g = FOREACH f GENERATE (chararray)client_id, (chararray)category, count;
top_categories = ORDER g BY client_id, count DESC;
We can use Pig’s handy ILLUSTRATE command to check that the resulting data is indeed what we need. ILLUSTRATE picks a sample of the input dataset and quickly applies the data flow to it:
ILLUSTRATE top_categories;
ILLUSTRATE top_subcategories;
Finally, in order to run the flow for real, we can use the STORE command. For that we need to load Pangool’s core JAR into Pig (substitute the path as needed):
REGISTER /home/.../pangool/pangool-core-0.60.2.jar;
The STORE commands:
STORE top_subcategories INTO 'retail_top_subcategories' USING com.datasalt.pangool.pig.PangoolStoreFunc('retail_top_subcategories', 'client_id', 'sub_category', 'count');
STORE top_categories INTO 'retail_top_categories' USING com.datasalt.pangool.pig.PangoolStoreFunc('retail_top_categories', 'client_id', 'category', 'count');
Note how we used a custom Pangool function called “PangoolStoreFunc”. This function allows us to save the output of a Pig process into a binary Tuple file with a defined schema. This file can then be processed directly by Splout SQL without needing to redefine its input schema. Observe how the first argument to PangoolStoreFunc is the table name, and the rest of the arguments are column names. The SQL tables in Splout will be named accordingly.
Deploying the analysis to Splout SQL
Now that we have summarized each client’s purchases, we will deploy the resulting tables, together with the promotional products, to Splout SQL. This way we will be able to perform SQL queries over the three tables in real time. For convenience, we will create a tablespace JSON descriptor called “retail_tablespace.json” in Splout’s home folder, as follows:
{ "name": "retail_example", "nPartitions": 2, "partitionedTables": [{ "name": "retail_top_categories", "partitionFields": "client_id", "tableInputs": [{ "inputType": "TUPLE", "paths": [ "retail_top_categories" ] }] },{ "name": "retail_top_subcategories", "partitionFields": "client_id", "tableInputs": [{ "inputType": "TUPLE", "paths": [ "retail_top_subcategories" ] }] }], "replicateAllTables": [{ "name": "promotional_products", "schema": "category:string, sub_category:string, product:string", "tableInputs": [{ "paths": [ "promotional_products.txt" ] }] }] }
Note how we have specified two partitioned tables (retail_top_categories, retail_top_subcategories) and one “replicate-all” table (promotional_products). The first two tables come from Pig and are the Big Data part. Splout SQL uses data partitioning to scale, so we need to partition our Big Data. Because we are only interested in “point queries” that hit a single client, we can safely partition by client_id. We specified two partitions here, but in a real-life scenario we would need to increase the number of partitions. The third table, on the other hand, is a relatively small file that can be safely replicated in every partition. We can leverage that to perform real-time joins in any partition.
Note that the tables exported from Pig are imported using the “TUPLE” input type. This is because we saved them in Pangool’s binary Tuple format. Also note how we made use of some default configuration for the “promotional_products” table: by default, inputType is “TEXT”, split by a tab character, which is exactly what we wanted.
We can now invoke the appropriate generate & deploy tools of Splout SQL:
hadoop jar splout-*-hadoop.jar generate -tf file:///`pwd`/retail_tablespace.json -o out-retail-example
hadoop jar splout-hadoop-*-hadoop.jar deploy -root out-retail-example -ts retail_example -q http://localhost:4412
The first command creates the binary indexed SQLite files that are later deployed to Splout SQL using the second command.
Implementing the coupon generation system
To offer coupons to clients we can perform a join query as follows:
SELECT product FROM promotional_products, retail_top_categories
WHERE client_id = 0 AND retail_top_categories.category = promotional_products.category
ORDER BY count DESC LIMIT 5;
This query will select 5 products in the promotional products list based on previous client purchases. The join query is based on the product category, but we could have used the sub-category as well:
SELECT product FROM promotional_products, retail_top_subcategories
WHERE client_id = 0 AND retail_top_subcategories.sub_category = promotional_products.sub_category
ORDER BY count DESC LIMIT 5;
Generally speaking, both queries will return different results. Looking at a client’s summarized activity by category or by sub-category offers two different points of view. You can test both queries in Splout SQL’s administration webapp (at localhost:4412).
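Since the generated partitions are plain SQLite files, we can also sanity-check the join logic locally with Python’s sqlite3 module before deploying. In the toy example below the rows are made up for illustration; only the table schemas mirror the ones we deployed:
import sqlite3

# Toy in-memory check of the coupon join query. The rows are made up;
# only the table schemas mirror the ones deployed to Splout SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotional_products (category TEXT, sub_category TEXT, product TEXT);
CREATE TABLE retail_top_categories (client_id TEXT, category TEXT, count INTEGER);
INSERT INTO promotional_products VALUES ('dairy', 'cheese', 'Gouda 200g');
INSERT INTO promotional_products VALUES ('drinks', 'juice', 'Orange juice 1L');
INSERT INTO retail_top_categories VALUES ('0', 'dairy', 42);
INSERT INTO retail_top_categories VALUES ('0', 'drinks', 17);
""")

query = """
SELECT product FROM promotional_products, retail_top_categories
WHERE client_id = 0 AND retail_top_categories.category = promotional_products.category
ORDER BY count DESC LIMIT 5;
"""
for (product,) in conn.execute(query):
    print(product)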
In the following attached HTML we have implemented a simple JavaScript app that queries Splout SQL to print discounts.
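For a server-side variant, the same query can be issued from Python through the QNode’s HTTP interface. The sketch below assumes the /api/query/<tablespace>?key=<key>&sql=<sql> REST endpoint and a JSON response carrying a “result” field; check the Splout SQL documentation for the exact URL and response format of your version:
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical coupon lookup against the Splout SQL QNode over HTTP (Python 3).
# The /api/query/<tablespace>?key=<key>&sql=<sql> endpoint and the "result"
# field are assumptions; check the Splout SQL docs for your version.
QNODE = "http://localhost:4412"
TABLESPACE = "retail_example"

def coupons_for(client_id, limit=5):
    sql = ("SELECT product FROM promotional_products, retail_top_categories "
           "WHERE client_id = {0} AND "
           "retail_top_categories.category = promotional_products.category "
           "ORDER BY count DESC LIMIT {1}").format(client_id, limit)
    # The partition key is the client_id, so the query is routed to the
    # partition holding that client's data.
    params = urlencode({"key": str(client_id), "sql": sql})
    url = "{0}/api/query/{1}?{2}".format(QNODE, TABLESPACE, params)
    with urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload.get("result", [])

if __name__ == "__main__":
    for row in coupons_for(0):
        print(row)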
Conclusion
By combining Pig with Splout SQL we have implemented a simple coupon generation system with almost no effort. This system is able to scale to huge amounts of input data: even with tremendous numbers of clients purchasing products over long periods of time, we can analyse our Big Data in a scalable way using Hadoop (Pig) and still make it available for real-time querying through Splout SQL. The implemented example is very straightforward, but it is easy to imagine more complex variants of it, such as providing each client with a mobile application for checking how their purchase activity evolves over time.