In this post we will see how easy it is to integrate a Pangool MapReduce job with MongoDB, the popular document-oriented NoSQL database. To do so, we will perform a review-scraping task on Qype HTML pages, and the resulting reviews will then be persisted into MongoDB.
Introduction

Pangool

MongoDB

Parsing Qype
There are various approaches for obtaining information about people's opinions and sentiment. A common approach for web pages, which can be applied to Qype (a site where people can review and rate venues), consists of scraping the HTML and parsing the reviews out of it. The architecture we suggest is as follows:
There must be some sort of scalable scraping process (which could be implemented using Storm; take a look at this blog post for an example) that stores the scraped HTML into Hadoop's HDFS. A periodic Hadoop job is executed to parse the newly scraped pages. Then, the parsed reviews are persisted to a MongoDB database directly from the Hadoop process, and a web page is used to render the data in MongoDB. A few things to note:
- A key characteristic of this architecture is that scraping and parsing are isolated. This is necessary, since we will frequently change our parsing algorithm to include more fields in the review. Having all the historically scraped pages in a distributed file system allows us to rebuild our entire dataset periodically.
- Every review will be identified by a "review id" (which we can parse directly from Qype), and this id will be used to insert or update documents in MongoDB. MongoDB itself will handle the uniqueness of reviews in this system.
- We will use a very simple approach for parsing the reviews: Java regexes. We will specify the regexes in a separate file for convenience. Another approach could have been to parse the DOM structure with a flexible HTML parser like TagSoup.
The code
The implemented example can be found on GitHub.
To sum up, and skipping the less relevant parts, what we do is instantiate a Pangool MapOnlyJobBuilder and add the input folder, associated with an inline MapOnlyMapper. Note how we use Pangool's configuration by instance to avoid having to define a separate class for the Mapper, which makes the code more concise. The Mapper just appends every line it sees to a StringBuffer and parses the reviews in its cleanup() method.
The parsing process consists of applying two "master" regexes, which delimit the start and the end of each review, and then a set of per-field regexes to each review's text. We also parse the "place id" once for every HTML page. Finally, we create a BSONObject which we can use for persisting the data to MongoDB.
MapOnlyJobBuilder builder = new MapOnlyJobBuilder(conf);
builder.addInput(new Path(inputFolder), new HadoopInputFormat(TextInputFormat.class),
    new MapOnlyMapper<LongWritable, Text, Text, BSONObject>() {

  StringBuffer inMemoryHtml = new StringBuffer();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Accumulate the whole HTML page in memory; it is parsed in cleanup()
    inMemoryHtml.append(value.toString());
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    String html = inMemoryHtml.toString();
    Matcher startMatcher = startPattern.matcher(html);
    Matcher endMatcher = endPattern.matcher(html);
    Text documentId = new Text();
    // The place id appears once per page and is attached to every review
    Matcher placeMatcher = placePattern.matcher(html);
    placeMatcher.find();
    String placeId = placeMatcher.group(1);
    while(startMatcher.find()) {
      BSONObject review = new BasicBSONObject();
      review.put("place_id", placeId);
      int reviewStart = startMatcher.start();
      endMatcher.find();
      int reviewEnd = endMatcher.start();
      String reviewText = html.substring(reviewStart, reviewEnd);
      // Apply every per-field regex to the text of this review
      for(Map.Entry<String, Pattern> parsingProperty : parsingConfig.entrySet()) {
        Matcher matcher = parsingProperty.getValue().matcher(reviewText);
        if(matcher.find()) {
          review.put(parsingProperty.getKey(), matcher.group(1).trim());
        }
      }
      // The review id becomes the document key
      documentId.set((String) review.get("review_id"));
      context.write(documentId, review);
    }
  }
});
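The parsingConfig map used above associates each review field with the regex that extracts it. One possible way to load it is a plain Java properties file, where each key is a field name and each value is the corresponding regex. This is only a sketch: the helper class, file name and file format are assumptions, and the actual project may organize this differently.

// Hypothetical helper for loading the per-field regexes from a properties file.
// Assumed format: each key is a review field ("review_id", "rating", ...) and
// each value is a regex whose group(1) extracts that field.
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

public class ParsingConfigLoader {

  public static Map<String, Pattern> load(String fileName) throws IOException {
    Properties props = new Properties();
    FileInputStream in = new FileInputStream(fileName);
    try {
      props.load(in);
    } finally {
      in.close();
    }
    Map<String, Pattern> parsingConfig = new HashMap<String, Pattern>();
    for(String field : props.stringPropertyNames()) {
      // Compile every regex once, before the job starts
      parsingConfig.put(field, Pattern.compile(props.getProperty(field)));
    }
    return parsingConfig;
  }
}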
And now for the most relevant part: how do we configure the Pangool MapReduce job to make it persist to MongoDB? Well, mostly just like we would do with a regular Hadoop MapReduce job:
MongoConfigUtil.setOutputURI(conf, "mongodb://localhost/test.qype");
builder.setOutput(new Path(outPath), new HadoopOutputFormat(MongoOutputFormat.class),
    Text.class, BSONObject.class);
With these two lines, we are saying that the output of the Job will be persisted into MongoDB, host "localhost", database "test" and collection "qype". The Job will use Text as the output key (which will be the Mongo ID of every object) and a BSONObject for the rest of the properties of the document. Note the use of Pangool's "HadoopOutputFormat" wrapper for using Hadoop-native OutputFormats.
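To double-check the result after the job has run, we can query the collection directly. The following is a minimal sketch using the legacy MongoDB Java driver API; the review id used in the lookup is just a placeholder.

// Hypothetical verification snippet (not part of the job): looks up one review
// by its Qype review id, which the job wrote as the document key.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class CheckOutput {
  public static void main(String[] args) throws Exception {
    MongoClient mongo = new MongoClient("localhost");  // same host as in the output URI
    DB db = mongo.getDB("test");                       // database "test"
    DBCollection qype = db.getCollection("qype");      // collection "qype"
    DBObject review = qype.findOne(new BasicDBObject("_id", "some-review-id")); // placeholder id
    System.out.println(review);                        // prints the parsed review document, or null
    mongo.close();
  }
}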
Conclusion
In only a few lines we have implemented a scalable processor for scraped HTML pages which, in turn, persists its results into MongoDB. But if we were to evolve this project into a real-life system, we would need to take a few things into account:
- Every mapper will process only one HTML file. This might be OK if we want to perform very fancy, CPU-intensive processing on top of the text, like complex NLP. But generally speaking, if we process a lot of HTML files in every batch, there will be too many quick-finishing Mappers and the process will be sub-optimal. A simple solution to overcome this: the scraping process could buffer several HTML pages into a single file, and the Pangool job would need to take that into account when parsing the HTML.
- Every output record of this Pangool job will hit MongoDB, so we need to take care not to impact query serving too much. An incremental approach where we only process the newly scraped HTML might be enough, taking into account that we might often need to re-parse every HTML page when modifying our algorithms. There exist other approaches for serving data generated from Hadoop; one that is low-latency, scalable and doesn't affect query serving is Splout SQL.
- If we need a more real-time approach, the parsing logic could carefully be moved into the scraper, for instance by adding a processing layer to a Storm topology. In any case, we would still need to save every raw HTML page into HDFS so as to be able to re-process everything when needed. To avoid duplicating logic between layers, we should wrap all the processing logic into a single Java library, as sketched below.
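As an illustration of that last point, here is a minimal sketch of such a shared library. The ReviewParser class is hypothetical (in the example project this logic lives inside the Mapper), but it shows how the same regex-based parsing could be reused by both the Pangool job and, say, a Storm bolt.

// Hypothetical shared parsing library: both the Hadoop job and a real-time
// scraper could call parse(), so the extraction logic lives in one place.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

public class ReviewParser {

  private final Pattern startPattern, endPattern, placePattern;
  private final Map<String, Pattern> parsingConfig;

  public ReviewParser(Pattern startPattern, Pattern endPattern, Pattern placePattern,
      Map<String, Pattern> parsingConfig) {
    this.startPattern = startPattern;
    this.endPattern = endPattern;
    this.placePattern = placePattern;
    this.parsingConfig = parsingConfig;
  }

  /** Parses all reviews found in one scraped HTML page. */
  public List<BSONObject> parse(String html) {
    List<BSONObject> reviews = new ArrayList<BSONObject>();
    // The place id appears once per page
    Matcher placeMatcher = placePattern.matcher(html);
    if(!placeMatcher.find()) {
      return reviews; // no place id found, nothing to parse
    }
    String placeId = placeMatcher.group(1);
    Matcher startMatcher = startPattern.matcher(html);
    Matcher endMatcher = endPattern.matcher(html);
    while(startMatcher.find() && endMatcher.find()) {
      String reviewText = html.substring(startMatcher.start(), endMatcher.start());
      BSONObject review = new BasicBSONObject();
      review.put("place_id", placeId);
      // Apply every per-field regex to the text of this review
      for(Map.Entry<String, Pattern> property : parsingConfig.entrySet()) {
        Matcher matcher = property.getValue().matcher(reviewText);
        if(matcher.find()) {
          review.put(property.getKey(), matcher.group(1).trim());
        }
      }
      reviews.add(review);
    }
    return reviews;
  }
}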