eFORCE
Blogs Home | Corporate Website

Some random thoughts on using Hadoop for database driven applications

‘Can Hadoop help shorten duration of my 48 hr batch process?’, ‘can Hadoop cluster help replace my cluster of high-end servers?’ so you ask. We have heard a number (or variations) of these questions from our clients in past several weeks.   Some of our clients are in the finance or publishing industries – they house large set of databases ranging from MySQL to Oracle to maintain their array of data and run various data warehousing, cleansing and mining jobs. As data sizes have grown exponentially, so have database cluster and server configuration – as well as time it takes to run daily/weekly batch processing jobs and reports.

MapReduce algorithm has gained lot of popularity (and not to mention lot of press coverage as well) over last few months. The algorithm’s inventor, Google, is using it for its own search technology. Hadoop, which is its open source version has been heavily supported by Yahoo and used at its various projects. So obviously, the technology has been proven to efficiently handle datasets that ranges from terabytes to petabytes. Question is whether it can be applied to RDBMS/data warehousing domain? Can it solve the problem of long running ETL jobs Or can it be used to truncate time it takes to run some data mapping jobs? Can it help DBA/IT Developers to get out of constant loop of performance tuning, indexing, tuning and indexing?

All good questions.

But alas, trying to utilize Hadoop for database driven application is not as straightforward as it may sound.  And it may turn out to be not as efficient either. For starters, Hadoop platform is built on top of regular file system. Its architecture has two main components: Hadoop distributed file system (HDFS), which distributes & stores data over several machines, and mapreduce programming framework, which can be used to divide long running jobs into several smaller (map) tasks and then combine results during reduce phase.  Hadoop usually serializes and stores its objects on a regular file system. Since Hadoop itself is built on top of regular file system and not say DBMS, it is difficult to utilize its platform to scale your traditional RDBMS applications. Companies like AsterData and GreenPlum has databases built on top of their own mapreduce  framework.

 Having said that, there are several options to use Hadoop for your database jobs:
  • Check out Hadoop DBInputFormat API (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/db/package-summary.html), which is part of v0.19 and later. It allows users to make JDBC calls from within Hadoop framework.
  • Frameworks built on top of Hadoop platform: Hive, PIG (and Cascading, which is open source but not part of ASF umbrella). These frameworks are designed to provide support for structured data analysis. In addition, Hive uses SQL based syntax. These frameworks can be used to load data from your databases inside Hadoop, perform analysis and generate results or even export results into flat files and load it back into your database.

One must also give careful consideration to what jobs can be transferred to Hadoop.  One rule of thumb we try to follow is analyzing whether a particular job spends more time executing SQL statements or in other computation activities like data analysis, transformation or aggregation.  In most cases, later type of jobs will benefit more from being executed in hadoop cluster environment. Otherwise, firing SQL statements from your 100s-1000s of mapper tasks would be a sure way to get your DBA’s attention!
 
We will try to share some more insights into this topic over next couple of blogs. Stay tuned!

Print | posted on Tuesday, May 05, 2009 2:53 PM

Feedback

Gravatar

# re: Some random thoughts on using Hadoop for database driven applications

Cloudera is organizing and hosting a conference, Hadoop World: NYC, in a few weeks to support the growing Apache Hadoop community. Facebook, Yahoo, Amazon Web Services and IBM will all be making presentations about how they use the technology to support large volumes of data.
3/9/2010 6:56 AM | 10 des meilleurs casinos
Gravatar

# re: Some random thoughts on using Hadoop for database driven applications

You can also check the HIHO framework for moving data between RDBMS and Hadoop. Its Apache licensed open source. More details can be found at:
http://code.google.com/p/hiho/
5/8/2010 5:03 AM | Sonal

 re: Some random thoughts on using Hadoop for database driven applications

HIHO for Hadoop is piece of junk. Don't bother trying it out. There are better tools out there.
6/28/2010 7:40 PM | Hadoop

Post Comment

Title  
Name  
Email
Url
Comment   
Please add 8 and 2 and type the answer here:
Home
Contact
RSS 2.0 Feed
Login
November, 2009 (2)
October, 2009 (2)
May, 2009 (1)

Powered by: