
Big Data & Hadoop Distributed File System

Whenever we come across the words “data” or “database”, the first things that strike our mind are Oracle, SQL queries, SQL Server, DB2 or data warehousing….

But hopefully everyone will agree that the RDBMS (Relational Database Management System) is not always the only solution; there are threshold points and problem areas where it can fail.

Problem Statement:

– Scaling up the application:

Let's take a scenario: there are thousands of terabytes (petabytes) of data, and a single system to operate on them. We cannot blindly say it won't work; it will, but the run might take around 10-15 days (which adds little value at the pace the business runs).

– Reliability issues:

Suppose, to fix the above problem, we procure 1,000-odd cheap computers, which will cut the response time from days to minutes. But the reliability of the cluster, and its sheer size, now become the concern areas.

Solution:

So, what we need to deal with such a situation is an efficient, reliable and usable framework.

And the answer is Hadoop, with its Distributed File System!!

What is Hadoop?

  • Open-source Apache project
  • Written in Java
  • Batch & offline oriented
  • Data & I/O intensive
  • General-purpose framework for creating distributed applications that process huge amounts of data
  • Runs on
    • Linux, Mac OS X, Windows, and Solaris
    • Commodity hardware

Significance of the Name “Hadoop”:

For all the tech-savvy people: it does not have any specific meaning as such. Weird and funny though it may sound, “Hadoop” is named after the stuffed elephant belonging to the son of Doug Cutting (the creator of Apache Lucene) 🙂

And in open source, everyone has the freedom to name things anything they like 🙂

Hadoop is NOT:

  • A relational database
  • An Online Transaction Processing (OLTP) system
  • A structured data store of any kind

Who uses Hadoop?

  • Amazon/A9
  • LinkedIn
  • Facebook
  • Google
  • IBM
  • Joost
  • Last.fm
  • New York Times
  • PowerSet
  • Veoh
  • Yahoo!

Hadoop vs RDBMS

  Hadoop                      | RDBMS
  ----------------------------|------------------------
  Scale out                   | Scale up
  Key/value pairs             | Tables
  Say how to process the data | Say what you want (SQL)
  Offline / batch             | Online / real-time

Major Hadoop core components include:

  • Hadoop Distributed File System (HDFS) – distributes the data
    • HDFS stores data on nodes across the cluster, with the goal of providing high aggregate bandwidth across the cluster.
  • Map/Reduce – distributes the application
    • It is a computational paradigm which takes an application and divides it into multiple fragments of work, each of which can be executed on any node in the cluster (a minimal code sketch follows the next paragraph).

In simple words: “HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information.”
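To make Map/Reduce concrete, here is a minimal sketch using the classic word-count example (a canonical beginner example, not anything specific to this post), written against Hadoop's standard `org.apache.hadoop.mapreduce` API: the mapper emits a count of 1 for every word it sees, and the reducer sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs on whichever node holds a block of the input,
  // emitting (word, 1) for every token in its split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The input and output paths refer to HDFS directories; the framework takes care of splitting the input across nodes, shuffling the (word, count) pairs to reducers, and re-running failed fragments elsewhere. A typical invocation might look like `hadoop jar wordcount.jar WordCount /logs/in /logs/out` (jar name and paths illustrative).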

History of HDFS

  • Inspired by and based on the Google File System (GFS)
  • Motivated by the need for redundant storage of massive amounts of data on cheap, unreliable computers

[Figure: a quick snapshot of the Map/Reduce implementation]

Practical Implementation of Hadoop – in our Day-to-Day Life

With all these theories and jargon… it all looks wow!!

But the very simple questions that arise in everyone's mind are: where exactly can I see or use it? Why only a handful of clients? Can I implement Hadoop in my enterprise?!

Here you go!!

Every day we log on to Facebook, Google, YouTube, LinkedIn, Twitter and many other social networking or community/company sites.

Millions of users log in at the same time and keep posting or browsing, but that is only what happens at the client end.

Now, if we think from the company's perspective, there are GBs and TBs of data getting pumped in as logs, info or trackers based on users' navigation and posts.

The company could very well discard all this data with a periodic batch job. But from a business viewpoint, these logs can actually yield valuable information: mining them for patterns reveals the customer's psychology, which can be used for business benefit.

For example: whenever we search for a video on YouTube or a book on Amazon.com, by our second hit it provides options like “Recommended for you” and “People who liked this also liked…”, or clubs similar posts together on networking sites.

How does all this happen?!

Is this magic?? – of course not!!

And a very important point not to be missed: the data in these logs/archives does not have any structure or pattern (e.g. some might have searched for awesome / awsum /  awesome… / awsm!!!…).

So we cannot expect to query this unstructured data, which fits no schema, with a direct SQL query.
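As a toy illustration of taming that noise, the spelling variants above could be collapsed onto one canonical token before any counting happens. This cleaning step is not part of Hadoop itself; the class name and variant map below are made up purely for illustration.

```java
import java.util.Map;

// Hypothetical normalizer: collapses noisy spelling variants onto one
// canonical token so that later aggregation sees "awesome" for all of them.
public class TokenNormalizer {
  private static final Map<String, String> VARIANTS = Map.of(
      "awsum", "awesome",
      "awsm", "awesome");

  public static String normalize(String raw) {
    // Strip punctuation and ellipses ("awesome...", "awsm!!!") and lower-case.
    String token = raw.toLowerCase().replaceAll("[^a-z]", "");
    return VARIANTS.getOrDefault(token, token);
  }
}
```

In practice such a normalize() call would sit inside the mapper, so every record is cleaned on whichever node processes it.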

But here is where Hadoop rocks!! It keeps dividing the data and splits the work across the clustered servers to come up with a pattern or finding. To be precise: with SQL we just state what we want, and the RDBMS fetches us the result.

In the case of Hadoop, we design our own execution process and data-splitting logic; we can go with round-robin distribution or any customized logic (see the sketch below).
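One place such custom splitting logic can be plugged in is Hadoop's `org.apache.hadoop.mapreduce.Partitioner` hook, which decides which reducer each intermediate record goes to. The round-robin policy below is a minimal sketch under the assumption that keys need no locality, not a description of how any particular site does it.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Minimal round-robin partitioner: instead of the default hash of the key,
// it deals intermediate records out to the reducers in strict rotation.
public class RoundRobinPartitioner extends Partitioner<Text, IntWritable> {
  private int next = 0;

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int partition = next % numPartitions;
    next++;
    return partition;
  }
}
```

It would be registered with `job.setPartitionerClass(RoundRobinPartitioner.class)`. Note the trade-off: round-robin spreads load evenly but breaks the usual guarantee that all values for one key reach the same reducer, so it suits load-spreading jobs rather than per-key aggregation.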

Business Benefits!!

Many business benefits can be drawn from a Hadoop implementation, if we can find a pattern in the logged/dumped data.

Let's say Mr XYZ logs on to the site every day, looks for fiction books in the online store, and is very choosy about discounts. So, in order to retain the customer, the company can target him with goodies/offers based on that analysis.

This is not specific to online stores; it can be extended across different sectors for identifying prospective customers.

This also clarifies why the client list is so short: it is the companies that actually deal with huge amounts of data on a daily basis!!

Highlight of one Implementation @ the New York Times:

  • Needed offline conversion of public domain articles from 1851-1922.
  • Used Hadoop to convert scanned images to PDF
  • Ran 100 Amazon EC2 instances for around 24 hours
  • 4 TB of input
  • 1.5 TB of output


Happy Learning!! 🙂

