There is no counting to the amount of data organisations get. Even more now in this age where we have the hardware capable of storing petabytes to exabytes of data (yep, Google looking right at ya). For many businesses, their critical data used to be limited to their transaction databases and data warehouses. In these kinds of systems, data was organised into orderly rows and columns, where every byte of information was well understood in terms of its nature and its business value.
Also, the variety of the data collected is incredible, popular searches, transactions, the trends of products ordered at Amazon, ratings of your movies, open data initiatives from public and private entities have made massive troves of raw data available for analysis. The earlier tools were a poor match to deal with the complexity of this data and that’s where Hadoop comes in.
Why Big Data Complex ?
Now, the understanding the nature of Big Data is complex and essentially the reason to use Hadoop is because of the “3 V’s of the big data” :
Velocity : Data that enters with a limited window of time, a limited time span that makes it difficult to transform and load the data to the data pool for analysis. Especially the data in the financial sector. The higher is the volume of the data entering the organisation the bigger the velocity challenges.
Volume : The volumes of data Google receives.
Variety : Disorganised data with different and multiple structures, ranging from raw text to log files or in rows or columns or just could be a list of numbers separated by commas (with exceptional cases) . Really its random and chaotic.
Each of these criteria clearly poses its own, distinct challenge to someone wanting to analyse the information.
“So, this Hadoop thing can handle the heat ?”
Yes, it can. At its core, Hadoop is a framework for storing data on large clusters of commodity computer hardware that is affordable and easily available and running applications against that data. A cluster is a group of computers interconnected to each other. These working computers are called as nodes. These computers are capable of working on the same problems individually. They share the computational workloads and take advantage of the large aggregate bandwidth across the cluster. A cluster consists of a master node, which looks over the storage and processing of the incoming data, and under these master nodes work *ahem* slaves, (Yep that’s what its called don’t look at me) which store all the cluster’s data and here the data also gets processed.
Kinda like the group project you work in but its not just you working. Using networks of affordable compute resources to acquire business insight is the key value proposition of Hadoop.
So, there are two components in Hadoop, MapReduce & HDFS. MapReduce is a distributed processing framework and HDFS is the Hadoop Distributed File System.
At its core, if there is an application to be performed, it gets divided among these interconnected machines (or nodes) in a cluster, and the file distributions system (yes HDFS) will store the data that is processed. A cluster with more machines can processes through a data with much more speed as the amount of parallel computing increases. A Hadoop cluster system with thousands of machines, where HDFS stores data, and MapReduce jobs do their processing near the data, which keeps input-output costs low. MapReduce being flexible in nature enables the development of a wide Hadoop: The Definitive Guidevariety of applications.
“Fine! I’ll tell you about MapReduce.”
What MapReduce essentially does is Distributed Processing, this involves the processing of a number of operations on a number of distributed data sets. The data is converted to a <key,value> pair , and the computations have two phases: mapping phase and reducing phase.
During the Map Phase, input data is separated into large fragments, each assigned to a map task.
These map tasks are distributed to across the cluster.
Each map task processes the <key,value> pairs from its assigned fragment and produces a set of intermediate <key,value> pairs.
The intermediate data set is sorted by the key, and the sorted data is partitioned into a number of fragment that matches the number of reduce tasks.
During the Reduce Phase, each reduce task processes the data fragment that was assigned to it and produces an output <key,value> pair.
These reduce tasks are also distributed across the cluster and write their output to HDFS when finished.