What is Big Data? What is Hadoop?
This article is for people who are just starting to learn about Hadoop. I will try to keep it simple so that readers without an IT (information technology) background can follow along as well.
What is Big Data?
To clarify this, let me first describe what data is. Data can be defined as a collection of raw facts or information from which we draw conclusions. For example, .doc, .docx, and .xls files are common; they can be found on any PC or laptop. The information these files contain is what we call data.
Let’s take a quick dive into history. From the beginning of recorded time until 2003, humanity produced about 5 billion gigabytes (5 exabytes) of data. Today, by some estimates, we generate that same amount every few days. How are we producing all this data? It comes from many sources: social networking sites, e-commerce sites, and sensors (NASA, for example, generates petabytes of sensor data). Buried inside these enormous volumes is meaningful information that organisations want to use.
The next question is how we can extract useful information from such a huge amount of data. Traditional tools such as MySQL, IBM DB2, and Oracle are used to extract information, but they are designed for structured data: relational data organised into rows and columns. Unstructured data, by contrast, has no pre-defined data model and is not organised in a pre-defined manner. Examples include server log files, PDF files, and Word documents.
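The contrast above can be sketched in a few lines of Python. This is an illustrative example only (the table, file contents, and regular expression are invented for the sketch): structured data can be queried directly because its schema is known, while an unstructured log line must have structure imposed on it at read time.

```python
import sqlite3
import re

# Structured data: rows and columns with a fixed schema,
# queried directly with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 34.49

# Unstructured data: a raw server log line has no pre-defined
# schema, so structure must be imposed at read time (e.g. a regex).
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36] "GET /index.html" 200'
match = re.search(r'"(GET|POST) (\S+)" (\d{3})', log_line)
method, path, status = match.groups()
print(method, path, status)  # GET /index.html 200
```

Tools like MySQL handle the first case well; big-data platforms exist largely because so much real-world data looks like the second case.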
Another key aspect of big data is velocity. In a relational database running on decent hardware, information can be extracted quickly because the volume of data is modest. But what about extracting data from a one-petabyte file? We can, but the query running in the background may take so long to return a result that the answer is useless by the time it arrives. Velocity therefore also plays a big role in the big data problem.
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006. Apache Hadoop is an open-source software platform for distributed storage and distributed processing of very large data sets on clusters of computers. Its genesis lies in research papers published by Google describing the Google File System and MapReduce. MapReduce is a software framework in which an application is broken down into many small parts; any of these parts, also called fragments or blocks, can run on any node in the cluster.

Let me explain this in more depth. Hadoop uses HDFS (Hadoop Distributed File System), a distributed file system designed to run on large clusters (thousands of machines) of commodity hardware in a reliable, fault-tolerant manner: if one machine fails, the others continue working, so the overall job does not fail. HDFS uses a master/slave architecture. The master is a single NameNode that manages the file system metadata (which block of which file is stored on which node, in a directory-like tree structure), while one or more slave DataNodes store the actual data. A file in HDFS is split into fixed-size blocks (64 MB by default in early Apache Hadoop releases, 128 MB in later ones), and those blocks are stored on the DataNodes, with replicas kept on other DataNodes. HDFS also provides a shell, like any other file system, with a list of commands for interacting with it.

MapReduce then plays the key role in analysing this large amount of data and producing results. The JobTracker coordinates the MapReduce operations across the nodes, and it keeps track of progress with the help of TaskTrackers, which send it heartbeat signals from time to time.
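The map, shuffle, and reduce phases described above can be sketched as a tiny single-process word count in plain Python. This is a conceptual model, not Hadoop's actual API: the function names and the `documents` input are invented for illustration, and in a real cluster the framework runs the map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit an intermediate (key, value) pair for each word.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

# Two small "input splits" standing in for blocks of a large file.
documents = ["big data needs big tools", "hadoop processes big data"]
pairs = [p for text in documents for p in map_phase(text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Hadoop's power comes from running exactly this pattern across thousands of machines, moving the computation to the nodes where the data blocks already live.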
I have covered the basics you need to know about big data and Hadoop. Please suggest improvements or ask questions; your feedback will guide us in improving upcoming articles.