What exactly is Hadoop ?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
When we talk about Hadoop it has 2 things: a. Storage b. Processing / Computation
The Storage is provided by HDFS i.e Hadoop distributed file system and the Processing is done by Map Reduce.
Why Hadoop?
The world has grown enormously in terms of people, culture, emotions, etc and so do science & technology.
Revolutionizing the same in terms of the computer has grown tremendously, In early 1900 IBM has developed a 5MB of a disk which was so huge in terms of hardware. We are now in the age of microchips containing data that started from a floppy disk, Compact Disk, flash drives. We are now surrounded by much much much more data altogether around.
So the question is where we can store them now ??? If I have a huge amount of data then how we have been storing it till now? Also what made me think beyond the normal standpoint of my data storage.
Let me first answer the question: how am I storing my data till now?
Data can be stored anywhere be it in your flash drive, CD, normal file in your computer but do all these provide me the distributed and centralized way of accessing it from anywhere in the world and do some kind of analysis on it.
Consider a store owner who wants to see the number of items in his inventory/items sold in North Carolina is in Virginia.
So if storekeeper in North Carolina has the data in its local system will it be accessible to a person in Virginia?
Definitely not? We need an approach to access the same data from anywhere in the world.
This is where we have something called Relational Database Management System(RDBMS) that comes into the picture to analyze my data.
Now consider what happens if the store owner expands his store units to 100 to 1000 to 10k to 100k 1000k small stores all over the world. Do my RDBMS is now capable of storing a large amount of data ??
The answer is Yes, but it would incur a lot of cost of purchasing the hardware and its corresponding software components
So, what is the new role now ? “Hadoop” is the answer, my dear friend.
Now lets talk about Hadoop under the hood: Click here to know more about Component of Hadoop
Components of Hadoop
Components of Hadoop:
Before the boom of Hadoop, the storage and processing of big data was a big challenge. But now that Hadoop is available, companies have realized the business impact of Big Data and how understanding this data will drive the growth.
Big Data has many useful and insightful applications.
Hadoop is the straight answer for processing Big Data. Hadoop ecosystem is a combination of technologies that have a proficient advantage in solving business problems.
Let us understand the components in the Hadoop Ecosystem: Core Hadoop:HDFS:
HDFS stands for Hadoop Distributed File System for managing big data sets with High Volume, Velocity, and Variety. HDFS implements the master-slave architecture. Master is Name node and slave is data node.HDFS is well known for Big Data storage.
Map Reduce: Map Reduce is a programming model designed to process high volume distributed data. The platform is built using Java for better exception handling. Map Reduce includes two daemons, Job Tracker and Task Tracker.
YARN: YARN stands for Yet Another Resource Negotiator. It is also called as MapReduce 2(MRv2). The two major functionalities of Job Tracker in MRv1, resource management, and job scheduling/ monitoring are split into separate daemons which are ResourceManager, NodeManager, and ApplicationMaster.
Pig: Apache Pig is a high-level language built on top of MapReduce for analyzing large datasets with simple ad-hoc data analysis programs. Pig is also known as Data Flow language. Pig scripts internally will be converted to map-reduce programs.
Hive: Apache Hive is another high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Mahout: Apache Mahout is a scalable machine learning library designed for building predictive analytics on Big Data.
Apache Sqoop: Apache Sqoop is a tool designed for bulk data transfers between relational databases and Hadoop.
What does it do? Import and export to and from HDFS.Import and export to and from Hive. Import and export to HBase.