

Apache Hadoop



How did Hadoop come into the picture?

Massive amounts of data are being generated that are difficult to store and process with a traditional database system. A traditional database management system stores and processes only relational, structured data. In today's world, however, a lot of unstructured data is generated as well, such as images, audio files and videos, and a traditional system fails to store and process these kinds of data. An effective solution to this problem is Hadoop. Hadoop is a framework for processing Big Data: it enables you to store and process large data sets in a parallel and distributed fashion.

HDFS stores files across many nodes in a cluster.

Hadoop follows a master-slave architecture, and HDFS, being its core component, follows the same architecture. NameNode and DataNode are the core components of HDFS.

NameNode:

Maintains and manages the DataNodes.
Records metadata, i.e. information about data blocks, e.g. the location of stored blocks, file sizes, permissions, directory hierarchy, etc. (a client can query this metadata, as sketched after this list).
Receives status (heartbeat) and block reports from all the DataNodes.
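As a concrete illustration, the minimal Java sketch below asks the NameNode for a file's metadata and block locations through the standard HDFS client API. The cluster URI (hdfs://namenode:9000) and the file path are placeholders for your own setup.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster (the URI is a placeholder for your NameNode address).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Ask the NameNode for the metadata of one file.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Size: " + status.getLen()
                + ", replication: " + status.getReplication()
                + ", permissions: " + status.getPermission());

        // Block locations come from the NameNode's metadata, not from the DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```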

DataNode:

Slave daemons that send heartbeat signals to the NameNode.
Stores the actual data in the form of data blocks.
Serves read and write requests from the clients (see the read/write sketch below).
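The sketch below shows a client writing and then reading a small file with the Hadoop Java API, again assuming the placeholder cluster URI: the NameNode supplies block placement and locations, while the DataNodes are the ones that actually receive and serve the bytes.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/hello.txt");

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets the block locations from the NameNode
        // and pulls the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```
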
Secondary NameNode:

This is NOT a backup NameNode. It is a separate service that keeps a copy of both the edit logs (edits) and the filesystem image (fsimage) and merges them to keep the edit log size reasonable. The metadata of the NameNode is managed by two files: fsimage and edit logs.


Fsimage: This file is a snapshot of the complete HDFS namespace (the file system metadata) at a point in time. It is stored on the local disk of the NameNode machine.

Edit logs: This file records the most recent modifications made to the namespace after the last fsimage was written. It is small compared to the fsimage and is also kept on the local disk of the NameNode machine.
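For illustration, the sketch below lists the fsimage and edit-log files on the NameNode's local disk by reading the dfs.namenode.name.dir setting. It has to run on the NameNode machine itself with hdfs-site.xml on the classpath; the fallback path is only an assumption and must be adapted to your installation.

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;

public class NameNodeMetadataFiles {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.namenode.name.dir may hold a comma-separated list; take the first entry.
        // The fallback path below is only an example, not a guaranteed default.
        String nameDir = conf.get("dfs.namenode.name.dir", "/tmp/hadoop/dfs/name");
        String first = nameDir.split(",")[0].trim().replace("file://", "");

        // fsimage_* and edits_* files live under the "current" subdirectory.
        File current = new File(first, "current");
        File[] files = current.listFiles();
        if (files == null) {
            System.out.println("No metadata directory found at " + current);
            return;
        }
        for (File f : files) {
            String name = f.getName();
            if (name.startsWith("fsimage") || name.startsWith("edits")) {
                System.out.println(name + " (" + f.length() + " bytes)");
            }
        }
    }
}
```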

The Secondary NameNode performs the task of checkpointing.

Checkpointing is the process of combining the edit logs with the fsimage (edit logs + fsimage). The Secondary NameNode copies the edit logs and the fsimage from the NameNode, merges them into an updated fsimage, and hands the result back to the NameNode.

Checkpointing happens periodically (every hour by default).
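The interval is driven by configuration. The sketch below reads the two usual checkpoint settings (dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns) with the Hadoop Configuration API; the fallback values shown are the stock defaults, but your cluster's hdfs-site.xml may override them.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        // Loads hdfs-site.xml if it is on the classpath.
        Configuration conf = new Configuration();

        // Seconds between checkpoints; HDFS ships with 3600 (one hour) as the default.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600L);

        // A checkpoint is also triggered once this many uncheckpointed transactions accumulate.
        long txnThreshold = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000L);

        System.out.println("Checkpoint every " + periodSeconds + " s"
                + " or after " + txnThreshold + " transactions");
    }
}
```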

Advantages of HDFS:

Fault tolerance: Each data block is replicated three times (everything is stored on three machines/DataNodes by default) in the cluster. This protects the data against DataNode (machine) failure. See the replication sketch after this list.

Space: If you need more disk space, just add more DataNodes and re-balance the cluster.

Scalability: Unlike traditional database systems, which cannot scale to process large datasets, HDFS is highly scalable because it can store and distribute very large datasets across many nodes that operate in parallel.

Flexibility: It can store any kind of data, whether structured, semi-structured or unstructured.

Cost-effective: HDFS uses direct-attached storage and shares the cost of the network and machines it runs on with MapReduce. It is also open-source software.
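To make the replication point concrete, the short sketch below checks a file's current replication factor and raises it through the client API; the cluster URI and file path are placeholders for your own environment.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/important.csv");

        // Current replication factor of the file (3 on a default cluster).
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + current);

        // Raise it for a critical file; the NameNode schedules the
        // additional block copies on other DataNodes in the background.
        boolean accepted = fs.setReplication(file, (short) 4);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}
```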
