HDFS Part -2

In this post, we will learn how HDFS works; we will cover some basic Hadoop commands in the following posts.

As we learned in the previous post, HDFS has three components: the name node, the data nodes, and the secondary name node. Most of the actual work is done by the data nodes, which are many in number, while the name node keeps the addresses of the data nodes and tracks the tasks assigned to them.

Data is stored in HDFS in chunks called blocks. The default size of a block is 128 MB. Suppose we have a file of 1 GB and we want to copy it to HDFS. The file will be divided into blocks of 128 MB each, so the data will be stored in 8 blocks (1024 MB / 128 MB = 8 blocks).

Suppose we have a file of 1100 MB. How will the data be stored now?
The data will be stored in 9 blocks (8 blocks x 128 MB = 1024 MB; the remaining 76 MB goes into the 9th block).
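To make the arithmetic concrete, here is a tiny Python sketch (assuming the default 128 MB block size; blocks_needed is just an illustrative helper, not a Hadoop API):

    import math

    BLOCK_MB = 128  # default HDFS block size in Hadoop 2.x and later

    def blocks_needed(file_mb):
        # number of blocks a file of file_mb megabytes occupies
        return math.ceil(file_mb / BLOCK_MB)

    print(blocks_needed(1024))  # 1 GB file    -> 8 blocks
    print(blocks_needed(1100))  # 1100 MB file -> 9 blocks (8 full, plus 76 MB in the last one)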

What will happen to the remaining 52 MB in the 9th block?
Nothing is wasted: an HDFS block is a logical unit, so the 9th block occupies only 76 MB on disk, and the leftover 52 MB of disk space remains free for other files. In this way the Hadoop ecosystem does not waste storage, compared to systems that allocate fixed-size blocks in full.
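Here is the same arithmetic as a small sketch (plain Python, again assuming a 128 MB block size; last_block_mb is an illustrative helper):

    BLOCK_MB = 128

    def last_block_mb(file_mb):
        # data actually stored in the final, partially filled block
        remainder = file_mb % BLOCK_MB
        return remainder if remainder else BLOCK_MB

    # The 9th block of the 1100 MB file holds only 76 MB on disk,
    # so the file consumes 1100 MB in total, not 9 x 128 = 1152 MB.
    print(last_block_mb(1100))             # 76
    print(BLOCK_MB - last_block_mb(1100))  # 52 MB of the block left over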

How is data retrieved in HDFS?
When we want to read a file, the name node looks up which data nodes hold the file's blocks and returns their addresses; the client then reads the blocks directly from those data nodes.
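Here is a toy Python simulation of that read path (the dictionary stands in for the name node's metadata; the paths and node names are made up, and this is not the real HDFS client API):

    # Toy read path: the name node only hands out block locations;
    # the client then reads each block directly from a data node.
    name_node_metadata = {
        "/data/file.txt": {
            "blk_001": ["dn1", "dn2", "dn3"],  # block -> replica locations
            "blk_002": ["dn2", "dn4", "dn5"],
        }
    }

    def read_file(path):
        blocks = name_node_metadata[path]      # 1. ask the name node where the blocks are
        for block_id, replicas in blocks.items():
            datanode = replicas[0]             # 2. pick one replica (e.g. the closest)
            print("reading", block_id, "from", datanode)  # 3. stream it from that data node

    read_file("/data/file.txt")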

How is data stored by the data nodes?
In Hadoop, we use a replication factor to store the data. Usually a replication factor of three is used (this is also the default). This means three copies of each block are created, each copy on a different data node.
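A minimal sketch of replica placement, ignoring racks for now (the node names and the place_replicas helper are invented for illustration):

    import random

    data_nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
    REPLICATION_FACTOR = 3

    def place_replicas():
        # pick three distinct data nodes for one block (racks ignored here)
        return random.sample(data_nodes, REPLICATION_FACTOR)

    print(place_replicas())  # e.g. ['dn4', 'dn1', 'dn3']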

What will happen if a data node goes down?
As the data is stored with replication, if one data node goes down, the data can still be retrieved from another active data node.
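A small sketch of that failover logic (illustrative only; the real HDFS client handles this transparently):

    def read_block(block_id, replicas, dead_nodes):
        # try each replica in turn, skipping data nodes known to be down
        for dn in replicas:
            if dn not in dead_nodes:
                return "read " + block_id + " from " + dn
        raise IOError("all replicas of " + block_id + " are unavailable")

    # dn1 is down, so the read quietly falls back to dn2
    print(read_block("blk_001", ["dn1", "dn2", "dn3"], dead_nodes={"dn1"}))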

How does the name node know which data node is active?
Data nodes constantly send heartbeats to the name node. If the name node stops receiving these healthy signals from a data node, it assumes there is either a network error or the data node is dead. It waits for some time, and if the data node does not come back up, it treats it as a data node failure.
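Conceptually, the name node's bookkeeping looks something like this (the 30-second timeout is illustrative only; the real name node waits much longer, around ten minutes by default, before declaring a node dead):

    import time

    HEARTBEAT_TIMEOUT = 30  # seconds, illustrative only

    last_heartbeat = {}  # data node -> time its last heartbeat was received

    def on_heartbeat(datanode):
        last_heartbeat[datanode] = time.time()

    def dead_nodes():
        # data nodes whose heartbeat has not been seen within the timeout
        now = time.time()
        return {dn for dn, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT}

    on_heartbeat("dn1")
    print(dead_nodes())  # empty while dn1 keeps reporting in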

How is data placed in the different replicas?
The cluster is divided into racks, and each rack has multiple data nodes. When three copies of the data are created, Hadoop's default placement policy keeps one copy on the writer's own node and places the other two on two different nodes of another rack. This avoids data loss in case an entire rack fails.
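Here is a sketch of that default placement policy (the rack and node names are invented for illustration):

    import random

    racks = {
        "rack1": ["dn1", "dn2"],
        "rack2": ["dn3", "dn4"],
        "rack3": ["dn5", "dn6"],
    }

    def place_replicas(writer_node="dn1"):
        # 1st copy on the writer's own node, 2nd and 3rd on two
        # different nodes of one other rack
        local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
        remote_rack = random.choice([r for r in racks if r != local_rack])
        second, third = random.sample(racks[remote_rack], 2)
        return [writer_node, second, third]

    print(place_replicas())  # e.g. ['dn1', 'dn4', 'dn3']: two racks, three nodes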
