
Showing posts from 2017

Yarn Apache Hadoop

Yarn (MapReduce 2.0 or MRv2), an upgraded version of MapReduce, is a resource negotiator and cluster management technology. Yarn is set up by splitting the functionalities of resource management and job scheduling/monitoring into separate daemons. Resource Manager :- There is a single RM per cluster, which manages the resources across the cluster. Node Manager :- The Node Manager runs on all the nodes of the cluster; its main task is to launch and monitor the containers. The RM has two main components, the Scheduler and the ApplicationsManager. The main task of the RM Scheduler is to allocate resources to the jobs/applications submitted to the cluster. The RM Scheduler is termed a pure scheduler because it only performs the resource-allocation task and does not do any monitoring or tracking of the jobs/applications running on the cluster. The Scheduler does not offer any guarantee in restar...
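
To make the RM/NM split concrete, a couple of standard YARN CLI commands can be used to inspect a running cluster (a minimal sketch; the output depends on your installation, and the application ID below is a placeholder):

    yarn node -list                              # Node Managers currently registered with the Resource Manager
    yarn application -list                       # applications submitted to the cluster and their current state
    yarn application -status <application_id>    # detailed status of one application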

DistCp2

In this post, we are going to cover the remaining part of the Hadoop DistCp command. DistCp has one disadvantage: it has no option to merge data. The three modes only offer the option of either copying the part that is missing or overwriting the whole data. The updated version of the DistCp command has an -append option which can be used together with -update, so that data is appended while performing the update operation. A skip-check option can be used with Hadoop DistCp to skip the file-size check. There are a few limitations with the Hadoop DistCp command; these are as below. When copying data from multiple sources, the DistCp command will fail with an error if two sources collide, but we can avoid this scenario at the destination level by using certain options. By default, files that already exist at the destination are skipped, not copied. ...
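
As an illustration of the update/append behaviour described above (a sketch; the NameNode hosts and paths are placeholders):

    hadoop distcp -update -append hdfs://nn1:8020/data/logs hdfs://nn2:8020/data/logs
        # -update copies only files that are missing or differ at the destination;
        # -append reuses an existing destination file and appends the new bytes instead of rewriting it
    hadoop distcp -update -skipcrccheck hdfs://nn1:8020/data/logs hdfs://nn2:8020/data/logs
        # -skipcrccheck compares only file lengths during -update, skipping the CRC comparison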

DistCP

In this post, we are going to learn about DistCp in Hadoop and various aspects of DistCp. What is DistCp: DistCp (Distributed Copy) is a tool used for copying a large set of data for inter/intra-cluster copying. It uses MapReduce for distribution, error handling, recovery, and reporting. It expands the list of files into input for map tasks, each of which copies a partition of the files from the source to the destination cluster. How DistCp works: The DistCp command compares the file size at the source and the destination. If the destination file size is different from the source file size, it copies the file; if the file size is the same, it skips copying the file. Why prefer DistCp over the cp, get and put commands: DistCp is the fastest way to copy the data compared to the get/put commands, as it uses map tasks for the processing. Also, we cannot use put/get...
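
A minimal sketch of a plain inter-cluster copy with DistCp (the NameNode hosts and paths are placeholders):

    hadoop distcp hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data
        # expands /user/data into a file list and copies it in parallel with map tasks
    hadoop distcp -m 20 -overwrite hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data
        # -m caps the number of map tasks; -overwrite copies every file regardless of the size comparison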

HDFS Part -3

In this post, we are going to learn about some of the basic HDFS commands. We can use most of the Unix commands in Hadoop also, but we need to use the prefix hadoop fs or hdfs dfs before issuing the commands for HDFS. Consider an example: we wish to list all the files on a path. We can use the ls/ll/ls -ltr commands in Unix for this purpose, but in Hadoop we can use only the ls command with the prefix hadoop fs, so the HDFS command will look like hadoop fs -ls. We need to put '-' before all Hadoop command options. Syntax:- hadoop fs <arg> Below are a few of the HDFS commands you can try, with their significance:

Arg      Usage
-ls      List all the directories available at the path.
-put     Copy a file from local to HDFS.
-mkdir   Create a folder.
-put -f  Copy files from local to HDFS and overwrite the previous file.
-get     Copy files from HDFS to local.
-rm -r   ...
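
A few concrete invocations of the commands listed above (a sketch; the local file and HDFS paths are placeholders):

    hadoop fs -mkdir /user/demo/input             # create a folder in HDFS
    hadoop fs -put data.txt /user/demo/input      # copy a local file into HDFS
    hadoop fs -put -f data.txt /user/demo/input   # copy again, overwriting the existing file
    hadoop fs -ls /user/demo/input                # list the contents of the HDFS directory
    hadoop fs -get /user/demo/input/data.txt .    # copy the file from HDFS back to the local directory
    hadoop fs -rm -r /user/demo/input             # remove the HDFS directory recursively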