Posts

Showing posts from January, 2017

DistCP

In this post, we are going to learn about DistCp in Hadoop and its various aspects.

What is DistCp

DistCp (Distributed Copy) is a tool for copying large data sets within a cluster or between clusters. It uses MapReduce for distribution, error handling, recovery, and reporting. It expands the list of files into input for map tasks, each of which copies a partition of the files from the source to the destination cluster.

How DistCp Works

When run with the -update option, the DistCp command compares the file size at the source and the destination. If the sizes differ, it copies the file; if they are the same, it skips the file.

Why prefer DistCp over cp, get and put commands

DistCp copies data faster than the get/put commands because it runs the copy as parallel map tasks. Also, we cannot use put/get...
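The size check described above can be illustrated with a small local-filesystem analogy. This is only a sketch of the copy-or-skip rule, not DistCp itself: the helper names file_size and copy_if_size_differs are hypothetical, and it works on ordinary local files rather than HDFS paths.

```shell
# Local analogy of the size comparison: copy when sizes differ, skip when equal.
# file_size and copy_if_size_differs are hypothetical helper names.

file_size() {
  # Print the size of the file in bytes (works with GNU and BSD tools).
  wc -c < "$1" | tr -d ' '
}

copy_if_size_differs() {
  # $1 = source file, $2 = destination file
  if [ -f "$2" ] && [ "$(file_size "$1")" -eq "$(file_size "$2")" ]; then
    # Destination exists and has the same size: nothing to do.
    echo "skip $1"
  else
    # Destination is missing or its size differs: copy the file.
    cp "$1" "$2"
    echo "copy $1"
  fi
}
```

On a real cluster the corresponding invocation would be along the lines of hadoop distcp -update <source> <destination>, where both arguments are HDFS URIs for your own clusters.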

HDFS Part -3

In this post, we are going to learn about some of the basic HDFS commands. Most Unix commands are also available in Hadoop, but we need to prefix them with hadoop fs or hdfs dfs when running them against HDFS. For example, to list all the files on a path in Unix we can use ls, ll, or ls -ltr, but in Hadoop we can use only the ls command with the prefix hadoop fs, so the HDFS command looks like hadoop fs -ls. Every Hadoop command argument must be preceded by '-'.

Syntax: hadoop fs <arg>

Below are a few HDFS commands you can try, with their meaning:

Arg        Usage
-ls        List the files and directories available at the path.
-put       Copy a file from Local to HDFS.
-mkdir     Create a folder.
-put -f    Copy files from Local to HDFS and overwrite the previous file.
-get       Copy files from HDFS to Local.
-rm -r     ...
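As a rough illustration of the commands listed above, the sketch below strings them together in one session. The paths (/user/demo/input), the file name sales.csv, and the helpers run and hdfs_demo are all hypothetical; the DRY_RUN switch is only there so the commands can be previewed on a machine without Hadoop.

```shell
# Sketch of a short HDFS session using the commands from the list.
# All paths and file names are placeholders; /user/demo is assumed to exist.

run() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

hdfs_demo() {
  run hadoop fs -mkdir /user/demo/input                            # create a folder
  run hadoop fs -put sales.csv /user/demo/input                    # Local -> HDFS
  run hadoop fs -put -f sales.csv /user/demo/input                 # Local -> HDFS, overwrite
  run hadoop fs -ls /user/demo/input                               # list the path
  run hadoop fs -get /user/demo/input/sales.csv ./sales_copy.csv   # HDFS -> Local
}
```

Calling hdfs_demo on a cluster node runs the five commands in order; setting DRY_RUN=1 first prints them instead.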