DistCp2

February 05, 2017

In this post, we are going to cover the remaining parts of the Hadoop DistCp command.

DistCp has one disadvantage: it has no option to merge data. Its three modes of operation either copy only what is missing at the destination or overwrite the existing data.

Newer versions of DistCp add an -append option, which can be used together with -update, but even then it works as part of the update operation rather than as a true merge.

To skip the CRC check between source and destination files, the -skipcrccheck option can be used with Hadoop DistCp.

The Hadoop DistCp command has a few limitations, listed below.

When copying from multiple sources, DistCp aborts the copy with an error if two sources collide, whereas collisions at the destination are resolved according to the options specified. By default, files that already exist at the destination are skipped rather than replaced by the source file.
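The source-collision check can be illustrated with a small Python sketch. It is a simplified model, not DistCp's actual implementation: it assumes two sources collide when they would end up with the same file name under the destination directory. The function name `find_source_collisions` and the example paths are hypothetical.

```python
from collections import defaultdict
from posixpath import basename


def find_source_collisions(sources):
    """Group source paths by the name they would take at the destination.

    Simplified assumption: each source lands at <destination>/<basename>,
    so two sources that share a basename collide.
    """
    by_name = defaultdict(list)
    for src in sources:
        by_name[basename(src.rstrip("/"))].append(src)
    # Keep only the names claimed by more than one source.
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    # Hypothetical source paths; the first two share the name "part-0000".
    sources = [
        "hdfs://nn1:8020/data/a/part-0000",
        "hdfs://nn1:8020/data/b/part-0000",
        "hdfs://nn1:8020/data/b/part-0001",
    ]
    print(find_source_collisions(sources))
```

Running a check like this before submitting the job is one way to spot clashing source names early instead of waiting for DistCp to abort.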

If a map fails, DistCp has the following side effects:

Unless -i is specified, the logs generated by that task attempt will be replaced by the previous attempt.

Unless -overwrite is specified, files successfully copied by a previous map on a re-execution will be marked as “skipped”.

If a map fails mapred.map.max.attempts times, the remaining map tasks will be killed (unless -i is set).

If mapred.speculative.execution is set final and true, the result of the copy is undefined.

The options that can be used with the DistCp command are as follows:

Option	Description	Comments
-p[rbugp]	Preserve r: replication number, b: block size, u: user, g: group, p: permission	Modification times are not preserved. When -update is used, status updates are not synchronized unless the file sizes also differ.
-i	Ignore failures	Keeps the logs from failed copies, which helps with debugging. A failing map will not cause the job to fail before all the splits have been attempted.
-log <logpath>	Write logs to <logpath>	DistCp keeps a log of each file it attempts to copy as map output. If a map fails, its log output is not retained when the map is re-executed.
-m <num_maps>	Maximum number of simultaneous copies	Specifies the number of maps used to copy data. More maps do not necessarily improve throughput.
-overwrite	Overwrite destination	If -i is not specified and a map fails, all the files in the split, not only those that failed, will be recopied. It also changes the semantics for generating destination paths, so use it carefully. Include the target file or directory name in the destination path when overwriting; otherwise the part files will be written outside the location you intended.
-update	Overwrite if source and destination sizes differ	The only criterion checked is file size: if the source and destination sizes differ, the source file replaces the destination file. Files missing at the destination are copied. Include the target file or directory name in the destination path; otherwise the part files will be written outside the location you intended.
-f <urilist>	Use list at <urilist> as source list	This is equivalent to listing each source on the command line.
-filelimit <n>	Limit the total number of files to <= n
-sizelimit <n>	Limit the total size to <= n bytes
-delete	Delete files that exist at the destination but not at the source	Only destination files are deleted; the source is left untouched. If trash is enabled, it can be used to recover the deleted files.
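To show how these options fit together on the command line, here is a minimal Python sketch that assembles a `hadoop distcp` argument list. The helper name `build_distcp_command`, its parameters, and the example URIs are assumptions made for illustration. It only uses flags listed in the table above and does not run Hadoop itself.

```python
def build_distcp_command(sources, destination, *, mode="direct",
                         preserve="", ignore_failures=False,
                         log_path=None, max_maps=None, delete=False):
    """Return the argument list for a `hadoop distcp` invocation.

    mode is "direct" (no extra flag), "update" or "overwrite".
    preserve is any combination of the letters r, b, u, g, p.
    """
    if mode not in ("direct", "update", "overwrite"):
        raise ValueError(f"unknown mode: {mode}")
    if not sources:
        raise ValueError("at least one source is required")
    invalid = set(preserve) - set("rbugp")
    if invalid:
        raise ValueError(f"unknown preserve flags: {sorted(invalid)}")

    cmd = ["hadoop", "distcp"]
    if preserve:
        cmd.append("-p" + preserve)       # e.g. -pug
    if ignore_failures:
        cmd.append("-i")
    if log_path:
        cmd += ["-log", log_path]
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]
    if mode != "direct":
        cmd.append("-" + mode)            # -update or -overwrite
    if delete:
        cmd.append("-delete")
    # Sources come first, the destination is always the last argument.
    cmd += list(sources) + [destination]
    return cmd


if __name__ == "__main__":
    # Hypothetical NameNode addresses and paths.
    cmd = build_distcp_command(
        ["hdfs://nn1:8020/src"], "hdfs://nn2:8020/dst",
        mode="update", preserve="ug", max_maps=20, ignore_failures=True,
    )
    print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine with a Hadoop client installed. Building it as a list rather than a single string avoids shell quoting problems with paths.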

To summarise, Hadoop DistCp is a powerful tool for copying data from one HDFS location to another, either between clusters or within the same cluster. DistCp can also be used to copy data between versions of Cloudera CDH (e.g. CDH-4 to CDH-5). Three types of DistCp can be performed: direct, update or overwrite. The direct method is used to copy data that is not yet available at the destination location. Update and overwrite can be used if data is already present at the destination location. Update compares the file size at the source and the destination: if the sizes differ, the source file replaces the destination file, and files missing at the destination are copied. Overwrite replaces the destination files with the source files regardless of their size.
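The difference between the three modes can be sketched as a small decision function in Python. This is a simplified model that follows the DistCp documentation, where -update compares only file sizes and replaces the destination file when they differ. It ignores the checksum comparison that newer DistCp versions can also perform, and the name `plan_copy` and its return values are hypothetical.

```python
def plan_copy(src_size, dst_size, mode="direct"):
    """Decide what happens to one file under the given DistCp mode.

    src_size: size of the source file in bytes.
    dst_size: size of the destination file in bytes, or None if the
              file does not exist at the destination.
    Returns "copy", "skip" or "overwrite".
    """
    if mode not in ("direct", "update", "overwrite"):
        raise ValueError(f"unknown mode: {mode}")
    if dst_size is None:
        # Missing at the destination: every mode copies the file.
        return "copy"
    if mode == "overwrite":
        # -overwrite replaces the destination unconditionally.
        return "overwrite"
    if mode == "update" and src_size != dst_size:
        # -update replaces the destination only when the sizes differ.
        return "overwrite"
    # Default behaviour: existing destination files are skipped.
    return "skip"


if __name__ == "__main__":
    for mode in ("direct", "update", "overwrite"):
        print(mode, plan_copy(100, 80, mode), plan_copy(100, 100, mode))
```

Note that under this model a file whose content changed but whose size stayed the same is skipped by an update, which is why update is not a true sync operation.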

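When copying between different Hadoop versions (for example over the read-only HFTP file system), the DistCp command must be run from the destination cluster.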

In the next post, we will start on the processing tools of the Hadoop ecosystem.

Kindly share your feedback and comments to help make this blog more engaging and Hadoop easier to learn for its readers.

Abhinav's Blog
