DistCp

In this post, we are going to learn about DistCp in Hadoop and its various aspects.
  • What is DistCp
                    DistCp (Distributed Copy) is a tool for copying large data sets within a cluster (intra-cluster) or between clusters (inter-cluster). It uses MapReduce for distribution, error handling, recovery, and reporting. It expands the list of files and directories into input for map tasks, each of which copies a partition of the files from the source to the destination cluster.
  • How DistCp Works
                The DistCp command compares the file size at the source and destination. If the destination file size differs from the source file size, it copies the file; if the sizes are the same, it skips the copy.
  • Why Prefer DistCp over the cp, get and put Commands
                  DistCp is the fastest way to copy data compared to the get/put commands, because it distributes the work across parallel map tasks. Also, put/get cannot copy files directly between two clusters; they only move data between the local filesystem and HDFS, so an inter-cluster copy with them needs an extra staging step.
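As a rough sketch of the difference (the hostnames and paths below are made up for illustration), a get/put copy between clusters must stage the data on one machine's local disk, while DistCp runs the copy as parallel map tasks directly between the clusters:

```shell
# Two-step copy via the local filesystem: slow, single-machine, needs local disk space.
hdfs dfs -get hdfs://source-nn:8020/data/events.log /tmp/events.log
hdfs dfs -put /tmp/events.log hdfs://dest-nn:8020/data/

# One-step distributed copy: parallel map tasks, no local staging.
hadoop distcp hdfs://source-nn:8020/data/events.log hdfs://dest-nn:8020/data
```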
  • Where We Can Run the DistCp Command
                  We can run the distcp command on both secure (Kerberos-enabled) and insecure clusters. To run DistCp on a secure cluster, we first need to run the kinit command, which obtains the Kerberos tickets for the user.
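A minimal sketch of a DistCp run on a Kerberized cluster (the principal, keytab path, and NameNode addresses below are hypothetical):

```shell
# Obtain a Kerberos ticket, here non-interactively from a keytab:
kinit -kt /etc/security/keytabs/etluser.keytab etluser@EXAMPLE.COM

# Optionally verify the ticket cache:
klist

# With a valid ticket, DistCp can authenticate to the secure cluster:
hadoop distcp hdfs://source-nn:8020/data hdfs://dest-nn:8020/data
```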
  • How To Run DistCp Command
    There are different ways to run DistCp, depending on whether the two clusters run the same or different Hadoop versions.
    Same Cluster Version :-
    1) Without Any Option
                  hadoop distcp hdfs://<Source_Name_Node>:8020/<File_Path>/<File_Name> hdfs://<Destination_Name_Node>:8020/<File_Path>
          2) Update Option

                          hadoop distcp -update hdfs://<Source_Name_Node>:8020/<File_Path>/<File_Name> hdfs://<Destination_Name_Node>:8020/<File_Path>/<File_name>

          3) Overwrite Option

                           hadoop distcp -overwrite hdfs://<Source_Name_Node>:8020/<File_Path>/<File_Name> hdfs://<Destination_Name_Node>:8020/<File_Path>/<File_Name>

Source_Name_Node :- Address of the NameNode of the cluster from which the data needs to be copied.

Destination_Name_Node :- Address of the NameNode of the cluster to which the data needs to be copied.

File_Path :- Path where the file resides on the source cluster / path to which the file needs to be copied on the destination cluster.
File_Name :- Name of the file which needs to be copied.

         We need to pass the full address of the NameNode. In the case of HA clusters where a nameservice is configured, using the nameservice ID in place of the NameNode address makes DistCp throw an error, because the remote nameservice is not resolvable unless it is also defined in the local cluster's configuration.

              In the first method, we do not need to pass the file name in the destination path; if we do, DistCp will create a subfolder with that file name and copy the partitioned files inside that folder.
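To illustrate the note above (the hostnames and paths are hypothetical), compare the two destination forms:

```shell
# Destination given as a directory: the file is copied directly into it.
hadoop distcp hdfs://source-nn:8020/data/events.log hdfs://dest-nn:8020/backup

# File name appended to the destination: per the note above, DistCp may
# create a subfolder named events.log and place the copied file(s) inside it.
hadoop distcp hdfs://source-nn:8020/data/events.log hdfs://dest-nn:8020/backup/events.log
```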

  • What Do the Update and Overwrite Options Do
               Update :- If the -update option is used in the DistCp command, it compares the file name, file size, and contents (checksum) of the files at the source and destination. It copies only the files that differ and skips the files that are already identical.
                Overwrite :- If the -overwrite option is used in the DistCp command, it overwrites the files at the destination unconditionally; no size or content comparison is performed before the copy.


  • How To Run DistCp Between Different Cluster Versions

Here the two clusters run different Hadoop versions. Consider Cloudera: if data needs to be moved from one CDH version (e.g. CDH4) to another (e.g. CDH5), we need to use the HFTP protocol.
The command must be used in the manner below.

      hadoop distcp hftp://cdh4-<namenode>:50070/ hdfs://CDH5-<namenode>/

 hadoop distcp hftp://cdh4-<namenode>:50070/<File_Path> hdfs://CDH5-<namenode>/<File_Path>

The HFTP protocol exposes the HDFS filesystem read-only over HTTP. When copying data using distcp across different versions of CDH, use hftp:// for the source file system and hdfs:// for the destination file system, and run distcp from the destination cluster. The default port for HFTP is 50070 and the default port for HDFS is 8020.


HFTP is a read-only protocol, so it can be used only for the source cluster, not for the destination cluster. HFTP also cannot be used to copy data from an insecure to a secure cluster.
