Posts

Yarn Apache Hadoop

YARN (MapReduce 2.0 or MRv2) is a resource negotiator and cluster management technology, an upgraded version of the original MapReduce framework. YARN splits the functionalities of resource management and job scheduling/monitoring into separate daemons. Resource Manager: - there is a single RM per cluster, and it manages the resources across the cluster. Node Manager: - the Node Manager runs on all the nodes of the cluster; its main task is to launch and monitor the containers. The RM has two main components, the Scheduler and the Applications Manager. The main task of the RM Scheduler is to allocate resources to the jobs/applications submitted to the cluster. The RM Scheduler is termed a pure scheduler, as it only performs the resource-allocation task and does not perform any monitoring or tracking of the jobs/applications running on the cluster. The scheduler does not offer any guarantee in restar...
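As a quick illustration of this split, the YARN command-line client can be used to query the Resource Manager's view of the cluster. A minimal sketch, assuming a running cluster (the application ID shown is a hypothetical placeholder):

  # List the nodes registered with the Resource Manager (each node runs a Node Manager)
  yarn node -list

  # List applications currently submitted to the Resource Manager
  yarn application -list

  # Check the status of a single application (hypothetical application ID)
  yarn application -status application_1234567890123_0001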

DistCp2

In this post, we are going to cover the remaining part of the Hadoop DistCp command. DistCp has one disadvantage: it has no option to merge data. The three modes it offers either copy only the part that is missing or overwrite the whole data. The updated version of DistCp adds an -append option that is used together with -update; it still performs an update operation, but it appends the new data to files that already exist at the destination. A skip-check option can also be used with Hadoop DistCp to skip the file check. There are a few limitations with the Hadoop DistCp command, listed below. When copying data from multiple sources, the DistCp command will fail with an error if two sources collide, but we can avoid this scenario at the destination level by using certain options. By default, files that already exist at the destination are skipped rather than copied. ...
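A minimal sketch of how these options might be combined; the namenode addresses and paths below are placeholders:

  # Copy only files that are missing or that differ at the destination
  hadoop distcp -update hdfs://nn1:8020/source hdfs://nn2:8020/dest

  # Overwrite the data at the destination unconditionally
  hadoop distcp -overwrite hdfs://nn1:8020/source hdfs://nn2:8020/dest

  # With -update, reuse existing destination files and append only the new data
  hadoop distcp -update -append hdfs://nn1:8020/source hdfs://nn2:8020/dest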

DistCP

In this post, we are going to learn about DistCp in Hadoop and its various aspects. What is DistCp: DistCp (Distributed Copy) is a tool used for copying large sets of data, for both inter-cluster and intra-cluster copying. It uses MapReduce for distribution, error handling, recovery, and reporting. It expands the list of files into input for map tasks, each of which copies a partition of the files from the source cluster to the destination cluster. How DistCp works: the DistCp command compares the file size at the source and the destination. If the destination file size differs from the source file size, it copies the file; if the file size is the same, it skips copying the file. Why prefer DistCp over the cp, get, and put commands: DistCp is the fastest way to copy data compared to the get/put commands, as it uses map tasks for the processing. Also, we cannot use put/get...
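A minimal usage sketch, with placeholder namenode addresses and paths:

  # Copy a directory from one cluster to another; the copy itself runs as a MapReduce job
  hadoop distcp hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data

  # Multiple sources can be listed before the single destination
  hadoop distcp hdfs://nn1:8020/a hdfs://nn1:8020/b hdfs://nn2:8020/dest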

HDFS Part -3

In this post, we are going to learn about some of the basic HDFS commands. We can use most of the Unix commands in Hadoop as well, but we need to use the prefix hadoop fs or hdfs dfs before running commands against HDFS. Consider an example: if we wish to list all the files on a path, we can use the ls/ll/ls -ltr commands in Unix, but in Hadoop we can only use the ls command with the prefix hadoop fs. The HDFS command will look like hadoop fs -ls. We need to put a '-' before all Hadoop command arguments. Syntax:- hadoop fs <arg> Below are a few HDFS commands you can try, with their significance:
Arg       Usage
-ls       List all the directories available at the path.
-put      Copy a file from local to HDFS.
-mkdir    Create a folder.
-put -f   Copy files from local to HDFS and overwrite the previous file.
-get      Copy files from HDFS to local.
-rm -r    ...
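A minimal sketch of these commands in use, with hypothetical local and HDFS paths:

  hadoop fs -mkdir /user/demo                 # create a folder on HDFS
  hadoop fs -put sample.txt /user/demo/       # copy a local file to HDFS
  hadoop fs -put -f sample.txt /user/demo/    # copy again, overwriting the previous file
  hadoop fs -ls /user/demo                    # list the contents of the HDFS path
  hadoop fs -get /user/demo/sample.txt .      # copy the file back from HDFS to local
  hadoop fs -rm -r /user/demo                 # remove the folder and its contents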

HDFS Part -2

In this post, we will learn how HDFS works, and in the following posts we will learn some basic Hadoop commands. As we learned in the previous post, HDFS has three components: the name node, the data nodes, and the secondary name node. The actual work is done by the data nodes, which are larger in number, while the name node keeps the addresses of the data nodes and a list of their tasks. Data is stored in HDFS in chunks called blocks. The default size of a block is 128 MB. Suppose we have a file of 1 GB and we want to copy that file to HDFS. The file will be divided into blocks of 128 MB each, so the data will be stored in 8 blocks (1024 MB / 128 MB = 8 blocks). Now suppose we have a file of 1100 MB; how will the data be stored? The data will be stored in 9 blocks (8 blocks x 128 MB = 1024 MB, and the remaining 76 MB will be stored in the 9th block). What will happen to the remaini...
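To check how a file was actually split into blocks, the HDFS fsck tool can be used. A minimal sketch, assuming a hypothetical 1100 MB file already copied to /user/demo/bigfile.dat:

  # Show the file, its block count, and the locations of each block
  hdfs fsck /user/demo/bigfile.dat -files -blocks -locations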

HDFS Part -1

From now onward we will start learning about the various ecosystems of Hadoop. The focus will be on the theory as well as the implementations. This blog will provide you with all the basic details of Hadoop in an easy-to-learn way. HDFS stands for Hadoop Distributed File System; it is a distributed file system that runs on commodity hardware. It is fault tolerant and designed to be deployed on low-cost hardware. HDFS is suitable for applications having large data sets. In other words, we can consider HDFS the storage space for Hadoop-related data. HDFS runs on a master-slave architecture and mainly consists of three components:- Name Node -- acts as the master and keeps track of the cluster's storage. Data Node -- acts as a slave, stores the actual data blocks, and does the work of reading and writing data. Secondary Name Node -- acts as a helper node that periodically checkpoints the name node's metadata. P...
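A quick way to see the name node's view of its data nodes is the dfsadmin report. A minimal sketch, assuming you have admin access to a running cluster:

  # Summary of configured and used capacity, plus the list of live and dead data nodes
  hdfs dfsadmin -report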

Big Data Intro Part-2

In the previous post, we learned about the basics of big data and its 3 V's. In this post, we are going to learn about the ecosystems of Hadoop, which we will discuss in depth in later posts. Below is a list of a few Hadoop ecosystem components:- HDFS is the file system of the Hadoop ecosystem, used to keep all the data. MapReduce is used to process and generate large data sets with a parallel, distributed algorithm. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Impala is an open-source parallel processing engine that performs quick, low-latency analysis on data. Pig is a platform for analyzing large data sets, consisting of a high-level language. Sqoop is a tool designed to transfer bulk amounts of data between HDFS and relational databases. Oozie is a scheduler used to manage Hadoop jobs. Scala is a scalable language. Scala code is compiled t...