Posts

Showing posts from 2016

HDFS Part-2

In this post, we will learn how HDFS works; in the following posts we will learn some basic Hadoop commands. As we learned in the previous post, HDFS has three components: the name node, the data nodes, and the secondary name node. Most of the actual work is done by the data nodes, which are greater in number, while the name node keeps track of which data nodes hold which blocks and what tasks they are assigned. Data is stored in HDFS in chunks called blocks. The default block size is 128 MB. Suppose we have a file of 1 GB and we want to copy that file to HDFS. The file will be divided into blocks of 128 MB each, so the data will be stored in 8 blocks (1024 MB / 128 MB = 8 blocks). Now suppose we have a file of 1100 MB; how will the data be stored? It will be stored in 9 blocks (8 blocks x 128 MB = 1024 MB; the remaining 76 MB will be stored in the 9th block). What will happen to remaini...
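As a quick illustration of the arithmetic above, here is a minimal sketch (not code from the post) that computes how many blocks a file of a given size occupies, assuming the 128 MB default block size; the file sizes are the two examples used here and the constant would really come from the dfs.blocksize setting.

// Minimal sketch: how many HDFS blocks a file occupies, assuming a 128 MB block size.
public class BlockCount {
    static final long BLOCK_SIZE_MB = 128; // default in Hadoop 2.x; configurable via dfs.blocksize

    public static void main(String[] args) {
        long[] fileSizesMb = {1024, 1100};  // the 1 GB and 1100 MB examples from the post
        for (long sizeMb : fileSizesMb) {
            long fullBlocks = sizeMb / BLOCK_SIZE_MB;   // blocks filled completely
            long remainderMb = sizeMb % BLOCK_SIZE_MB;  // leftover data for a final, partial block
            long totalBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0);
            System.out.printf("%d MB -> %d block(s), last block holds %d MB%n",
                    sizeMb, totalBlocks, remainderMb == 0 ? BLOCK_SIZE_MB : remainderMb);
        }
    }
}

Running this prints 8 blocks for the 1024 MB file and 9 blocks for the 1100 MB file, with the last block holding only 76 MB.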

HDFS Part-1

From now onward we will start learning about the various ecosystems of Hadoop. The focus will be on the theory as well as the implementations. This blog will give you all the basic details of Hadoop with easy learning. HDFS stands for Hadoop Distributed File System; it is a distributed file system that runs on commodity hardware. It is fault tolerant and designed to be deployed on low-cost hardware. HDFS is suitable for applications having large data sets. In other words, we can consider HDFS as the storage space for Hadoop-related data. HDFS runs on a master-slave architecture and mainly consists of three components:
Name Node -- acts as the master and keeps track of the cluster's storage (the file system metadata).
Data Node -- acts as a slave; stores the actual data blocks and serves read/write requests.
Secondary Name Node -- periodically checkpoints the name node's metadata; it is often described as a backup, but it does not automatically take over if the active name node goes down.
P...
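To make the master-slave picture concrete, here is a minimal sketch (not code from the post) of a client talking to HDFS through the Hadoop FileSystem API: metadata requests such as listing a directory go to the name node, while the actual bytes are served by the data nodes. It assumes the Hadoop client libraries are on the classpath and that fs.defaultFS in the configuration points at your cluster.

// Minimal sketch: list the HDFS root directory through the FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // client handle; metadata calls go to the name node
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}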

Big Data Intro Part-2

In the previous post, we learned about the basics of big data and its 3 V's. In this post, we are going to learn about the ecosystems of Hadoop, which we will discuss in depth in later posts. Below is a list of a few Hadoop ecosystems:
HDFS is the file system of the Hadoop ecosystem, used to keep all the data.
Map Reduce is used to process and generate large sets of data with a parallel, distributed algorithm (a small sketch follows below).
Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Impala is an open-source parallel processing query engine for low-latency analysis of data.
Pig is a platform for analyzing large data sets using a high-level language.
Sqoop is a tool designed to transfer bulk data between HDFS and relational databases.
Oozie is a scheduler used to manage Hadoop jobs.
Scala is a scalable language. Scala code is compiled t...
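As a small illustration of the Map Reduce model mentioned in the list above, here is a minimal word-count sketch (the classic example, not code from this blog): the map step emits (word, 1) pairs in parallel over the input splits, and the reduce step sums the counts per word. The class names are placeholders, the input and output paths come from the command line, and it assumes the Hadoop MapReduce client libraries are available.

// Minimal word-count sketch for the Map Reduce programming model.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);             // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}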

Big Data Intro

Big Data is a term used for processing large or complex sets of data. Data is growing at a very fast rate, and keeping track of such large data sets has become a tedious task. With the traditional approach, it was not feasible to process unstructured data. Big data came into the picture to handle this huge amount of data and to process both structured and unstructured data. The three V's included in the big data architecture are:
Volume: the quantity of generated and stored data.
Variety: the type and nature of the data.
Velocity: the speed at which data is generated and processed to meet demands.

Hadoop Learning

This blog is for those who want to learn the concepts of big data and Hadoop. Big Data is one of the emerging technologies and is very interesting to learn. There is a lot of scope in the field of Big Data, and it will grow as time progresses. It is a challenging as well as interesting field. I will take you from the very basics to the advanced concepts of Hadoop and its ecosystems, which will help you grow in the field of Big Data. Let's walk through Hadoop. Hadoop is open-source software for distributed storage and processing of very large data sets on clusters built from commodity hardware. Commodity hardware refers to low-cost computer machines that are easily available. Hadoop basically works by splitting the data into multiple small data sets and processing them in parallel. There are different ecosystems of Hadoop such as HDFS, Map Reduce (MR), Hive, Impala, Oozie, Kafka, Flume, Spark, Pig, etc. We will lear...