Thursday 25 August 2016

Hadoop


The 3Vs (volume, velocity and variety) are the three major dimensions of big data. Volume refers to the amount of data, variety to the different types of data, and velocity to the speed at which data is generated and processed.
Hadoop subprojects
Pig, Hive, HBase, MapReduce, HDFS
What is HDFS?
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to it. Files are stored redundantly across multiple machines to ensure durability against failure and high availability to highly parallel applications.
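
As a minimal sketch of how a client talks to HDFS, a file can be written through the Java FileSystem API. The path and contents below are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path; HDFS splits the file into blocks and replicates them across datanodes
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}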
Rack

A rack is a physical collection of datanodes stored together at a single location. There can be multiple racks in a single location (data center), and the racks themselves can be physically located at different places.


Replica Placement Policy

When a client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode, which returns 3 datanodes for every block of the file, indicating where the block should be stored. While placing the replicas, the key rule followed is “for every block of data, two copies will exist in one rack and the third copy in a different rack“. This rule is known as the “Replica Placement Policy“.

If rack 2 goes down and the datanode holding the copies in rack 1 also fails, there is no way to recover the data. To avoid such situations, the data should be replicated more times instead of only thrice. This is done by changing the replication factor, which is set to 3 by default.
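
For example, the default replication factor can be raised in hdfs-site.xml (dfs.replication is the standard HDFS property; the value 4 here is only an illustration):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value> <!-- default is 3; increase for extra fault tolerance -->
  </property>
</configuration>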


Secondary Namenode
The Secondary Namenode periodically reads the file system metadata (the fsimage and edit log) from the Namenode, merges them, and writes the checkpoint back to the hard disk. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down. This makes the Namenode the Hadoop Single Point Of Failure (SPOF).



MapReduce
MapReduce has two parts – ‘map’ and ‘reduce’. Map and reduce are programs for processing data: ‘map’ processes the data first to produce intermediate output, which is then processed by ‘reduce’ to generate the final output.
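
A minimal sketch of a map and a reduce program (the classic word-count example, using the org.apache.hadoop.mapreduce API; class names are illustrative, and each class goes in its own .java file):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emits (word, 1) for every word in a line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: sums the counts emitted for each word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}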

The framework takes the input, divides it into input splits, and assigns a map task for each split to the datanodes. These datanodes process the tasks assigned to them, produce key-value pairs, and return this intermediate output to the reducer. The reducer collects the key-value pairs from all the datanodes, combines them, and generates the final output.
The number of maps is equal to the number of input splits, because every input split must be processed into key-value pairs.

A split is created for the file. The file itself is placed on the datanodes in blocks, and for each split one map task is needed.
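
A sketch of the driver that wires these pieces together (it assumes the WordCountMapper and WordCountReducer classes above; the input and output paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // One map task is launched per input split of this path
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}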

Hadoop configuration files

1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
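
As an illustration (the values are environment-specific assumptions), core-site.xml typically holds the default file system URI and mapred-site.xml selects the MapReduce framework, while the dfs.replication setting shown earlier belongs in hdfs-site.xml:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>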

Input Formats
The following three are the most common InputFormats defined in Hadoop:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.


TextInputFormat: Reads lines of text files and provides the byte offset of each line as the key and the line itself as the value to the Mapper.

KeyValueInputFormat: Reads text files and parses each line into a key-value pair. Everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line is sent as the value.
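
The input format is chosen in the job driver. A small sketch, assuming the new mapreduce API, where the key-value variant is exposed as the KeyValueTextInputFormat class (these lines would go inside a driver such as the WordCountDriver above):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// TextInputFormat is the default, so this line is optional
job.setInputFormatClass(TextInputFormat.class);

// Switch to tab-separated (key, value) lines instead
job.setInputFormatClass(KeyValueTextInputFormat.class);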

RecordReader
The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
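
As a sketch of that relationship (the class name is hypothetical), an InputFormat hands out its RecordReader through createRecordReader; TextInputFormat, for instance, returns a LineRecordReader that produces (byte offset, line) pairs:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // The RecordReader turns the raw bytes of the split into (key, value) pairs for the Mapper
        return new LineRecordReader();
    }
}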

After the Map phase finishes, the Hadoop framework performs partitioning, shuffle and sort.
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for all of its output (key, value) pairs, which reducer will receive them (see the Partitioner sketch after this list).
- Shuffle
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin transferring the intermediate outputs of the completed map tasks to the reducers that require them. This process of moving map outputs to the reducers is known as shuffling.
- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the Reducer.
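
A sketch of a custom Partitioner (the class name and routing rule are illustrative; by default Hadoop uses HashPartitioner, which hashes the key modulo the number of reducers):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordLengthPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        // Illustrative rule: route keys to reducers by key length
        return key.getLength() % numReduceTasks;
    }
}

// In the driver: job.setPartitionerClass(WordLengthPartitioner.class);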
What is a Combiner?
The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
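
Because a combiner is just a reduce run locally on each mapper's output, the word-count reducer sketched above can usually double as the combiner (this only works when the reduce operation is commutative and associative, as summing is). In the driver:

// e.g., in the WordCountDriver above
job.setCombinerClass(WordCountReducer.class); // mini-reduce on each mapper's output before the shuffle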
