What are the 3Vs of big data?
The 3Vs (volume, velocity and variety) are the three major dimensions of big data. Volume refers to the amount of data, variety to the types of data, and velocity to the speed at which data is generated and processed.
Hadoop subprojects
Pig, Hive, HBase, MapReduce, HDFS
HDFS?
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure durability in the face of failures and high availability to highly parallel applications.
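To make this concrete, here is a minimal sketch of an application touching HDFS through Hadoop's Java FileSystem API. It assumes a running cluster whose NameNode address is picked up from the configuration on the classpath; the path /user/demo/hello.txt is made up for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);          // the configured default file system (HDFS)

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");            // HDFS splits the file into blocks and replicates them
        }
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());         // read the line back
        }
    }
}

Behind these calls, HDFS stores the file as replicated blocks spread across datanodes, which is what gives the redundancy described above.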
rack
A rack is a storage area in which a group of datanodes is housed together; in other words, it is a physical collection of datanodes stored at a single location. A cluster's datanodes can be physically located at different places, and there can be multiple racks in a single location.
Replica Placement Policy
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode and gets three datanodes for every block of the file, indicating where each block should be stored. While placing the replicas, the key rule followed is: for every block of data, two copies will exist in one rack and the third copy in a different rack. This rule is known as the "Replica Placement Policy".
If rack 2 and the datanodes in rack 1 that hold the replicas all fail, there is no way to retrieve the data. To avoid such situations, the data needs to be replicated more times instead of only thrice. This is done by changing the replication factor, which is set to 3 by default.
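As a rough sketch of the two places the replication factor can be changed (assuming a reachable cluster): the cluster-wide default comes from the dfs.replication property, while FileSystem.setReplication changes it for an existing file. The path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BumpReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 4);             // default for files created with this client
        FileSystem fs = FileSystem.get(conf);
        // hypothetical existing file: keep 4 copies of each of its blocks
        fs.setReplication(new Path("/user/demo/important.dat"), (short) 4);
    }
}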
Secondary Namenode
The Secondary Namenode periodically reads the file system metadata that the Namenode keeps in its RAM and writes it as a checkpoint to the hard disk or the file system. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down. This is called the Hadoop Single Point Of Failure (SPOF).
MapReduce
MapReduce has two parts: 'map' and 'reduce'. Maps and reduces are programs for processing data. 'Map' processes the data first to give some intermediate output, which is further processed by 'reduce' to generate the final output.
The input is divided into splits, and the JobTracker assigns a map task for each split to the datanodes. These datanodes process the tasks assigned to them, produce key-value pairs, and return the intermediate output to the reducer. The reducer collects the key-value pairs from all the datanodes, combines them, and generates the final output.
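A minimal word-count sketch in Java illustrates the two phases: the mapper emits intermediate (word, 1) pairs, and the reducer sums them per word to produce the final output. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, following the standard Hadoop example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // 'Map': turn each line into intermediate (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);               // intermediate (key, value) pair
            }
        }
    }

    // 'Reduce': sum the counts for each word to produce the final output
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum)); // final (word, total) pair
        }
    }
}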
The number of maps is equal to the number of input splits, because we want the key and value pairs of all the input splits. Splits are created for the file; the file is placed on datanodes in blocks, and for each split a map is needed.
Hadoop configuration files
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
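As a small illustration of how these files are consumed, the sketch below loads them through Hadoop's Configuration class and reads two typical properties. The exact property names vary by release (for example fs.default.name in older releases versus fs.defaultFS in newer ones), so treat the keys here as assumptions.

import org.apache.hadoop.conf.Configuration;

public class ShowConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();       // loads core-site.xml from the classpath
        conf.addResource("hdfs-site.xml");              // pick up HDFS overrides if present
        // NameNode URI and block replication factor (3 by default)
        System.out.println(conf.get("fs.default.name", "file:///"));
        System.out.println(conf.getInt("dfs.replication", 3));
    }
}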
Input Formats
The following three are the most common InputFormats defined in Hadoop:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.
TextInputFormat: reads lines of text files and provides the byte offset of each line as the key and the line itself as the value to the Mapper.
KeyValueInputFormat: reads text files and parses each line into a (key, value) pair. Everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line is sent as the value.
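A small driver sketch showing how an InputFormat is selected on a job (new MapReduce API in recent Hadoop releases, where the key/value variant is the KeyValueTextInputFormat class). The input and output paths are hypothetical, and no job is actually submitted here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kv-demo");
        // If nothing is set, TextInputFormat is used and the mapper sees (offset, line) pairs.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
    }
}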
RecordReader
The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
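The contract can be sketched as follows: the InputFormat creates a RecordReader for a split, and the framework pulls (key, value) pairs from it and hands them to the Mapper. The real driving loop lives inside Hadoop's MapTask; this compile-only sketch just shows the calls involved.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RecordReaderSketch {
    static void drive(InputSplit split, TaskAttemptContext context) throws Exception {
        TextInputFormat format = new TextInputFormat();
        RecordReader<LongWritable, Text> reader = format.createRecordReader(split, context);
        reader.initialize(split, context);
        while (reader.nextKeyValue()) {
            LongWritable key = reader.getCurrentKey();   // byte offset of the line
            Text value = reader.getCurrentValue();       // the line itself
            // ...the framework would call mapper.map(key, value, mapperContext) here
        }
        reader.close();
    }
}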
After the Map phase finishes, the Hadoop framework performs "partitioning, shuffle and sort".
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for each of its output (key, value) pairs, which reducer will receive them.
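For illustration, a minimal custom Partitioner might route keys by hash, which is essentially what Hadoop's default HashPartitioner does; the class below is hypothetical.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same key always goes to the same reducer; mask keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}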
- Shuffle
After the first map tasks have completed, the nodes may still be performing several more map tasks each, but they also begin transferring the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the Reducer.
What is a Combiner?
The Combiner is a "mini-reduce" process which
operates only on data generated by a mapper. The Combiner will receive as input
all data emitted by the Mapper instances on a given node. The output from the
Combiner is then sent to the Reducers, instead of the output from the Mappers.
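A driver sketch tying this together: because word count's reduce is a plain sum, the IntSumReducer from the word-count sketch above can double as the Combiner, so each node pre-aggregates its mapper output before the shuffle. Paths and class names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);   // mini-reduce on each node's map output
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}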