Hadoop HA – Integral, Reliable & Simple

High Availability (HA) in a traditional RDBMS, such as Oracle, is a complicated business – not for the faint of heart. It also comes with high capital and operational costs, since it is treated as a separate component. In contrast, HA in Hadoop feels like a natural progression of its cluster- and configuration-oriented architecture. HA in Hadoop is a collaborative arrangement among the active NameNode, the JournalNodes and the Standby NameNode. The JournalNodes arbitrate between the active and Standby NameNodes, accepting edits only from the current active NameNode and helping ensure the integrity of the edit log.
NameNode
At the heart of Hadoop’s HA is the NameNode. The active NameNode manages the entire filesystem namespace of the cluster and the metadata associated with it. Within its metadata directory, the fsimage and edits files are critically important for all client operations. The configuration parameter “dfs.namenode.name.dir” dictates the location of the NameNode metadata directory.
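A minimal hdfs-site.xml entry for this parameter might look like the sketch below; the local path is an illustrative example, not a required value.

  <!-- hdfs-site.xml: location of the NameNode metadata directory (path is illustrative) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/hdfs/namenode</value>
  </property>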
fsimage – contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID.
edits – a log that lists each file system change (file creation, deletion or modification) made after the most recent fsimage.

Below is the NameNode metadata directory structure:
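The layout below is an illustrative example (the transaction IDs are made up); the exact files present depend on checkpointing and retention settings.

  ${dfs.namenode.name.dir}/
      current/
          VERSION
          edits_0000000000000000001-0000000000000000019
          edits_inprogress_0000000000000000020
          fsimage_0000000000000000000
          fsimage_0000000000000000000.md5
          fsimage_0000000000000000019
          fsimage_0000000000000000019.md5
          seen_txid
      in_use.lock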

The “VERSION” file contains:
layoutVersion – the version of the HDFS metadata format
namespaceID – unique identifier of the filesystem namespace
clusterID – logical identifier of the entire cluster
blockpoolID – unique identifier of the block pool managed by this NameNode
storageType – possible values are NAME_NODE and JOURNAL_NODE (the latter appears in an HA setup)
cTime – creation time of the filesystem state
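For illustration, VERSION is a small Java-properties style file; the identifiers below are made-up example values.

  #Fri Nov 14 10:30:05 EST 2014
  namespaceID=1342387246
  clusterID=CID-6d280319-1f89-44f6-9a64-0dc7b3d0b2c1
  cTime=0
  storageType=NAME_NODE
  blockpoolID=BP-526805057-10.0.0.1-1411980876842
  layoutVersion=-63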

The files in the metadata directory follow this naming convention:

edits_<start transaction ID>-<end transaction ID> – a finalized transaction log; it contains the transactions in the range given by the file name (for example, edits_0000000000000000001-0000000000000000019 holds transactions 1 through 19)
edits_inprogress_<start transaction ID> – the edit log that is currently being written to
fsimage_<end transaction ID> – a complete metadata image; each fsimage file also has a corresponding MD5 checksum to help protect against corruption

JournalNode
Hadoop’s HA requires JournalNodes to ensure the integrity of the edit logs. The active NameNode writes its edits to a quorum of JournalNodes, and the Standby NameNode reads those edits from the JournalNodes and applies them to its own namespace. The parameter “dfs.journalnode.edits.dir” dictates where each JournalNode stores its local copy of the edits.
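A sketch of the two JournalNode-related settings in hdfs-site.xml is shown below; the host names, the nameservice name “mycluster” and the local path are illustrative placeholders (8485 is the default JournalNode RPC port).

  <!-- Where the NameNodes write/read the shared edits (a quorum of JournalNodes) -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Where each JournalNode keeps its local copy of the edits -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/hadoop/journalnode</value>
  </property>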
Last-promised-epoch
Whenever a NameNode becomes active, it first generates an epoch number. The first active NameNode starts with epoch number 1, and every failover or restart increments it. The epoch number provides the fencing mechanism by which a new active NameNode ensures that the previous active NameNode can no longer make changes to the cluster metadata. Whenever a NameNode sends a message (RPC) to a JournalNode, it includes its epoch number, which the JournalNode checks against a locally stored value called the promised epoch. If the request comes from a newer epoch, the JournalNode records that epoch as its new promised epoch; if it comes from an older epoch, the JournalNode rejects the request. For example, if the standby takes over from an active NameNode that held epoch 3, it becomes active with epoch 4; any late write from the old active, still tagged with epoch 3, is then rejected by the JournalNodes.

Hardware Requirement
There must be at least three JournalNodes, since edit log modifications must be written to a majority of the JournalNodes; with three JournalNodes, the system tolerates the failure of one. Because JournalNode daemons are lightweight, they can be collocated on machines that run other daemons.
Configuration Details
dfs.nameservices – the logical name for the nameservice
dfs.ha.namenodes.<nameservice ID> – a unique identifier for each NameNode in the nameservice
dfs.namenode.rpc-address.<nameservice ID>.<NameNode ID> – the fully qualified RPC address for each NameNode to listen on
dfs.namenode.shared.edits.dir – the URI that identifies the group of JournalNodes to which the NameNodes write and from which they read edits
dfs.client.failover.proxy.provider.<nameservice ID> – the Java class that HDFS clients use to contact the active NameNode
dfs.ha.fencing.methods – a list of scripts or Java classes used to fence the active NameNode during a failover:
    sshfence – SSHes to the target node (optionally with a non-standard username and port) and uses fuser to kill the process listening on the service’s TCP port
    shell – runs an arbitrary shell command to fence the active NameNode
dfs.journalnode.edits.dir – the path where the JournalNode daemon stores its local state
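Put together, the NameNode-side HA settings in hdfs-site.xml typically look like the sketch below. The nameservice name “mycluster”, the NameNode IDs nn1/nn2 and the host names are illustrative placeholders; dfs.namenode.shared.edits.dir and dfs.journalnode.edits.dir were sketched in the JournalNode section above.

  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>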

Implementation

  1. After all the necessary configuration options have been set, start the JournalNode daemons on the machines where they will run. This can be done by running the command “hadoop-daemons.sh start journalnode” and waiting for the daemons to start on each of the relevant machines.
  2. Synchronize the NameNodes’ on-disk metadata
    1. Copy the contents of the active NameNode’s metadata directories to the Standby NameNode
    2. Run the command “hdfs namenode -bootstrapStandby” on the unformatted (new) NameNode. This command also ensures that the JournalNodes (dfs.namenode.shared.edits.dir) contain sufficient edit transactions to be able to start both NameNodes.
  3. Start the HA NameNodes as you would start a normal NameNode (the commands are consolidated below).
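For reference, a consolidated sketch of the commands from the steps above (Hadoop 2.x script names; run each on the appropriate host, and note that a brand-new cluster would first need “hdfs namenode -format” on one NameNode):

  # Step 1: start the JournalNode daemons on the machines where they will run
  hadoop-daemons.sh start journalnode

  # Step 2: on the unformatted (new) NameNode, pull the metadata from the
  # existing NameNode and verify the shared edits on the JournalNodes
  hdfs namenode -bootstrapStandby

  # Step 3: start each HA NameNode as you would a normal NameNode
  hadoop-daemon.sh start namenode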