Deep-dive: Understanding NameNode

May 31, 2015 (updated June 17, 2015) | Bikash Sen | Hadoop 2.0 | Tags: active namenode, checkpoint, checkpointing, editlog, edits, fsimage, Hadoop High Availability, hdfs federation, hdfs filesystem, HDFS HA, metadata, namenode, namenode restart, namespace, secondary namenode, standby namenode

The NameNode is the most critical component of HDFS. It manages the entire file system metadata (i.e., file owners, permissions, number of blocks, block locations, sizes, etc.) and maintains it in main memory. Clients contact the NameNode first for file metadata and then perform the actual file I/O directly with the DataNodes. If something goes wrong with the NameNode, whatever metadata was held in main memory would be lost permanently.
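The read path described above can be sketched with a toy in-memory model. This is only an illustration, not the real HDFS client API: `ToyNameNode`, `ToyDataNode`, `FileMeta`, `read_file`, and `build_demo` are all hypothetical names, and details such as replication, checksums, and RPC are left out.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    """Per-file metadata of the kind the NameNode keeps (hypothetical shape)."""
    owner: str
    permission: str
    block_ids: list


class ToyNameNode:
    """Holds metadata only, in memory; it never stores file bytes."""

    def __init__(self):
        self.namespace = {}        # path -> FileMeta
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, path, meta, locations):
        self.namespace[path] = meta
        self.block_locations.update(locations)

    def get_block_locations(self, path):
        # Raises KeyError if the path is unknown to the namespace.
        meta = self.namespace[path]
        return [(b, self.block_locations[b]) for b in meta.block_ids]


class ToyDataNode:
    """Holds the actual block bytes."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client side: ask the NameNode for metadata, then read from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Read each block directly from the first DataNode that holds it.
        data += datanodes[holders[0]].read_block(block_id)
    return data


def build_demo():
    """Builds a two-block file spread over two DataNodes (made-up data)."""
    dn1, dn2 = ToyDataNode(), ToyDataNode()
    dn1.blocks["blk_1"] = b"hello "
    dn2.blocks["blk_2"] = b"hdfs"
    nn = ToyNameNode()
    nn.add_file(
        "/demo.txt",
        FileMeta(owner="alice", permission="rw-r--r--", block_ids=["blk_1", "blk_2"]),
        {"blk_1": ["dn1"], "blk_2": ["dn2"]},
    )
    return nn, {"dn1": dn1, "dn2": dn2}
```

In this sketch, clearing `namespace` on the toy NameNode makes the file unreadable even though both DataNodes still hold their blocks, which mirrors why losing the in-memory metadata is so damaging.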

Hadoop HDFS Federation

May 31, 2015 | Bikash Sen | Hadoop 2.0 | Tags: block management, block pool, cluster federation, hadoop federation, hadoop2, hdfs federation, hdfs isolation, namenode scalability, namespace, namespace management, namespace partition

In Hadoop 1.x, a cluster has only one NameNode (that is, only one active NameNode is allowed), which maintains a single namespace (a single directory structure) for the entire cluster. Hadoop clusters keep growing into large enterprise platforms, and the entire file system metadata is stored in NameNode memory (RAM). As more DataNodes and files are added, the NameNode's memory reaches its limit and becomes the limiting factor for cluster scaling, capping the number of files the cluster can store. In Hadoop 1.x, the namespace can only be scaled vertically (by adding more RAM) on a single NameNode.
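To make the memory limit concrete, here is a rough back-of-the-envelope estimate. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (each file and each block counts as one object); that figure is an approximation, not a value taken from this post, and the function name `estimate_namenode_heap_bytes` is made up for illustration. Directories are ignored for simplicity.

```python
def estimate_namenode_heap_bytes(num_files, blocks_per_file, bytes_per_object=150):
    """Rough NameNode heap estimate.

    Assumption: every file and every block is one in-memory object of
    roughly `bytes_per_object` bytes (a rule of thumb, not a measured value).
    """
    num_objects = num_files + num_files * blocks_per_file
    return num_objects * bytes_per_object


if __name__ == "__main__":
    # 10 million files, one block each -> 20 million objects.
    heap = estimate_namenode_heap_bytes(10_000_000, 1)
    print(f"{heap / 1e9:.1f} GB")
```

Under these assumptions the heap requirement grows linearly with the number of files and blocks, so a single NameNode can only keep up by adding RAM.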

YARN: NextGen Hadoop Data Processing Framework

March 27, 2015 (updated June 26, 2015) | Bikash Sen | Hadoop 2.0 | Tags: application master, container, hadoop, hadoop 2, map reduce, node manager, resource, resource manager, resource request, scheduler, yarn, yarn vs map reduce

In the Big Data world, storing massive amounts of data and processing it quickly is a major challenge. Hadoop is a solution to this challenge. Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters (thousands of machines) of commodity (low-cost) hardware. Hadoop has two core components, HDFS and MapReduce. HDFS (Hadoop Distributed File System) stores massive data sets across commodity machines in a distributed manner. MapReduce is a distributed data processing framework that works on this data.

Hadoop HDFS High Availability

February 19, 2015 (updated June 11, 2015) | Bikash Sen | Hadoop 2.0 | Tags: Automatic failover, HA, Hadoop Cluster HA, Hadoop High Availability, HDFS HA, High Availability, Journaling, Network Cluster, Quorum based

Prior to Hadoop 2.x (that is, in Hadoop 1.x), the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole was unavailable until the NameNode was either restarted or brought up on a separate machine.

This reduced the total availability of the HDFS cluster in two major ways:

Hadoop ABCD

Let's Do Big Data…

Hadoop 2.0