Basic Understanding of HBase

In this post, I am keeping some bullet points covering frequently used terms in HBase, so it will give you an overall idea of HBase.

  • HBase is a NoSQL column-oriented database. Columns in an HBase table are grouped into column families, and a table contains one or more column families. Each row in a table is identified by a row key, and each column is identified by a qualified name, i.e. Column Family Name + Column Name (illustrated in the sketch after this list).
  • Row keys are unique within a table and are always treated as a byte[].
  • There’s no design-time way to specify row keys.
  • HBase is an open-source implementation of Google’s Bigtable architecture.
  • HBase is schema-less.
  • HBase tables are broken up into horizontal partitions called regions. A region is a subset of the table’s data, a contiguous collection of rows, and regions are distributed across HDFS DataNodes.
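
To make the data model concrete, here is a minimal sketch using the classic HBase Java client API. The table name “users”, column family “info”, and column “city” are made up for illustration, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDataModelSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "users" with column family "info".
            HTable table = new HTable(conf, "users");

            // The row key is always a byte[]; a column is family + qualifier.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Dhaka"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("info:city = " + Bytes.toString(value));

            table.close();
        }
    }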

Read More…

Hive Architecture

In this post, I tried to show most of the Hive components and their dependencies, from old Hive versions to new. I made a single architecture diagram that may help you visualize the complete Hive architecture, including the common client interfaces. I kept the post content minimal other than the big diagram, so it will help you visualize instead of the usual reading and forgetting (in my case 🙂 ). It includes HiveServer1 and HiveServer2 as well. HiveServer2, introduced in Hive 0.11.0, is a rewrite of HiveServer1 (sometimes called HiveServer or Thrift Server) that addresses the multi-client concurrency and authentication problems I will discuss later in this post. Use of HiveServer2 is recommended.
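
As a quick taste, clients typically talk to HiveServer2 over JDBC. Here is a minimal sketch; the host, port 10000, user, and database name are assumptions for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveServer2ClientSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; HiveServer1 used the older
            // org.apache.hadoop.hive.jdbc.HiveDriver instead.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SHOW TABLES");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            con.close();
        }
    }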

Read More…

Data Warehouse: Classic Use Cases for Hadoop in DW

Enterprise Data Warehousing (EDW) has been a mainstay of many major corporations for the last 20 years. However, with the tremendous growth of data (doubling every two years), enterprise data warehouses are exceeding their capacity too quickly. Load-processing windows are similarly being maxed out, adversely affecting service and threatening the delivery of critical business insights. As a result, it becomes very expensive for organisations to process and maintain large datasets.

Read More…

How MapReduce Works

  1. Write a MapReduce Java program and bundle it in a JAR file. You can have a look at my previous post “First Example Project using Eclipse“ to see how to create a MapReduce program in Java using Eclipse and bundle it into a JAR file.
  2. The client submits the job to the JobTracker by running the JAR file ($ hadoop jar ….). Actually, the driver program (WordCountDriver.java) acts as the client, submitting the job by calling “JobClient.runJob(conf);“ (sketched below). The program can run on any node (as a separate JVM) in the Hadoop cluster or outside the cluster. In our example, we are running the client program on the same machine where the JobTracker is running, usually the NameNode. The job submission steps include:
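
Before walking through those steps, here is a minimal sketch of what such a driver, together with its Mapper and Reducer, can look like using the old org.apache.hadoop.mapred API; the class names and paths are illustrative, not necessarily identical to the earlier post:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountDriver {

        public static class WordCountMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // Emit (word, 1) for every token in the input line.
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        public static class WordCountReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // Sum the counts for each word.
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // The driver acts as the client: runJob() submits the job to
            // the JobTracker and polls until it completes.
            JobClient.runJob(conf);
        }
    }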

    Read More…

CAP Theorem: How can a distributed system provide C + A without P?

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency – all nodes see the same data at the same time; a read always returns the result of the most recent write.
  • Availability – a guarantee that every node always answers queries and accepts updates.
  • Partition tolerance – the system continues working even if one or more nodes become silent or unresponsive (messages between them are lost).

Read More…

Hadoop: MapReduce Vs Spark

Sometimes I come across the question “Is Apache Spark going to replace Hadoop MapReduce?“. It depends on your use cases. Here I have tried to explain the features of Apache Spark and Hadoop MapReduce as data processing frameworks. I hope this blog post will help answer some of the questions that may have been coming to your mind these days.

Read More…

Hadoop HDFS Federation

In Hadoop 1.x, there is only one NameNode (i.e. only one active NameNode is allowed) in a cluster, which maintains a single namespace (a single directory structure) for the entire cluster. As Hadoop clusters grow larger and larger as enterprise platforms, and since the entire file system metadata is stored in the NameNode’s memory (RAM), once there are many DataNodes with many files, the NameNode memory reaches its limit and becomes the limiting factor for cluster scaling (limiting the number of files stored in the cluster). In Hadoop 1.x, the namespace can only be scaled vertically (by adding more RAM) on a single NameNode.
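
HDFS Federation in Hadoop 2.x addresses this by running several independent NameNodes, each serving its own slice of the namespace. A hedged hdfs-site.xml sketch, where the nameservice IDs and hostnames are made up for illustration:

    <!-- hdfs-site.xml (sketch): two independent NameNodes, each owning
         part of the namespace. -->
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>ns1,ns2</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1.example.com:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2.example.com:8020</value>
      </property>
    </configuration>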

Read More…

Data Warehouse: Teradata Vs Hadoop

Teradata is a fully horizontally scalable relational database management system (RDBMS). In other words, it is a Massively Parallel Processing (MPP) database system based on a cluster of commodity hardware (computers) called “shared-nothing” nodes (each node has its own CPU, memory, and disks to process data locally) connected through a high-speed interconnect. It relies on horizontal partitioning of relational tables, along with the parallel execution of SQL queries.

Read More…

ZooKeeper: Distributed Coordination Service

Apache ZooKeeper (latest version 3.5.0) is an open-source distributed coordination service for maintaining centralized configuration information and naming, and for providing distributed synchronization. It was originally developed at Yahoo and is written in Java. Back in 2006, Google published a paper on “Chubby“; ZooKeeper, not surprisingly, is a close clone of Chubby.
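
To give a feel for the API, here is a minimal sketch that uses the ZooKeeper Java client to store and read back a piece of centralized configuration; the connect string, znode path, and value are made up:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper ensemble (3-second session timeout,
            // no default watcher).
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

            // Store a small piece of configuration in a znode...
            zk.create("/app-config", "db.host=10.0.0.5".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // ...and read it back, as any client in the cluster could.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }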

Read More…

Storm: Real-Time Data Processing

Apache Storm (latest version 0.9.3) is an open-source distributed real-time computation system for extremely fast processing of large volumes of data. It was originally developed at BackType (later acquired by Twitter) and is written in Clojure. Today’s slogan – data expires fast. Very fast. We need to process it before it expires.

Read More…

Kafka: Building a Real-Time Data Pipeline

Apache Kafka (latest version 0.8.2.1) is an open-source distributed publish-subscribe messaging system for data integration. It was originally developed at LinkedIn and is written in Scala. The project aims to collect and deliver huge volumes of log data with low latency, handling real-time data feeds through a data pipeline (data moving from one point to another). The design is heavily influenced by log processing.
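
As a small illustration of the publish side, here is a producer sketch using the new Java producer API that ships with 0.8.2; the broker address, topic name, and message are made up:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaLogProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<String, String>(props);
            // Publish a log line to the "app-logs" topic; any number of
            // consumer groups can subscribe to it independently.
            producer.send(new ProducerRecord<String, String>(
                    "app-logs", "host-1", "GET /index.html 200"));
            producer.close();
        }
    }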

Read More…

YARN: NextGen Hadoop Data Processing Framework

In this Big Data world, massive data storage and fast processing are big challenges. Hadoop is the solution to these challenges. Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters (thousands of machines) of commodity (low-cost) hardware. Hadoop has two core components, HDFS and MapReduce. HDFS (Hadoop Distributed File System) stores massive data on commodity machines in a distributed manner. MapReduce is a distributed data processing framework to work with this massive data.

Read More…

HDFS File Blocks Distribution in DataNodes

Background

When a file is written to HDFS, it is split up into big chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case, left at the default, which is 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (for most of this post, set to 3, which is the default). Each copy of a block is called a replica.
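
For reference, here is a sketch of how those two parameters look in hdfs-site.xml, with the default values discussed in this post:

    <configuration>
      <property>
        <name>dfs.block.size</name>
        <value>67108864</value> <!-- 64MB, specified in bytes -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- number of replicas per block -->
      </property>
    </configuration>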

Read More…

HDFS File Block and Input Split

Blocks are a physical division of the data, and input splits are a logical division. One input split can map to multiple physical blocks. When Hadoop submits a job, it splits the input data logically, and each split is processed by a Mapper task. The number of Mappers is equal to the number of splits. One important thing to remember is that an InputSplit doesn’t contain the actual data but only a reference (storage locations) to the data. A split basically has 2 things (see the sketch below):
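
Those 2 things are the split’s length in bytes and its storage locations (hostnames). In the newer MapReduce API this is exactly the contract of org.apache.hadoop.mapreduce.InputSplit; here is a minimal sketch of a custom split (the class itself is hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.InputSplit;

    // A minimal sketch of what an InputSplit carries: no data, just a
    // length and the hosts holding the underlying blocks. (A real split,
    // e.g. FileSplit, is also serializable so it can be shipped to tasks.)
    public class SketchSplit extends InputSplit {
        private final long length;
        private final String[] hosts;

        public SketchSplit(long length, String[] hosts) {
            this.length = length;
            this.hosts = hosts;
        }

        @Override
        public long getLength() throws IOException, InterruptedException {
            return length; // size of the split in bytes
        }

        @Override
        public String[] getLocations() throws IOException, InterruptedException {
            return hosts;  // DataNodes where the blocks live, for data locality
        }
    }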

Read More…

Hadoop HDFS High Availability

See: The Glossary

Prior to Hadoop 2.x (Hadoop 1.x), the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This reduced the total availability of the HDFS cluster in two major ways:

Read More…