Basic Understanding of HBase

In this post, I am keeping some bullet points covering frequently used terms in HBase, so it will give you an overall idea of HBase.

  • HBase is a NoSQL column-oriented database. Columns in an HBase table are grouped into column families, and a table contains one or more column families. Each row in a table is identified by a row key, and each column is identified by a qualified name, i.e. Column Family Name + Column Name (illustrated in the sketch after this list).
  • Row keys are unique within a table and are always treated as a byte[].
  • There’s no design-time way to specify row keys.
  • HBase is an open-source implementation of Google’s Bigtable architecture.
  • HBase is schema-less.
  • HBase tables are broken up into horizontal partitions called regions. A region is a subset of the table’s data, a contiguous collection of rows, and regions are distributed across HDFS DataNodes.
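
To make the data model concrete, here is a minimal sketch using the classic HBase Java client API. The table name “users”, column family “info”, and column “city” are made up for illustration, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDataModelSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "users" with column family "info".
            HTable table = new HTable(conf, "users");

            // The row key is always a byte[]; a column is family + qualifier.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Dhaka"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("info:city = " + Bytes.toString(value));

            table.close();
        }
    }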

Read More…

Hive Architecture

In this post, I tried to show most of the Hive components and their dependencies, from old Hive versions to new. I made a single architecture diagram that may help you visualize the complete Hive architecture, including the common client interfaces. I kept the post content minimal other than the big diagram, so it will help you visualize instead of the usual reading and forgetting (in my case 🙂 ). It includes HiveServer1 and HiveServer2 as well. HiveServer2, introduced in Hive 0.11.0, is a rewrite of HiveServer1 (sometimes called HiveServer or Thrift Server) that addresses the multi-client concurrency and authentication problems I will discuss later in this post. Use of HiveServer2 is recommended.
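
As a quick taste, clients typically talk to HiveServer2 over JDBC. Here is a minimal sketch; the host, port 10000, user, and database name are assumptions for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveServer2ClientSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; HiveServer1 used the older
            // org.apache.hadoop.hive.jdbc.HiveDriver instead.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SHOW TABLES");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            con.close();
        }
    }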

Read More…

Data Warehouse: Classic Use Cases for Hadoop in DW

Enterprise Data Warehousing (EDW) has been a mainstay of many major corporations for the last 20 years. However, with the tremendous growth of data (doubling every two years), enterprise data warehouses are exceeding their capacity too quickly. Load-processing windows are similarly being maxed out, adversely affecting service and threatening the delivery of critical business insights. As a result, it becomes very expensive for organisations to process and maintain large datasets.

Read More…

How MapReduce Works

  1. Write a MapReduce Java program and bundle it in a JAR file. You can have a look at my previous post “First Example Project using Eclipse“ to see how to create a MapReduce program in Java using Eclipse and bundle it into a JAR file.
  2. The client submits the job to the JobTracker by running the JAR file ($ hadoop jar ….). Actually, the driver program (WordCountDriver.java) acts as the client, submitting the job by calling “JobClient.runJob(conf);“ (sketched below). The program can run on any node (as a separate JVM) in the Hadoop cluster or outside the cluster. In our example, we are running the client program on the same machine where the JobTracker is running, usually the NameNode. The job submission steps include:
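
Before walking through those steps, here is a minimal sketch of what such a driver, together with its Mapper and Reducer, can look like using the old org.apache.hadoop.mapred API; the class names and paths are illustrative, not necessarily identical to the earlier post:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountDriver {

        public static class WordCountMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // Emit (word, 1) for every token in the input line.
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        public static class WordCountReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // Sum the counts for each word.
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // The driver acts as the client: runJob() submits the job to
            // the JobTracker and polls until it completes.
            JobClient.runJob(conf);
        }
    }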

    Read More…

CAP Theorem: How can a distributed system provide C + A without P?

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency – all nodes see the same data at the same time; a read always returns the result of the most recent write.
  • Availability – a guarantee that every node always answers queries and accepts updates.
  • Partition tolerance – the system continues working even if one or more nodes become silent or unresponsive (messages between them are lost).

Read More…

Hadoop: MapReduce Vs Spark

Sometimes I come across the question “Is Apache Spark going to replace Hadoop MapReduce?“. It depends on your use cases. Here I have tried to explain the features of Apache Spark and Hadoop MapReduce as data processing frameworks. I hope this blog post will help answer some of the questions that may have been coming to your mind these days.

Read More…

Hadoop HDFS Federation

In Hadoop 1.x, there is only one NameNode (i.e. only one active NameNode is allowed) in a cluster, which maintains a single namespace (a single directory structure) for the entire cluster. As Hadoop clusters grow larger and larger as enterprise platforms, and since the entire file system metadata is stored in the NameNode’s memory (RAM), once there are many DataNodes with many files, the NameNode memory reaches its limit and becomes the limiting factor for cluster scaling (limiting the number of files stored in the cluster). In Hadoop 1.x, the namespace can only be scaled vertically (by adding more RAM) on a single NameNode.
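
HDFS Federation in Hadoop 2.x addresses this by running several independent NameNodes, each serving its own slice of the namespace. A hedged hdfs-site.xml sketch, where the nameservice IDs and hostnames are made up for illustration:

    <!-- hdfs-site.xml (sketch): two independent NameNodes, each owning
         part of the namespace. -->
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>ns1,ns2</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1.example.com:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2.example.com:8020</value>
      </property>
    </configuration>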

Read More…

Data Warehouse: Teradata Vs Hadoop

Teradata is a fully horizontally scalable relational database management system (RDBMS). In other words, it is a Massively Parallel Processing (MPP) database system based on a cluster of commodity hardware (computers) called “shared-nothing” nodes (each node has its own CPU, memory, and disks to process data locally) connected through a high-speed interconnect. It relies on horizontal partitioning of relational tables, along with the parallel execution of SQL queries.

Read More…

ZooKeeper: Distributed Coordination Service

Apache ZooKeeper (latest version 3.5.0) is an open-source distributed coordination service for maintaining centralized configuration information and naming, and for providing distributed synchronization. It was originally developed at Yahoo and is written in Java. Back in 2006, Google published a paper on “Chubby“; ZooKeeper, not surprisingly, is a close clone of Chubby.
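
To give a feel for the API, here is a minimal sketch that uses the ZooKeeper Java client to store and read back a piece of centralized configuration; the connect string, znode path, and value are made up:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper ensemble (3-second session timeout,
            // no default watcher).
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

            // Store a small piece of configuration in a znode...
            zk.create("/app-config", "db.host=10.0.0.5".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // ...and read it back, as any client in the cluster could.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }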

Read More…

Storm: Real-Time Data Processing

Apache Storm (latest version 0.9.3) is an open-source distributed real-time computation system for extremely fast processing of large volumes of data. It was originally developed at BackType (later acquired by Twitter) and is written in Clojure. Today’s slogan – data expires fast. Very fast. We need to process it before it expires.

Read More…

Kafka: Building a Real-Time Data Pipeline

Apache Kafka (latest version 0.8.2.1) is an open-source distributed publish-subscribe messaging system for data integration. It was originally developed at LinkedIn and is written in Scala. The project aims to collect and deliver huge volumes of log data with low latency, handling real-time data feeds through a data pipeline (data moving from one point to another). The design is heavily influenced by log processing.
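
As a small illustration of the publish side, here is a producer sketch using the new Java producer API that ships with 0.8.2; the broker address, topic name, and message are made up:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaLogProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<String, String>(props);
            // Publish a log line to the "app-logs" topic; any number of
            // consumer groups can subscribe to it independently.
            producer.send(new ProducerRecord<String, String>(
                    "app-logs", "host-1", "GET /index.html 200"));
            producer.close();
        }
    }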

Read More…

YARN: NextGen Hadoop Data Processing Framework

In this Big Data world, massive data storage and fast processing are big challenges. Hadoop is the solution to these challenges. Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters (thousands of machines) of commodity (low-cost) hardware. Hadoop has two core components, HDFS and MapReduce. HDFS (Hadoop Distributed File System) stores massive data on commodity machines in a distributed manner. MapReduce is a distributed data processing framework to work with this massive data.

Read More…

HDFS File Blocks Distribution in DataNodes

Background

When a file is written to HDFS, it is split up into big chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case, left at the default, which is 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (for most of this post, set to 3, which is the default). Each copy of a block is called a replica.
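
For reference, here is a sketch of how those two parameters look in hdfs-site.xml, with the default values discussed in this post:

    <configuration>
      <property>
        <name>dfs.block.size</name>
        <value>67108864</value> <!-- 64MB, specified in bytes -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- number of replicas per block -->
      </property>
    </configuration>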

Read More…

HDFS File Block and Input Split

Blocks are a physical division of the data, and input splits are a logical division. One input split can map to multiple physical blocks. When Hadoop submits a job, it splits the input data logically, and each split is processed by a Mapper task. The number of Mappers is equal to the number of splits. One important thing to remember is that an InputSplit doesn’t contain the actual data but only a reference (storage locations) to the data. A split basically has 2 things (see the sketch below):
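
Those 2 things are the split’s length in bytes and its storage locations (hostnames). In the newer MapReduce API this is exactly the contract of org.apache.hadoop.mapreduce.InputSplit; here is a minimal sketch of a custom split (the class itself is hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.InputSplit;

    // A minimal sketch of what an InputSplit carries: no data, just a
    // length and the hosts holding the underlying blocks. (A real split,
    // e.g. FileSplit, is also serializable so it can be shipped to tasks.)
    public class SketchSplit extends InputSplit {
        private final long length;
        private final String[] hosts;

        public SketchSplit(long length, String[] hosts) {
            this.length = length;
            this.hosts = hosts;
        }

        @Override
        public long getLength() throws IOException, InterruptedException {
            return length; // size of the split in bytes
        }

        @Override
        public String[] getLocations() throws IOException, InterruptedException {
            return hosts;  // DataNodes where the blocks live, for data locality
        }
    }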

Read More…

Hadoop HDFS High Availability

See: The Glossary

Prior to Hadoop 2.x (Hadoop 1.x), the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This reduced the total availability of the HDFS cluster in two major ways:

Read More…