The Hadoop Ecosystem Table

A list of the major projects and tools surrounding Hadoop, grouped by category, that together make up an Enterprise Data Platform. The ecosystem is growing at a rapid pace to keep up with the three Vs of Big Data: Volume (Big), Velocity (Fast) and Variety (Smart). Find the table below:

| Category | Project | Remarks | Latest stable release | Initially developed by | Project Owner | Written in |
|---|---|---|---|---|---|---|
| Data Storage | HDFS | Redundant and reliable massive data storage | v2.6.0 / 18-Nov-2014 | Introduced by Google | Apache | Java |
| Data Processing: Batch | MapReduce | Distributed data processing framework | v2.6.0 / 18-Nov-2014 | Introduced by Google | Apache | Java |
| Data Processing: Batch | YARN | Cluster resource management framework | v2.6.0 / 18-Nov-2014 | Apache | Apache | Java |
| Data Processing: Real-Time | YARN | Cluster resource management framework | v2.6.0 / 18-Nov-2014 | Apache | Apache | Java |
| Data Processing: Real-Time | Storm | Stream-based task parallelism | v0.9.3 / 25-Nov-2014 | Twitter | Apache | Clojure |
| Data Processing: Real-Time | Spark | Stream-based data parallelism | v1.2.0 / 18-Dec-2014 | Berkeley | Apache | Scala |
| Data Access | MapReduce | Java API | v2.6.0 / 18-Nov-2014 | Introduced by Google | Apache | Java |
| Data Access | Hive | Framework to run SQL-like HiveQL queries | v0.14.0 / 12-Nov-2014 | Facebook | Apache | Java |
| Data Access | Pig | Framework to run scripts written in Pig Latin | v0.14.0 / 20-Nov-2014 | Yahoo | Apache | |
| SQL on Hadoop | Hive | SQL-like language HiveQL | v0.14.0 / 12-Nov-2014 | Facebook | Apache | Java |
| SQL on Hadoop | Impala | Interactive SQL support, much faster than Hive | v2.1.0 / Dec-2014 | Introduced by Google | Cloudera | C++ |
| SQL on Hadoop | Stinger | More advanced SQL support (full ACID support) than Hive; 100x faster than Hive | v0.14.0 / 12-Nov-2014 | Facebook | Hortonworks | |
| SQL on Hadoop | HCatalog | Relational table view of data in HDFS | HCatalog merged with Hive (in March 2013) | Apache | Apache | Java |
| Database | HBase | NoSQL, column oriented | v0.98.4 / 21-Jul-2014 | Google’s BigTable | Apache | Java |
| Database | Cassandra | NoSQL, column oriented | v2.1.2 / 10-Nov-2014 | Facebook | Apache | Java |
| Data Ingestion and Integration | Flume | Ingests unstructured or semi-structured data into HDFS | v1.5.2 / 18-Nov-2014 | Apache | Apache | Java |
| Data Ingestion and Integration | Sqoop | Tool designed for efficiently transferring bulk structured data (RDBMS) into HDFS and vice versa | v1.4.4 / 31-Jul-2013 | Apache | Apache | Java |
| Data Ingestion and Integration | Kafka | Distributed publish-subscribe messaging system for data integration | v0.8.1.1 / 29-Apr-2014 | LinkedIn | Apache | Scala |
| Hadoop Administration | Cloudera Manager | Web-based cluster management UI | | Cloudera | Cloudera | |
| Hadoop Administration | Ambari | Web-based cluster management UI | v1.7.0 / 15-Dec-2014 | Hortonworks | Apache | |
| Data Serialization | Avro | Data serialization framework | v1.7.7 / 23-Jul-2014 | Yahoo | Apache | Java |
| Data Mining | Mahout | Library of machine learning algorithms | v0.9 / 01-Feb-2014 | Apache | Apache | Java |
| Workflow | Oozie | Defines a collection of jobs with their execution sequence and schedule time | v4.0.1 / 30-Sep-2014 | Apache | Apache | Java |
| Security | Sentry | Role-based authorization of data stored on an Apache Hadoop cluster | v1.3.0 / 15-May-2014 | Cloudera | Apache | |
| Security | Ranger | Role-based authorization of data stored on an Apache Hadoop cluster | v0.4.0 / 17-Nov-2014 | Hortonworks | Apache | |
| Operational Services | Hue | Web-based UI (like an IDE, but not one) to work with the Hadoop ecosystem | v3.7.0 / 09-Oct-2014 | Apache | Apache | Python |
| Operational Services | Zookeeper | Coordination service between Hadoop ecosystem components; centralized configuration and synchronization system for the whole Hadoop ecosystem | v3.4.6 / 10-Mar-2014 | Yahoo | Apache | Java |

**Note**

Hadoop 1.0 & Hadoop 2.0:

Hadoop 1.0 – MR1 > HDFS

Hadoop 2.0 – MR2 > YARN > HDFS

Batch vs Real-Time vs Interactive vs Streaming: Batch processing means collecting the data first and processing everything at once. Real-time processing, on the other hand, is about how fresh the data is when it is processed (the time interval between data being collected and being processed). Interactive (ad-hoc) means the query cannot be determined in advance; it is made up on the spot. Streaming means data is processed immediately as it arrives (e.g. stock market data, cricket scores).

Therefore, what today’s market needs is fast ad-hoc (interactive) queries on streaming (continuously flowing) data.
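The difference between batch and streaming can be sketched in a few lines of plain Python. This is only an illustration, not how Hadoop, Storm or Spark implement it; the function names `batch_average` and `streaming_average` and the sample `ticks` values are made up for the example.

```python
from typing import Iterable, Iterator, List


def batch_average(prices: List[float]) -> float:
    """Batch: wait until all data has been collected, then process it once."""
    return sum(prices) / len(prices)


def streaming_average(prices: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result immediately as each value arrives."""
    count = 0
    total = 0.0
    for price in prices:
        count += 1
        total += price
        yield total / count


# Hypothetical stock ticks, used only for illustration.
ticks = [10.0, 12.0, 14.0]

print(batch_average(ticks))            # one answer, after everything is collected
print(list(streaming_average(ticks)))  # a fresh answer after every tick
```

The batch version produces a single result at the end, while the streaming version produces an up-to-date result after every incoming value.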

Pattern vs framework: A pattern is a way of doing something. A framework, on the other hand, is an implementation of the pattern.
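As a rough illustration, take the map/reduce pattern: map each record to key-value pairs, group the values by key, then reduce each group. The sketch below is a toy, single-machine "framework" that implements that pattern; `run_map_reduce`, `word_mapper` and `count_reducer` are hypothetical names for this example and are not part of the Hadoop API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Toy framework: implements the map -> group by key -> reduce pattern."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# The user of the framework only supplies the two functions below.
def word_mapper(line: str) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word: str, counts: List[int]) -> int:
    """Add up all the counts emitted for one word."""
    return sum(counts)


lines = ["Hadoop stores data", "Hadoop processes data"]
word_counts = run_map_reduce(lines, word_mapper, count_reducer)
print(word_counts)
```

The pattern is the idea (map, group, reduce); the framework is the reusable code that carries it out, so that only the mapper and reducer have to be written for each new job.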

Task Parallelism vs Data Parallelism: Task parallelism is a form of parallelization in which different tasks are distributed across multiple processors (CPUs) in a parallel computing environment.

As a simple example, if we are running code on a 2-processor system (CPUs “a” & “b”) in a parallel environment and we wish to do tasks “A” and “B”, it is possible to tell CPU “a” to do task “A” and CPU “b” to do task “B” simultaneously.

Data parallelism is achieved when each processor (CPU) performs the same task on different pieces of distributed data.

For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and we wish to do a task on some data “d”. It is possible to tell CPU A to do that task on one part of “d” and CPU B on another part simultaneously.
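The two examples above can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration under a few assumptions: the tasks `task_a` and `task_b` and the data `d` are made up, and two worker threads stand in for the two CPUs. In standard CPython, threads do not necessarily run CPU-bound work truly in parallel, so the sketch shows how the work is divided rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a(numbers):
    """Task "A": add the numbers up."""
    return sum(numbers)


def task_b(numbers):
    """Task "B": find the largest number."""
    return max(numbers)


d = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical data "d"

with ThreadPoolExecutor(max_workers=2) as pool:
    # Task parallelism: two different tasks, each handed to its own worker.
    future_a = pool.submit(task_a, d)
    future_b = pool.submit(task_b, d)
    task_results = (future_a.result(), future_b.result())

    # Data parallelism: the same task on different parts of the data.
    half = len(d) // 2
    parts = [d[:half], d[half:]]
    partial_sums = list(pool.map(task_a, parts))

# Combine the partial results into the answer for the whole of "d".
data_result = sum(partial_sums)

print(task_results)
print(partial_sums, data_result)
```

In the first half, the workers differ in what they do; in the second half, they differ only in which part of the data they work on.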
