The Hadoop Ecosystem Table

This is a list of the major projects and tools surrounding Hadoop, grouped by category, that together make up an enterprise data platform. The ecosystem is growing at a rapid pace to address the three Vs of Big Data: Volume (Big), Velocity (Fast) and Variety (Smart). The table follows:

Project	Remarks	Latest stable release	Initially developed by	Project owner	Written in
Data Storage
HDFS	Redundant and reliable massive data storage	v2.6.0 / 18-Nov-2014	Introduced by Google	Apache	Java
Data Processing: Batch
MapReduce	Distributed data processing framework	v2.6.0 / 18-Nov-2014	Introduced by Google	Apache	Java
YARN	Cluster resource management framework	v2.6.0 / 18-Nov-2014	Apache	Apache	Java
Data Processing: Real-Time
YARN	Cluster resource management framework	v2.6.0 / 18-Nov-2014	Apache	Apache	Java
Storm	Stream-based task parallelism	v0.9.3 / 25-Nov-2014	Twitter	Apache	Clojure
Spark	Stream-based data parallelism	v1.2.0 / 18-Dec-2014	Berkeley	Apache	Scala
Data Access
MapReduce	Java API	v2.6.0 / 18-Nov-2014	Introduced by Google	Apache	Java
Hive	Framework to run SQL-like HiveQL queries	v0.14.0 / 12-Nov-2014	Facebook	Apache	Java
Pig	Framework to run scripts written in Pig Latin	v0.14.0 / 20-Nov-2014	Yahoo	Apache
SQL on Hadoop
Hive	SQL-like language HiveQL	v0.14.0 / 12-Nov-2014	Facebook	Apache	Java
Impala	Interactive SQL support; reported to be much faster than Hive	v2.1.0 / Dec-2014	Inspired by Google’s Dremel	Cloudera	C++
Stinger	More advanced SQL support than stock Hive (full ACID support); claimed to be 100x faster than Hive	v0.14.0 / 12-Nov-2014	Facebook	Hortonworks
HCatalog	Relational table view of data in HDFS	Merged with Hive (March 2013)	Apache	Apache	Java
Database
HBase	NoSQL column-oriented database	v0.98.4 / 21-Jul-2014	Google’s BigTable	Apache	Java
Cassandra	NoSQL column-oriented database	v2.1.2 / 10-Nov-2014	Facebook	Apache	Java
Data Ingestion and Integration
Flume	Import/export of unstructured or semi-structured data; data ingestion into HDFS	v1.5.2 / 18-Nov-2014	Apache	Apache	Java
Sqoop	Tool for efficiently transferring bulk structured data from an RDBMS into HDFS and vice versa	v1.4.4 / 31-Jul-2013	Apache	Apache	Java
Kafka	Distributed publish-subscribe messaging system for data integration	v0.8.1.1 / 29-Apr-2014	LinkedIn	Apache	Scala
Hadoop Administration
Cloudera Manager	Web based cluster management UI		Cloudera	Cloudera
Ambari	Web based cluster management UI	v1.7.0 / 15-Dec-2014	Hortonworks	Apache
Data Serialization
Avro	Data serialization framework	v1.7.7 / 23-Jul-2014	Yahoo	Apache	Java
Data Mining
Mahout	Library of machine learning algorithms	v0.9 / 01-Feb-2014	Apache	Apache	Java
Workflow
Oozie	Defines a collection of jobs with their execution sequence and schedule	v4.0.1 / 30-Sep-2014	Apache	Apache	Java
Security
Sentry	Role based authorization of data stored on an Apache Hadoop cluster.	v1.3.0 / 15-May-2014	Cloudera	Apache
Ranger	Role based authorization of data stored on an Apache Hadoop cluster.	v0.4.0 / 17-Nov-2014	Hortonworks	Apache
Operational Services
Hue	Web-based UI (IDE-like, though not a full IDE) for working with Hadoop ecosystem components	v3.7.0 / 09-Oct-2014	Cloudera	Cloudera	Python
ZooKeeper	Coordination service for Hadoop ecosystem components; centralized configuration and synchronization for the whole ecosystem	v3.4.6 / 10-Mar-2014	Yahoo	Apache	Java

**Note**

Hadoop 1.0 & Hadoop 2.0:

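Hadoop 1.0 – MR1 > HDFS (MapReduce runs directly on HDFS and also manages cluster resources)

Hadoop 2.0 – MR2 > YARN > HDFS (MapReduce runs on YARN, which manages cluster resources on top of HDFS)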

Batch vs Real-Time vs Interactive vs Streaming: Batch processing means collecting data and then processing it all at once. Real-time processing, on the other hand, is about how fresh the data is, that is, how short the interval is between collecting the data and processing it. Interactive (ad-hoc) means the query is not known in advance; it is made up on the spot. Streaming means data is processed immediately as it arrives (e.g. stock market data, cricket scores).

Therefore, what today’s market needs is fast ad-hoc (interactive) queries on streaming (continuously flowing) data.
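
The difference between batch and streaming processing can be sketched in a few lines of Python. This is only an illustration, not how Hadoop, Storm or Spark implement it: the names `batch_average` and `StreamingAverage` and the sample prices are made up for the example.

```python
def batch_average(records):
    """Batch: wait until every record is collected, then process once."""
    return sum(records) / len(records)


class StreamingAverage:
    """Streaming: update the result as soon as each record arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Fold the new record into the running state and return
        # the average over everything seen so far.
        self.count += 1
        self.total += value
        return self.total / self.count


if __name__ == "__main__":
    prices = [10.0, 12.0, 11.0]  # hypothetical stock prices

    # Batch: one answer, available only after all data is in.
    print("batch:", batch_average(prices))

    # Streaming: a fresh answer after every incoming record.
    stream = StreamingAverage()
    for price in prices:
        print("streaming:", stream.update(price))
```

The batch function cannot answer until the whole list exists, while the streaming version always has a current answer, which is what an ad-hoc query on flowing data needs.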

Pattern vs Framework: A pattern is a way of doing something. A framework, on the other hand, is an implementation of a pattern.

Task Parallelism vs Data Parallelism: Task parallelism is a form of parallelization in which different tasks are distributed across multiple processors (CPUs) in a parallel computing environment.

As a simple example, if we are running code on a 2-processor system (CPUs “a” & “b”) in a parallel environment and we wish to do tasks “A” and “B”, it is possible to tell CPU “a” to do task “A” and CPU “b” to do task “B” simultaneously.
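
A minimal Python sketch of this idea, assuming two made-up tasks (`task_a` and `task_b`) and using a two-worker pool to stand in for CPUs “a” and “b”. Threads are used only to keep the example short; for CPU-bound work in CPython a process pool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a(numbers):
    """Task "A" (hypothetical): add up a list of numbers."""
    return sum(numbers)


def task_b(words):
    """Task "B" (hypothetical): find the longest word."""
    return max(words, key=len)


def run_task_parallel():
    # Two workers, two different tasks: each worker runs its own task
    # on its own input at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(task_a, [1, 2, 3, 4])
        future_b = pool.submit(task_b, ["pig", "hive", "spark"])
        return future_a.result(), future_b.result()


if __name__ == "__main__":
    print(run_task_parallel())
```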

Data parallelism is achieved when each processor (CPU) performs the same task on different pieces of distributed data.

For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and we wish to do a task on some data “d”. It is possible to tell CPU A to do that task on one part of “d” and CPU B on another part simultaneously.
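
The same can be sketched in Python: one task (here a made-up `square_all`) is applied to different pieces of the data “d” by two workers. The chunking scheme and the function names are assumptions for illustration; real frameworks partition data in their own ways.

```python
from concurrent.futures import ThreadPoolExecutor


def square_all(chunk):
    """The single task that every worker applies to its piece of the data."""
    return [x * x for x in chunk]


def run_data_parallel(d, workers=2):
    # Split "d" into one contiguous piece per worker.
    size = max(1, (len(d) + workers - 1) // workers)
    chunks = [d[i:i + size] for i in range(0, len(d), size)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Same task, different data; map() returns results in chunk order.
        parts = list(pool.map(square_all, chunks))

    # Stitch the partial results back together.
    return [y for part in parts for y in part]


if __name__ == "__main__":
    print(run_data_parallel([1, 2, 3, 4, 5]))
```

In contrast to the task-parallel case, both workers run identical code here; only the slice of data they receive differs.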
