A list of the major projects and tools surrounding Hadoop, grouped by category, that together make up an Enterprise Data Platform. The ecosystem is growing at a rapid pace to address the three Vs of Big Data: Volume (Big), Velocity (Fast) and Variety (Smart). See the table below:
| Project | Remarks | Latest stable release | Initially developed by | Project owner | Written in |
|---|---|---|---|---|---|
| **Data Storage** | | | | | |
| HDFS | Redundant and reliable massive data storage | v2.6.0 / 18-Nov-2014 | Based on Google's GFS paper | Apache | Java |
| **Data Processing: Batch** | | | | | |
| MapReduce | Distributed data processing framework | v2.6.0 / 18-Nov-2014 | Based on Google's MapReduce paper | Apache | Java |
| YARN | Cluster resource management framework | v2.6.0 / 18-Nov-2014 | Apache | Apache | Java |
| **Data Processing: Real-Time** | | | | | |
| YARN | Cluster resource management framework | v2.6.0 / 18-Nov-2014 | Apache | Apache | Java |
| Storm | Stream-based task parallelism | v0.9.3 / 25-Nov-2014 | | Apache | |
| Spark | Stream-based data parallelism | v1.2.0 / 18-Dec-2014 | Berkeley | Apache | Scala |
| **Data Access** | | | | | |
| MapReduce | Java API | v2.6.0 / 18-Nov-2014 | Based on Google's MapReduce paper | Apache | Java |
| Hive | Framework to run the SQL-like query language HiveQL | v0.14.0 / 12-Nov-2014 | | Apache | Java |
| Pig | Framework to run the scripting language Pig Latin | v0.14.0 / 20-Nov-2014 | Yahoo | Apache | |
| **SQL on Hadoop** | | | | | |
| Hive | SQL-like language HiveQL | v0.14.0 / 12-Nov-2014 | | Apache | Java |
| Impala | Interactive SQL support; much faster than Hive | v2.1.0 / Dec-2014 | Inspired by Google's Dremel | Cloudera | C++ |
| Stinger | More advanced SQL support than Hive (full ACID support); claimed 100x faster than Hive | v0.14.0 / 12-Nov-2014 | | Hortonworks | |
| HCatalog | Relational table view of data in HDFS | Merged with Hive in March 2013 | Apache | Apache | Java |
| **Database** | | | | | |
| HBase | NoSQL, column-oriented | v0.98.4 / 21-Jul-2014 | Google's BigTable | Apache | Java |
| Cassandra | NoSQL, column-oriented | v2.1.2 / 10-Nov-2014 | | Apache | Java |
| **Data Ingestion and Integration** | | | | | |
| Flume | Import/export of unstructured or semi-structured data into HDFS (data ingestion) | v1.5.2 / 18-Nov-2014 | Apache | Apache | Java |
| Sqoop | Tool for efficiently transferring bulk structured data (RDBMS) into HDFS and vice versa | v1.4.4 / 31-Jul-2013 | Apache | Apache | Java |
| Kafka | Distributed publish-subscribe messaging system for data integration | v0.8.1.1 / 29-Apr-2014 | | Apache | Scala |
| **Hadoop Administration** | | | | | |
| Cloudera Manager | Web-based cluster management UI | | Cloudera | Cloudera | |
| Ambari | Web-based cluster management UI | v1.7.0 / 15-Dec-2014 | Hortonworks | Apache | |
| **Data Serialization** | | | | | |
| Avro | Data serialization framework | v1.7.7 / 23-Jul-2014 | Yahoo | Apache | Java |
| **Data Mining** | | | | | |
| Mahout | Library of machine learning algorithms | v0.9 / 01-Feb-2014 | Apache | Apache | Java |
| **Workflow** | | | | | |
| Oozie | Defines a collection of jobs with their execution sequence and schedule | v4.0.1 / 30-Sep-2014 | Apache | Apache | Java |
| **Security** | | | | | |
| Sentry | Role-based authorization of data stored on an Apache Hadoop cluster | v1.3.0 / 15-May-2014 | Cloudera | Apache | |
| Ranger | Role-based authorization of data stored on an Apache Hadoop cluster | v0.4.0 / 17-Nov-2014 | Hortonworks | Apache | |
| **Operational Services** | | | | | |
| Hue | Web-based UI (IDE-like) for working with Hadoop | v3.7.0 / 09-Oct-2014 | Apache | Apache | Python |
| ZooKeeper | Coordination service for Hadoop components | v3.4.6 / 10-Mar-2014 | Yahoo | Apache | Java |
**Note**
Hadoop 1.0 & Hadoop 2.0:
Hadoop 1.0: MapReduce (MR1) runs directly on top of HDFS.
Hadoop 2.0: MapReduce (MR2) runs on top of YARN, which in turn runs on top of HDFS.
Batch vs Real-Time vs Interactive vs Streaming: Batch processing means collecting data and processing it all at once. Real-time processing, on the other hand, is about how fresh the data is, that is, how short the interval is between collecting the data and processing it. Interactive (ad-hoc) means the query is not known in advance; it is made up on the spot. Streaming means data is processed immediately as it arrives (e.g. stock market data, cricket scores).
Therefore, what today's market needs is fast ad-hoc (interactive) queries on streaming (continuously flowing) data.
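The difference between batch and streaming can be sketched in a few lines of Python. This is only an illustration and does not use any Hadoop API: the function names `batch_average` and `streaming_average` and the sample prices are made up for the example. The batch version waits for the full data set, while the streaming version emits an updated result for every record as it arrives.

```python
def batch_average(prices):
    """Batch: wait until all data is collected, then process it at once."""
    return sum(prices) / len(prices)


def streaming_average(price_stream):
    """Streaming: update the result immediately as each record arrives."""
    count = 0
    total = 0.0
    for price in price_stream:
        count += 1
        total += price
        # Emit a fresh running average after every incoming record.
        yield total / count


if __name__ == "__main__":
    ticks = [10.0, 12.0, 14.0]  # hypothetical stock prices
    print(batch_average(ticks))                    # 12.0
    print(list(streaming_average(iter(ticks))))    # [10.0, 11.0, 12.0]
```

The final streaming value equals the batch result; what differs is that the streaming version has a usable answer after every record instead of only at the end.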
Pattern vs Framework: A pattern is a way of doing something. A framework, on the other hand, is an implementation of a pattern.
Task Parallelism vs Data Parallelism: Task parallelism distributes different tasks across multiple processors (CPUs) in a parallel computing environment.
As a simple example, if we are running code on a 2-processor system (CPUs “a” & “b”) in a parallel environment and we wish to do tasks “A” and “B”, it is possible to tell CPU “a” to do task “A” and CPU “b” to do task “B” simultaneously.
Data parallelism is achieved when each processor (CPU) performs the same task on different pieces of distributed data.
For instance, consider a 2-processor system (CPUs A and B) in a parallel environment where we wish to perform a task on some data “d”. It is possible to tell CPU A to perform that task on one part of “d” and CPU B to perform it on another part simultaneously.
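The two forms of parallelism can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration, not how Storm or Spark implement it: two worker threads stand in for the two CPUs, and the names `task_a`, `task_b`, `run_task_parallel` and `run_data_parallel` are hypothetical. In CPython, threads show the structure of the idea rather than a real CPU speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a(numbers):
    """Task "A": sum the numbers."""
    return sum(numbers)


def task_b(numbers):
    """Task "B": find the largest number."""
    return max(numbers)


def run_task_parallel(numbers):
    # Task parallelism: two *different* tasks run at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(task_a, numbers)
        future_b = pool.submit(task_b, numbers)
        return future_a.result(), future_b.result()


def run_data_parallel(numbers):
    # Data parallelism: the *same* task runs on different parts of the data "d".
    mid = len(numbers) // 2
    parts = [numbers[:mid], numbers[mid:]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        partial_sums = list(pool.map(task_a, parts))
    # Combine the partial results into the final answer.
    return sum(partial_sums)


if __name__ == "__main__":
    d = list(range(1, 11))
    print(run_task_parallel(d))  # (55, 10)
    print(run_data_parallel(d))  # 55
```

In the task-parallel case each worker does a different job; in the data-parallel case both workers do the same job on their own slice, and the partial results are combined afterwards.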