Data Warehouse: Teradata Vs Hadoop

Teradata is a horizontally scalable relational database management system (RDBMS). More precisely, it is a Massively Parallel Processing (MPP) database system: it runs on a cluster of commodity hardware (computers) called “shared-nothing” nodes (each node has its own CPU, memory, and disks and processes data locally), connected through a high-speed interconnect. It combines horizontal partitioning of relational tables with parallel execution of SQL queries.

The idea behind horizontal partitioning is to distribute the rows of a relational table across the nodes of the cluster so they can be processed in parallel. For example, partitioning a 10-million-row table across a cluster of 50 nodes, each with four disks, would place 50,000 rows on each of the 200 disks.
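The arithmetic in this example can be checked with a small Python sketch. The `place_row` and `rows_per_disk` helpers are hypothetical names, and the plain modulo placement is an assumption made for illustration; a real MPP system would typically hash a key column instead. The row count is scaled down so the sketch runs quickly.

```python
from collections import Counter

def place_row(row_key: int, nodes: int, disks_per_node: int) -> tuple[int, int]:
    """Map a row key to a (node, disk) pair using simple modulo placement."""
    slot = row_key % (nodes * disks_per_node)
    # divmod gives (node index, disk index within that node)
    return divmod(slot, disks_per_node)

def rows_per_disk(total_rows: int, nodes: int, disks_per_node: int) -> Counter:
    """Count how many rows land on each (node, disk) pair."""
    return Counter(place_row(k, nodes, disks_per_node) for k in range(total_rows))

# Scaled-down version of the example: 100,000 rows over 50 nodes x 4 disks.
counts = rows_per_disk(100_000, nodes=50, disks_per_node=4)
print(len(counts), min(counts.values()), max(counts.values()))

# The full-size figure from the text follows from the same division.
print(10_000_000 // (50 * 4))
```

With sequential keys the rows spread evenly over all 200 disks, which is the property that lets every disk work on its own share of the table in parallel.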

Hadoop enables distributed parallel processing of huge amounts of data across clusters of inexpensive commodity machines that both store and process the data locally, and it scales horizontally by adding more machines. A Hadoop cluster is likewise commonly described as “shared-nothing”.

Compared with the definition above, the only feature missing in Hadoop is SQL queries. Hive fills that gap in the Hadoop platform: because of its SQL-like query language, it is used as the interface to a Hadoop-based data warehouse.

In a very broad sense, then, we can say the two are similar, i.e. Teradata = Hadoop + Hive. There is, of course, much more to it.

I would like to explore some scenarios and discuss the ramifications of using one system over the other:

ETL Process

A Hadoop system essentially “cooks” raw data into useful information. Hence, a Hadoop system can also be considered a general-purpose parallel ETL system.
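To make the “cooking” idea concrete, here is a minimal single-process sketch of a map/shuffle/reduce job in Python. It only imitates the shape of a Hadoop job and does not use Hadoop itself; the log format, the sample lines, and the function names `map_phase`, `reduce_phase`, and `run_job` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw input: date, method, URL, status; one line is malformed.
RAW_LINES = [
    "2024-01-01 GET /home 200",
    "2024-01-01 GET /cart 500",
    "malformed line",
    "2024-01-02 GET /home 200",
]

def map_phase(line):
    """Extract and transform: emit (url, 1) for each well-formed line."""
    parts = line.split()
    if len(parts) == 4:
        _date, _method, url, _status = parts
        yield url, 1

def reduce_phase(key, values):
    """Aggregate all values emitted for one key."""
    return key, sum(values)

def run_job(lines):
    groups = defaultdict(list)
    for line in lines:                       # map over the raw input
        for key, value in map_phase(line):
            groups[key].append(value)        # shuffle: group values by key
    # reduce: one aggregate per key, ready to load into a warehouse table
    return dict(reduce_phase(k, v) for k, v in groups.items())

hits = run_job(RAW_LINES)
print(hits)
```

The malformed line is dropped in the map step, and what comes out is a clean per-URL count, which is the kind of cleaned, aggregated output an ETL step would hand on to the warehouse.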

For parallel DBMSs (Teradata), many products, mostly commercial, perform ETL, including Ascential, Informatica, Jaspersoft, and Talend. The market is large, as almost all major enterprises use ETL systems to load large quantities of data into data warehouses. One reason for this symbiotic relationship between ETL tools and DBMSs is the clear distinction in what each class of system provides to users: DBMSs do not try to do ETL, and ETL systems do not try to provide DBMS services.

Complex Analytics

In many data-mining and data-clustering applications, the program must make multiple passes over the data. Such applications cannot be structured as single SQL aggregate queries; they require instead a complex dataflow program in which the output of one part of the application is the input of another. Hadoop is a good candidate for such applications.
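A simple illustration of such a multi-pass computation is k-means clustering. The sketch below is plain Python on a tiny one-dimensional data set, not a Hadoop program; the function name `kmeans_1d`, the sample points, and the starting centers are assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, passes=10):
    """Each iteration is one full pass over the data.

    The output of a pass (the new centers) is the input of the next one.
    """
    for _ in range(passes):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign the point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: another pass changes nothing
            break
        centers = new_centers
    return centers

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)  # [2.0, 11.0]
```

Each pass depends on the centers produced by the previous one, so the whole computation cannot be collapsed into a single aggregate query over the table.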

Semi-structured Data

Unlike a DBMS, a Hadoop system does not require users to define a schema for their data. Thus, a Hadoop system can easily store and process what is known as “semi-structured” data. In our experience, such data often looks like key-value pairs, where the number of attributes present in any given record varies; this style of data is typical of Web traffic logs derived from disparate sources.
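As a rough sketch of what such key-value records look like in practice, the Python snippet below parses a few log lines whose attributes differ from record to record. The `key=value` line format, the sample lines, and the `parse_record` helper are assumptions made for illustration, not a format Hadoop prescribes.

```python
def parse_record(line):
    """Parse whitespace-separated 'key=value' pairs; no schema is declared up front."""
    return dict(pair.split("=", 1) for pair in line.split() if "=" in pair)

# Hypothetical Web log lines; each record carries a different set of attributes.
LOG_LINES = [
    "ip=10.0.0.1 url=/home status=200",
    "ip=10.0.0.2 url=/cart status=500 referrer=/home",
    "ip=10.0.0.3 agent=curl",
]

records = [parse_record(line) for line in LOG_LINES]

# The set of attributes is discovered while reading, not fixed in advance.
all_keys = sorted({key for record in records for key in record})
# A missing attribute simply comes back as None.
statuses = [record.get("status") for record in records]

print(all_keys)
print(statuses)
```

In a relational table the third record would need NULLs for every missing column, and a new attribute would require a schema change; here the structure is interpreted only when the data is read.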

Limited-budget Cost

Another strength of Hadoop systems is that most are open-source projects available for free. Parallel DBMSs (Teradata), by contrast, are expensive.
