HDFS File Blocks Distribution in DataNodes

Background

When a file is written to HDFS, it is split into large chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case, left at the default of 64 MB). Each block is stored on one or more nodes, as controlled by the parameter dfs.replication in the same file (for most of this post, set to 3, which is the default). Each copy of a block is called a replica.
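
For reference, a minimal hdfs-site.xml fragment spelling out these two settings might look like this (the values shown are simply the defaults described above; note that newer Hadoop releases name the size property dfs.blocksize and default it to 128 MB):

    <configuration>
      <!-- Block size in bytes: 64 MB = 67108864 -->
      <property>
        <name>dfs.block.size</name>
        <value>67108864</value>
      </property>
      <!-- Number of replicas to keep for each block -->
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>

You can check where the replicas of a particular file actually landed with hdfs fsck /path/to/file -files -blocks -locations.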


HDFS File Block and Input Split

Blocks are a physical division of the data, while input splits are a logical division. One input split can map to multiple physical blocks. When Hadoop submits a job, it splits the input data logically, and each split is processed by its own Mapper task, so the number of Mappers equals the number of splits. One important thing to remember is that an InputSplit doesn't contain the actual data, only a reference (storage locations) to the data. A split basically has two things: a length in bytes and a set of storage locations (the hostnames of the nodes holding the underlying data), as the sketch below illustrates.
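
To make the "reference, not data" point concrete, here is a small Java sketch (the file path and host names are made up for illustration) that builds a FileSplit, the standard InputSplit implementation for file-based input formats, and prints the two pieces of information every InputSplit exposes:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitInfo {
        public static void main(String[] args) throws Exception {
            // A FileSplit records a reference to a file region (path,
            // offset, length) plus the hosts holding the block
            // replicas; it does not hold the data itself.
            FileSplit split = new FileSplit(
                    new Path("/data/input.txt"),               // hypothetical input file
                    0L,                                        // byte offset where the split starts
                    64L * 1024 * 1024,                         // split length: 64 MB
                    new String[] {"node1", "node2", "node3"}); // hypothetical replica hosts

            System.out.println("length   = " + split.getLength() + " bytes");
            for (String host : split.getLocations()) {
                System.out.println("location = " + host);
            }
        }
    }

In a real job you never construct splits yourself; the InputFormat (e.g. FileInputFormat.getSplits()) computes them from the file's block layout, and the scheduler uses the locations to run each Mapper close to its data.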
