Hadoop: First Example Project using Eclipse: Part-4

Download and Install Eclipse

Create New Java Project

Add Hadoop Dependencies JARs

After downloading Hadoop, add all of the JAR files in its lib folder to the project's build path. Right-click the project, select Properties, then Java Build Path, and add the JARs there.

The Word count example

We’re going to create a simple word count example. Given a text file, the program counts how many times each word occurs in it. It consists of three classes:

WordCountMapper.java: the mapper.
WordCountReducer.java: the reducer.
WordCountDriver.java: the driver. Job configuration (input type, output type, job…) is done here.
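
The division of work between these classes can be sketched in plain Java before any Hadoop code is involved. This is only an illustration under stated assumptions, not part of the project: the class name WordCountSketch and its method names are made up for this example, and a TreeMap stands in for the grouping by key that Hadoop performs between the map and reduce phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration class: plain Java, no Hadoop classes involved.
class WordCountSketch {

    // "Map" step: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group the pairs by word and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs together.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String line : lines) {
            allPairs.addAll(map(line));
        }
        return reduce(allPairs);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("paris london paris", "london rome");
        System.out.println(countWords(lines)); // prints {london=2, paris=2, rome=1}
    }
}
```

In the real job, the map step runs in WordCountMapper, the summing runs in WordCountReducer, and Hadoop itself does the grouping across machines.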

WordCountMapper.java

 package com.hadoop.sample;  
   
 import java.io.IOException;  
 import java.util.StringTokenizer;  
   
 import org.apache.hadoop.io.*;  
 import org.apache.hadoop.mapred.*;  
   
 public class WordCountMapper extends MapReduceBase implements  
           Mapper<LongWritable, Text, Text, IntWritable> {  
        
      // Hadoop Writable types, created once and reused across map() calls
      private final static IntWritable one = new IntWritable(1);  
      private Text word = new Text();  
   
      // map method: tokenizes one input line and emits a (word, 1) pair for each token
      public void map(LongWritable key, Text value,  
                OutputCollector<Text, IntWritable> output, Reporter reporter)  
                throws IOException {  
           // take one line at a time and tokenize it
           String line = value.toString();  
           StringTokenizer tokenizer = new StringTokenizer(line);  
   
           // iterate over the words in the line, forming a key-value pair for each
           while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // send the pair to the output collector, which in turn passes it to the reducer
                output.collect(word, one);  
           }  
      }  
 }
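
One consequence of the mapper's use of StringTokenizer is worth seeing in isolation: with its default delimiters it splits on whitespace only, so punctuation stays attached to a token and case is preserved, and "Paris," and "paris" end up as different keys. The sketch below shows this in plain Java. The class TokenizerSketch and the normalize helper are hypothetical and are not part of the original mapper; they only illustrate one possible clean-up step, if that behaviour is not what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical illustration class: plain Java, no Hadoop classes involved.
class TokenizerSketch {

    // The same splitting the mapper performs: default delimiters are whitespace only.
    static List<String> rawTokens(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Assumed clean-up (not in the original mapper): lower-case the token and
    // strip characters that are neither letters nor digits from both ends.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    public static void main(String[] args) {
        List<String> raw = rawTokens("Paris, paris (Rome)");
        System.out.println(raw); // prints [Paris,, paris, (Rome)]

        List<String> cleaned = new ArrayList<>();
        for (String token : raw) {
            String word = normalize(token);
            if (!word.isEmpty()) { // skip tokens that were pure punctuation
                cleaned.add(word);
            }
        }
        System.out.println(cleaned); // prints [paris, paris, rome]
    }
}
```

If you adopted something like this, the call would go between tokenizer.nextToken() and word.set(...) in the mapper; the example as published counts the raw tokens.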

WordCountReducer.java

 package com.hadoop.sample;  
   
 import java.io.IOException;  
 import java.util.Iterator;  
   
 import org.apache.hadoop.io.*;  
 import org.apache.hadoop.mapred.*;  
   
 public class WordCountReducer extends MapReduceBase implements  
           Reducer<Text, IntWritable, Text, IntWritable> {  
      /*
       * The reduce method accepts the key-value pairs from the mappers,
       * aggregates them by key, and produces the final output.
       */
        
      public void reduce(Text key, Iterator<IntWritable> values,  
                OutputCollector<Text, IntWritable> output, Reporter reporter)  
                throws IOException {  
           int sum = 0;  
           /*
            * Iterate through all the values for a key, add them together,
            * and emit the key with the sum of its values as the final result.
            */
           while (values.hasNext()) {  
                sum += values.next().get();  
           }  
           output.collect(key, new IntWritable(sum));  
      }  
 }

WordCountDriver.java

 package com.hadoop.sample;  
   
 import org.apache.hadoop.fs.Path;  
 import org.apache.hadoop.conf.*;  
 import org.apache.hadoop.io.*;  
 import org.apache.hadoop.mapred.*;  
 import org.apache.hadoop.util.*;  
   
 public class WordCountDriver extends Configured implements Tool {  
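      public int run(String[] args) throws Exception {
           // expect exactly two arguments: the HDFS input and output paths
           if (args.length != 2) {
                System.err.println("Usage: WordCountDriver <input path> <output path>");
                return -1;
           }

           // create a JobConf object and assign a job name for identification purposes
           JobConf conf = new JobConf(getConf(), WordCountDriver.class);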
           conf.setJobName("WordCount");  
   
           // set the data types of the output key and value
           conf.setOutputKeyClass(Text.class);  
           conf.setOutputValueClass(IntWritable.class);  
   
           // provide the mapper and reducer classes
           conf.setMapperClass(WordCountMapper.class);  
           conf.setReducerClass(WordCountReducer.class);  
   
           // HDFS input and output directories, taken from the command line
           FileInputFormat.addInputPath(conf, new Path(args[0]));  
           FileOutputFormat.setOutputPath(conf, new Path(args[1]));  
   
           JobClient.runJob(conf);  
           return 0;  
      }  
   
      public static void main(String[] args) throws Exception {  
           int res = ToolRunner.run(new Configuration(), new WordCountDriver(), args);  
           System.exit(res);  
      }  
 }

When you are finished, the project should contain these three classes in the com.hadoop.sample package.

Export the project to a JAR file (samplehadoop.jar). Right-click the project and select “Export”.

Follow these steps to run the job on HadoopNameNode

Copy samplehadoop.jar and the input file PLACES.TXT (paste the content of the downloaded places-txt.docx into PLACES.TXT) from Windows to a directory on the Linux LFS (Local File System), here /home/ubuntu/wordcount/, using FileZilla

Create an input directory in HDFS (Hadoop Distributed File System)

$ hadoop fs -mkdir /home/ubuntu/wordcount/input/

Copy the input file from the Linux LFS to HDFS

$ hadoop fs -copyFromLocal /home/ubuntu/wordcount/PLACES.TXT /home/ubuntu/wordcount/input/

Execute the JAR. The output directory must not already exist, or the job fails at startup

$ hadoop jar /home/ubuntu/wordcount/samplehadoop.jar com.hadoop.sample.WordCountDriver /home/ubuntu/wordcount/input/ /home/ubuntu/wordcount/output/

Once the job reports success, the output file (part-00000) appears in the output directory

$ hadoop fs -ls /home/ubuntu/wordcount/output/

To investigate the output file further, copy it from HDFS to the LFS, and from there to the desired Windows location

$ hadoop fs -copyToLocal /home/ubuntu/wordcount/output/ /home/ubuntu/wordcount/
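
Once the output is on the local file system, it can also be inspected with a few lines of Java. The sketch below assumes the default text output format, which the driver above does not override: one line per word, with the word and its count separated by a tab. The class OutputParserSketch and its method names are made up for this example, and the sample lines used when no file is given are invented data, not real job output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class: reads word-count output lines of the form word<TAB>count.
class OutputParserSketch {

    // Parses each line into (word, count); lines without a tab are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // blank or malformed line
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    // Returns up to n words, most frequent first; ties keep their file order.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, read that file; otherwise fall back to invented sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("london\t2", "paris\t3", "rome\t1");
        Map<String, Integer> counts = parse(lines);
        System.out.println("distinct words: " + counts.size());
        System.out.println("top words: " + topWords(counts, 2));
    }
}
```

Assuming the copy above placed the file at /home/ubuntu/wordcount/output/part-00000, that path would be passed as the single argument.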

Then copy it from the Linux LFS to the desired Windows location

DO NOT forget to terminate (delete) all 4 EC2 instances that you created.

Select all 4 EC2 instances, right-click, and select Instance State > Terminate

That’s it for this four-part series. I hope you found it useful and enjoyed it.

Happy Hadoop Year!!
