
Monday, 23 May 2016

How can Sqoop be used in a Java program ?



  • Include the Sqoop JAR in the classpath of the Java program.
  • Build the necessary parameters programmatically, exactly as they would be passed on the command line.
  • Invoke the Sqoop.runTool() method with those parameters (see the sketch below).
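
A minimal sketch of this, assuming the Sqoop 1.x client library is on the classpath; the JDBC URL, credentials, table name and target directory are placeholders:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Arguments are built exactly as they would appear on the command line.
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost/mydb",   // placeholder JDBC URL
            "--username", "dbuser",                    // placeholder credentials
            "--password", "dbpass",
            "--table", "employees",                    // placeholder table
            "--target-dir", "/user/hadoop/employees"   // placeholder HDFS directory
        };
        // runTool() parses the arguments, runs the requested Sqoop tool and
        // returns the same exit code the command-line client would return.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}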


Is it mandatory to set input and output formats in a MapReduce job ?


No, it is not mandatory to set the input and output formats in a MapReduce job.
By default, the input format is TextInputFormat and the output format is TextOutputFormat.

Job.submit() vs. Job.waitForCompletion()


  • Job.submit() internally creates a submitter instance, submits the job to the cluster and returns immediately.
  • Job.waitForCompletion() submits the job and then polls its progress at a regular interval of one second. If the job completes successfully it prints a success message to the console, otherwise it prints a relevant error message (a sketch contrasting the two calls follows).
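
An illustrative sketch of the two calls; the job name is arbitrary and the mapper/reducer setup is omitted, so a real driver would configure those before submitting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");
        // ... mapper, reducer, input/output paths would be configured here ...

        // Non-blocking: submit() hands the job to the cluster and returns at once;
        // the caller must track progress itself (e.g. via job.isComplete()).
        // job.submit();

        // Blocking: waitForCompletion() submits the job (if not yet submitted),
        // polls its progress, prints it when verbose is true, and returns true on success.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}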


What is the use of Partitioner ?


It ensures that all the values for a single key go to the same reducer.

It directs the mapper output to the reducers by determining, for each intermediate key, which reducer is responsible for it. The default partitioner in Hadoop is the HashPartitioner, whose logic is sketched below.
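
For reference, the logic of the default HashPartitioner looks roughly like this (re-implemented here as a sketch):

import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of Hadoop's default HashPartitioner: every occurrence of a key
// hashes to the same partition number, so all its values reach the same reducer.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}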

How does the JobTracker work ?


When a client application submits a job to the JobTracker :

  • The JobTracker talks to the NameNode to find the location of the data
  • It locates TaskTracker nodes with available slots at or near the data and assigns the work to the chosen TaskTracker nodes
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails, and the JobTracker then decides what to do
  • The JobTracker may resubmit the task on another node, or it may mark that task as one to avoid

Friday, 20 May 2016

How many instances of a JobTracker run on a Hadoop cluster ?


Only one JobTracker process runs on any Hadoop cluster.
The JobTracker runs within its own JVM process.

How to write a custom partitioner for a Hadoop job ?


1. Create a new class that extends the Partitioner class
2. Override the getPartition() method
3. Add the custom partitioner to the job programmatically using the setPartitionerClass() method,
    or
    add the custom partitioner to the job through a configuration file (a sketch follows below)
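
A hypothetical example of steps 1 and 2; the class name and the routing rule (keys starting with 'a'-'m' go to reducer 0, the rest to reducer 1) are purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: assumes the job runs two reducers.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        // Keys a-m go to partition 0, everything else to partition 1;
        // the modulo keeps the result valid if fewer reducers are configured.
        return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
}

In the driver (step 3) it would then be registered with job.setPartitionerClass(AlphabetPartitioner.class) and, for this sketch, job.setNumReduceTasks(2).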

Wednesday, 18 May 2016

What is the use of Combiners ?


Combiners 

  • increase the efficiency of a MapReduce program by pre-aggregating map output locally
  • reduce the amount of data that has to be transferred to the Reducer
  • a Reducer can be used as-is as a Combiner if the operation it performs is commutative and associative (see the sketch below)
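
A sketch of a sum reducer in the style of the classic word count; because addition is commutative and associative, this same class can double as the combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all values for a key; running it as a combiner on the map side
// pre-aggregates partial sums without changing the final result.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

In the driver it would be registered for both roles: job.setCombinerClass(IntSumReducer.class) and job.setReducerClass(IntSumReducer.class).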

What is the use of chain-mapper and chain-reducer ?


Chain mapper : lets several mappers run one after another within the map task, before the reducer
Chain reducer : lets one or more mappers run after the reducer, within the reduce task (a wiring sketch follows)
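
A sketch of how the new-API ChainMapper/ChainReducer helpers are wired; the identity Mapper and Reducer base classes are used only as stand-ins so the example is self-contained, and a real job would plug in its own subclasses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain-example");

        // Two mappers run one after the other inside the single map task.
        ChainMapper.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));

        // The reducer, followed by a mapper that runs after it in the reduce task.
        ChainReducer.setReducer(job, Reducer.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        ChainReducer.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
    }
}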

What are the main configuration parameters needed to run a MapReduce job ?


  • Input path
  • Output path
  • Input format (optional)
  • Output format (optional)
  • Mapper class : class containing the map function
  • Reducer class : class containing the reduce function
  • JAR : containing the Mapper, Reducer and Driver classes (a driver sketch setting all of these follows)
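
A hypothetical driver showing where each of these parameters is supplied; the pass-through MyMapper and MyReducer placeholders only keep the sketch self-contained:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyJobDriver {

    // Identity placeholders; a real job supplies its own map and reduce logic.
    public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> { }
    public static class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> { }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(MyJobDriver.class);                    // JAR with Mapper, Reducer and Driver

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path

        job.setInputFormatClass(TextInputFormat.class);          // optional: TextInputFormat is the default
        job.setOutputFormatClass(TextOutputFormat.class);        // optional: TextOutputFormat is the default

        job.setMapperClass(MyMapper.class);                      // class containing the map function
        job.setReducerClass(MyReducer.class);                    // class containing the reduce function

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}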


What are the methods of the Mapper class ?


1. protected void setup(Context context)
2. protected void cleanup(Context context)
3. protected void map(KEYIN key, VALUEIN value, Context context)
4. public void run(Context context)  (a skeleton overriding the first three follows)
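
A sketch of a Mapper overriding the three commonly used lifecycle methods; the class name and the line-length logic are illustrative, and run() is left to the framework's default implementation:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text("line-length");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called once per task before any map() call, e.g. to read configuration.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record.
        context.write(word, new IntWritable(value.getLength()));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once per task after the last map() call, e.g. to release resources.
    }
}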

What protocol is used in shuffle & sort in MapReduce ?


HTTP. The reducers fetch the map output files over HTTP.

What is the utility of the heartbeat ?


The heartbeat from the TaskTracker to the JobTracker reports the status of the running tasks.

The TaskTracker sends a heartbeat to the JobTracker at regular intervals, and the heartbeat also indicates whether the TaskTracker can accept new tasks for execution.
The JobTracker then consults the Scheduler to assign tasks to that TaskTracker and sends the list of assigned tasks back as the heartbeat response.

What will happen if the JobTracker doesn't receive any heartbeat from a TaskTracker ?


It assumes that the TaskTracker is down and resubmits the corresponding tasks to another node in the cluster.

What are the methods of the Reducer class ?


1. protected void setup(Context context)
2. protected void cleanup(Context context)
3. protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
4. public void run(Context context)  (a skeleton overriding the first three follows)
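
A matching Reducer skeleton; the class name and the max-value logic are illustrative, and run() is again left to the framework's default implementation:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called once per task before the first reduce() call.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of that key's values.
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once per task after the last reduce() call.
    }
}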

What is "Partitioning, Shuffle and sort" phase after finishing Map phase ?



  • Partitioning
    • determines which reducer instance will receive which intermediate keys and values.
    • For any key, regardless of which mapper instance generated it, the destination partition must be the same.
  • Shuffle
    • the process of moving map outputs to the reducers.
    • As soon as the first map tasks complete (the nodes may still be running several more map tasks each), the intermediate outputs begin to be exchanged and moved to the nodes where the reducers require them.
  • Sort
    • the set of intermediate keys on a single reduce node is automatically sorted by Hadoop before being presented to the Reducer.


What is the use of the Context object ?


The Context object

  • carries the configuration details for the job
  • allows mappers and reducers to interact with the rest of the Hadoop system
  • is used for :
    • updating counters
    • reporting progress
    • providing application-level status updates (as sketched below)

What are the most commonly used input formats in Hadoop ?


1. TextInputFormat : the default input format ; key = byte offset of the line, value = the line itself

2. KeyValueTextInputFormat : for plain text files where each line is split into a key and a value (at the first tab by default)

3. SequenceFileInputFormat : used for reading Hadoop sequence files (a driver sketch selecting each of these follows)
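
A sketch of choosing an input format in the driver; only one of the three calls would be kept in a real job, the last call wins here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatChoice {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");

        // 1. Default: key = byte offset of the line, value = the line itself.
        job.setInputFormatClass(TextInputFormat.class);

        // 2. Each line is split into key and value at the first tab character.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // 3. Binary key/value pairs stored in Hadoop sequence files.
        job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}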

Tuesday, 17 May 2016

Namenode vs. Backup node vs. Checkpoint Namenode



  • NameNode
    • manages the metadata, i.e. information about all the files present in HDFS on a Hadoop cluster
    • uses two files for the namespace :
      • fsimage : keeps track of the latest checkpoint of the namespace
      • edit log : a log of the changes that have been made to the namespace since the last checkpoint
  • Checkpoint Node
    • has the same directory structure as the NameNode
    • creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files from the NameNode and merging them
    • keeps track of the latest checkpoint
  • Backup node
    • provides the same checkpointing functionality as the Checkpoint node
    • additionally maintains an up-to-date, in-memory copy of the file system namespace that is always in sync with the active NameNode

Input split vs. Block


  • A block is the physical division of the data; it does not take the logical boundaries of records into account.
  • An input split is the logical division of the data, and it does respect record boundaries. Split size can be tuned per job, independently of the HDFS block size (see the sketch below).
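
A sketch illustrating that splits are a logical, per-job setting while the block size is a physical HDFS property; the 128 MB and 256 MB figures are arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");

        // Ask the input format for splits between ~128 MB and ~256 MB,
        // regardless of how the file is physically divided into HDFS blocks.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}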