
Monday, 23 May 2016

How can Sqoop be used in a Java program ?



  • Include the Sqoop JAR in the classpath of the Java program.
  • Build the necessary parameters programmatically, exactly as they would be passed on the command line.
  • Invoke the Sqoop.runTool() method with those parameters (see the sketch below).
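
A minimal sketch of this, assuming the Sqoop 1.x client library is on the classpath; the JDBC URL, credentials, table name and target directory are placeholders:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Arguments are built exactly as they would appear on the command line.
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost/mydb",   // placeholder JDBC URL
            "--username", "dbuser",                    // placeholder credentials
            "--password", "dbpass",
            "--table", "employees",                    // placeholder table
            "--target-dir", "/user/hadoop/employees"   // placeholder HDFS directory
        };
        // runTool() parses the arguments, runs the requested Sqoop tool and
        // returns the same exit code the command-line client would return.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}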


Is it mandatory to set input and output formats in a MapReduce job ?


No, it is not mandatory to set the input and output formats in a MapReduce job.
By default, the input format is TextInputFormat and the output format is TextOutputFormat.

Job.submit() vs. Job.waitForCompletion()


  • Job.submit() internally creates a submitter instance, submits the job to the cluster and returns immediately.
  • Job.waitForCompletion() submits the job and then polls its progress at a regular interval of one second. If the job completes successfully it prints a success message to the console, otherwise it prints a relevant error message (a sketch contrasting the two calls follows).
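
An illustrative sketch of the two calls; the job name is arbitrary and the mapper/reducer setup is omitted, so a real driver would configure those before submitting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");
        // ... mapper, reducer, input/output paths would be configured here ...

        // Non-blocking: submit() hands the job to the cluster and returns at once;
        // the caller must track progress itself (e.g. via job.isComplete()).
        // job.submit();

        // Blocking: waitForCompletion() submits the job (if not yet submitted),
        // polls its progress, prints it when verbose is true, and returns true on success.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}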


What is the use of Partitioner ?


It ensures that all the values for a single key go to the same reducer.

It directs the mapper output to the reducers by determining, for each intermediate key, which reducer is responsible for it. The default partitioner in Hadoop is the HashPartitioner, whose logic is sketched below.
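
For reference, the logic of the default HashPartitioner looks roughly like this (re-implemented here as a sketch):

import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of Hadoop's default HashPartitioner: every occurrence of a key
// hashes to the same partition number, so all its values reach the same reducer.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}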

How does the JobTracker work ?


When a client application submits a job to the JobTracker :

  • The JobTracker talks to the NameNode to find the location of the data
  • It locates TaskTracker nodes with available slots at or near the data and assigns the work to the chosen TaskTracker nodes
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails, and the JobTracker then decides what to do
  • The JobTracker may resubmit the task on another node, or it may mark that task as one to avoid

Friday, 20 May 2016

How many instances of a JobTracker run on a Hadoop cluster ?


Only one JobTracker process runs on any Hadoop cluster.
The JobTracker runs within its own JVM process.

How to write a custom partitioner for a Hadoop job ?


1. Create a new class that extends the Partitioner class
2. Override the getPartition() method
3. Add the custom partitioner to the job programmatically using the setPartitionerClass() method,
    or
    add the custom partitioner to the job through a configuration file (a sketch follows below)
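
A hypothetical example of steps 1 and 2; the class name and the routing rule (keys starting with 'a'-'m' go to reducer 0, the rest to reducer 1) are purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: assumes the job runs two reducers.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        // Keys a-m go to partition 0, everything else to partition 1;
        // the modulo keeps the result valid if fewer reducers are configured.
        return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
}

In the driver (step 3) it would then be registered with job.setPartitionerClass(AlphabetPartitioner.class) and, for this sketch, job.setNumReduceTasks(2).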

Wednesday, 18 May 2016

What is the use of Combiners ?


Combiners 

  • increase the efficiency of a MapReduce program by pre-aggregating map output locally
  • reduce the amount of data that has to be transferred to the Reducer
  • a Reducer can be used as-is as a Combiner if the operation it performs is commutative and associative (see the sketch below)
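
A sketch of a sum reducer in the style of the classic word count; because addition is commutative and associative, this same class can double as the combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all values for a key; running it as a combiner on the map side
// pre-aggregates partial sums without changing the final result.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

In the driver it would be registered for both roles: job.setCombinerClass(IntSumReducer.class) and job.setReducerClass(IntSumReducer.class).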

What is the use of chain-mapper and chain-reducer ?


Chain mapper : lets several mappers run one after another within the map task, before the reducer
Chain reducer : lets one or more mappers run after the reducer, within the reduce task (a wiring sketch follows)
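
A sketch of how the new-API ChainMapper/ChainReducer helpers are wired; the identity Mapper and Reducer base classes are used only as stand-ins so the example is self-contained, and a real job would plug in its own subclasses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain-example");

        // Two mappers run one after the other inside the single map task.
        ChainMapper.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));

        // The reducer, followed by a mapper that runs after it in the reduce task.
        ChainReducer.setReducer(job, Reducer.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        ChainReducer.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
    }
}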

What are the main configuration parameters needed to run a MapReduce job ?


  • Input path
  • Output path
  • Input format (optional)
  • Output format (optional)
  • Mapper class : class containing the map function
  • Reducer class : class containing the reduce function
  • JAR : containing the Mapper, Reducer and Driver classes (a driver sketch setting all of these follows)
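
A hypothetical driver showing where each of these parameters is supplied; the pass-through MyMapper and MyReducer placeholders only keep the sketch self-contained:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyJobDriver {

    // Identity placeholders; a real job supplies its own map and reduce logic.
    public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> { }
    public static class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> { }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(MyJobDriver.class);                    // JAR with Mapper, Reducer and Driver

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path

        job.setInputFormatClass(TextInputFormat.class);          // optional: TextInputFormat is the default
        job.setOutputFormatClass(TextOutputFormat.class);        // optional: TextOutputFormat is the default

        job.setMapperClass(MyMapper.class);                      // class containing the map function
        job.setReducerClass(MyReducer.class);                    // class containing the reduce function

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}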


What are the methods of the Mapper class ?


1. protected void setup(Context context)
2. protected void cleanup(Context context)
3. protected void map(KEYIN key, VALUEIN value, Context context)
4. public void run(Context context)  (a skeleton overriding the first three follows)
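
A sketch of a Mapper overriding the three commonly used lifecycle methods; the class name and the line-length logic are illustrative, and run() is left to the framework's default implementation:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text("line-length");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called once per task before any map() call, e.g. to read configuration.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record.
        context.write(word, new IntWritable(value.getLength()));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once per task after the last map() call, e.g. to release resources.
    }
}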

What protocol is used in shuffle & sort in MapReduce ?


HTTP. The reducers fetch the map output files over HTTP.

What is the utility of the heartbeat ?


The heartbeat from the TaskTracker to the JobTracker reports the status of the running tasks.

The TaskTracker sends a heartbeat to the JobTracker at regular intervals, and the heartbeat also indicates whether the TaskTracker can accept new tasks for execution.
The JobTracker then consults the Scheduler to assign tasks to that TaskTracker and sends the list of assigned tasks back as the heartbeat response.

What will happen if the JobTracker doesn't receive any heartbeat from a TaskTracker ?


It assumes that the TaskTracker is down and resubmits the corresponding tasks to another node in the cluster.

What are the methods of the Reducer class ?


1. protected void setup(Context context)
2. protected void cleanup(Context context)
3. protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
4. public void run(Context context)  (a skeleton overriding the first three follows)
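
A matching Reducer skeleton; the class name and the max-value logic are illustrative, and run() is again left to the framework's default implementation:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called once per task before the first reduce() call.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of that key's values.
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once per task after the last reduce() call.
    }
}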

What is "Partitioning, Shuffle and sort" phase after finishing Map phase ?



  • Partitioning
    • determines which reducer instance will receive which intermediate keys and values.
    • For any key, regardless of which mapper instance generated it, the destination partition must be the same.
  • Shuffle
    • the process of moving map outputs to the reducers.
    • As soon as the first map tasks complete (the nodes may still be running several more map tasks each), the intermediate outputs begin to be exchanged and moved to the nodes where the reducers require them.
  • Sort
    • the set of intermediate keys on a single reduce node is automatically sorted by Hadoop before being presented to the Reducer.


What is the use of the Context object ?


The Context object

  • carries the configuration details for the job
  • allows mappers and reducers to interact with the rest of the Hadoop system
  • is used for :
    • updating counters
    • reporting progress
    • providing application-level status updates (as sketched below)

What are the most commonly used input formats in Hadoop ?


1. TextInputFormat : the default input format ; key = byte offset of the line, value = the line itself

2. KeyValueTextInputFormat : for plain text files where each line is split into a key and a value (at the first tab by default)

3. SequenceFileInputFormat : used for reading Hadoop sequence files (a driver sketch selecting each of these follows)
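
A sketch of choosing an input format in the driver; only one of the three calls would be kept in a real job, the last call wins here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatChoice {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");

        // 1. Default: key = byte offset of the line, value = the line itself.
        job.setInputFormatClass(TextInputFormat.class);

        // 2. Each line is split into key and value at the first tab character.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // 3. Binary key/value pairs stored in Hadoop sequence files.
        job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}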

Tuesday, 17 May 2016

Namenode vs. Backup node vs. Checkpoint Namenode



  • NameNode
    • manages the metadata, i.e. information about all the files present in HDFS on a Hadoop cluster
    • uses two files for the namespace :
      • fsimage : keeps track of the latest checkpoint of the namespace
      • edit log : a log of the changes that have been made to the namespace since the last checkpoint
  • Checkpoint Node
    • has the same directory structure as the NameNode
    • creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files from the NameNode and merging them
    • keeps track of the latest checkpoint
  • Backup node
    • provides the same checkpointing functionality as the Checkpoint node
    • additionally maintains an up-to-date, in-memory copy of the file system namespace that is always in sync with the active NameNode

Input split vs. Block


  • A block is the physical division of the data; it does not take the logical boundaries of records into account.
  • An input split is the logical division of the data, and it does respect record boundaries. Split size can be tuned per job, independently of the HDFS block size (see the sketch below).
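
A sketch illustrating that splits are a logical, per-job setting while the block size is a physical HDFS property; the 128 MB and 256 MB figures are arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");

        // Ask the input format for splits between ~128 MB and ~256 MB,
        // regardless of how the file is physically divided into HDFS blocks.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}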