Tuesday, 31 May 2016

How to run Logstash using a configuration file ?


Go to the installation location of Logstash.
> cd /home/osadmin/ELK/logstash-2.3.2
> ./bin/logstash -f  myapp-logstash.conf

If the configuration is valid, Logstash should start up.

Note : To verify, append some log lines to the log file specified in the configuration ("/var/log/test.log").
Log messages in JSON format should appear on the console.
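
The post does not show myapp-logstash.conf itself. As a rough sketch, a minimal configuration for the /var/log/test.log example mentioned above might look like the following (the grok pattern, ES host and codec are placeholders, not the actual file) :

# myapp-logstash.conf (hypothetical sketch)
input {
  file {
    path => "/var/log/test.log"        # log file to watch
    start_position => "beginning"      # read existing content on the first run
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}\|%{GREEDYDATA:data}" }
  }
}
output {
  elasticsearch { hosts => ["10.170.208.53:9200"] }   # push events to ES
  stdout { codec => rubydebug }                       # also print each event on the console
}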

How to install Logstash ?


Install Logstash

USING YUM
a. Import the Elasticsearch public GPG key
> rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
      b. Create a file named logstash.repo (with a .repo suffix) under the directory /etc/yum.repos.d/ and add the following to it :

[logstash-2.2]
name=Logstash repository for 2.2.x packages
baseurl=http://packages.elastic.co/logstash/2.2/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

c. Install
> yum install logstash

OR


USING ZIP
a. Download the ZIP and extract it.
[root@CSRToolTest1 logstash-2.3.2]# pwd
/home/osadmin/ELK/logstash-2.3.2

[root@CSRToolTest1 logstash-2.2.0]# ll
total 160
drwxrwxrwx 2 osadmin osadmin   4096 Feb 12 07:36 bin
-rwxrwxrwx 1 osadmin osadmin 100382 Feb 12 07:36 CHANGELOG.md
-rwxrwxrwx 1 osadmin osadmin   2249 Feb 12 07:36 CONTRIBUTORS
-rwxrwxrwx 1 osadmin osadmin   3844 Feb 12 07:36 Gemfile
-rwxrwxrwx 1 osadmin osadmin  22304 Feb 12 07:36 Gemfile.jruby-1.9.lock
drwxrwxrwx 4 osadmin osadmin   4096 Feb 12 07:36 lib
-rwxrwxrwx 1 osadmin osadmin    589 Feb 12 07:36 LICENSE
-rwxrwxrwx 1 osadmin osadmin    149 Feb 12 07:36 NOTICE.TXT
-rwxrwxrwx 1 osadmin osadmin     88 Feb 12 10:29 test-logstash-filter.conf
drwxrwxrwx 4 osadmin osadmin   4096 Feb 12 07:37 vendor

Monday, 30 May 2016

What will happen if ES is installed with a non-root user and you try to start it with the root user ?


Problem
Following error occurred :
[root@CSRToolTest1 bin]# Exception in thread "main" java.lang.RuntimeException: don't run elasticsearch as root.

Possible cause 
You installed ES with a non-root user and are trying to start it as root; Elasticsearch refuses to run as the root user.

Solution


Start ES with the non-root user that installed it.

What will happen if ES is installed with the root user and you try to start it with a non-root user ?



Problem
Following error occurred :
java.io.FileNotFoundException: /home/shaan/ELK/elasticsearch-2.2.0/logs/my-application-shaandev.log (Permission denied)

Possible cause 
ES was installed (and first run) using the root user, so the <cluster.name>.log file was created and is owned by root.

Now, when you try to start ES with the proper (non-root) user, it is not able to access this log file created by root.

Solution

Delete the file <cluster.name>.log (here, my-application-shaandev.log) or change its ownership/permissions, then start ES again.

How to create GROK patterns ?


Use online Grok Debugger : http://grokdebug.herokuapp.com/

Provide the input and the pattern; it will parse the input and show the extracted fields as a JSON string.
Always build the pattern in small steps so that you end up with the correct pattern.


Grok pattern examples

Data
2016-03-11 09:25:21,165|INFO |com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090

Pattern #1

%{GREEDYDATA:data}

Output
{
  "data": [
    [
      "2016-03-11 09:25:21,165|INFO |com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090"
    ]
  ]
}


Pattern #2
%{TIMESTAMP_ISO8601:timestamp}\|%{GREEDYDATA:data}

Output
{
  "timestamp": [
    [
      "2016-03-11 09:25:21,165"
    ]
  ],
  "YEAR": [
    [
      "2016"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],

  ...

  "data": [
    [
"INFO |com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090"

    ]
  ]
}


Pattern #3
%{TIMESTAMP_ISO8601:timestamp}\|%{DATA:logLevel}\|%{GREEDYDATA:data}

Output
{
  "timestamp": [
    [
      "2016-03-11 09:25:21,165"
    ]
  ],
  "YEAR": [
    [
      "2016"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],

  ...


  "logLevel": [
    [
      "INFO "
    ]
  ],
  "data": [
    [
"com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090"
    ]
  ]
}



Pattern #4
%{TIMESTAMP_ISO8601:timestamp}\|%{DATA:logLevel}\|%{DATA:eventSource}\|%{DATA:country}\|%{DATA:userId}\|%{DATA:role}\|%{DATA:logType}\|%{DATA:operation}\|%{INT:resultCode}\(%{DATA:result}\)\|%{NUMBER:timeConsumed}\|%{GREEDYDATA:data}

Output
{
  "timestamp": [
    [
      "2016-03-11 09:25:21,165"
    ]
  ],
  "YEAR": [
    [
      "2016"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],

  ...

  "logLevel": [
    [
      "INFO "
    ]
  ],
  "eventSource": [
    [
      "com.shaan.Logging [log]"
    ]
  ],
  "country": [
    [
      "India"
    ]
  ],
  "userId": [
    [
      "143011"
    ]
  ],

  "role": [
    [
      "Admin"
    ]
  ],
  "logType": [
    [
      "TECHNICAL"
    ]
  ],
  "operation": [
    [
      "Cash Out"
    ]
  ],
  "resultCode": [
    [
      "0"
    ]
  ],
  "result": [
    [
      "success"
    ]
  ],
  "timeConsumed": [
    [
      "34"
    ]
  ],
   "data": [
    [
      "MSISDN: 8883039090"
    ]
  ]
}

Why is Logstash not pushing data to ES ?


Problem
Data is not pushed by Logstash (when Logstash is launched using the config file)

Possible issue
The data may already have been pushed to ES.

Solution

  • Remove the Logstash index that was created
  • Launch Logstash
    • If the problem is still there :
      • Add one line at the end of the log file so that Logstash can pick up the change
      • Launch Logstash again


How to delete an index ?


$ curl -XDELETE http://<ip>:<port>/<index>

Example
$ curl -XDELETE http://10.170.208.53:9200/logstash-2016.05.03

How to find current indices created in ES ?


$ curl http://<ip>:<port>/_cat/indices

Example
curl http://10.170.208.53:9200/_cat/indices

open logstash-2016.05.03 1 1 24 6 75.2mb 75.2mb

open .kibana 1 1 24 6 75.2kb 75.2kb

In the above example, we have 2 indices.

Monday, 23 May 2016

How Sqoop can be used in a Java program ?



  • The Sqoop JAR must be included in the classpath of the Java program.
  • After this, the Sqoop.runTool() method must be invoked.
  • The necessary parameters are passed to Sqoop programmatically, just like command-line arguments, as shown in the sketch below.
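
A minimal sketch (Sqoop 1 assumed; the connection URL, credentials, table name and target directory are hypothetical placeholders — the Hadoop configuration typically also needs to be on the classpath for the import to reach the cluster) :

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Build the same arguments you would pass on the command line
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/mydb",   // hypothetical database
            "--username", "dbuser",
            "--password", "dbpass",
            "--table", "employees",                         // hypothetical table
            "--target-dir", "/user/osadmin/employees"
        };

        // Sqoop.runTool() runs the tool and returns the exit code (0 = success)
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop finished with exit code " + exitCode);
    }
}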


Is it mandatory to set input and output formats in a MapReduce job ?


No, it is not mandatory to set the input and output formats in MapReduce.
By default, both the input and output formats are 'text'.

Job.submit() vs. Job.waitForCompletion()


  • Job.submit() internally creates a submitter instance and submits the job, returning immediately.
  • Job.waitForCompletion() polls the job's progress at a regular interval of one second. If the job executes successfully, it displays a success message on the console; otherwise it displays the relevant error message. (See the sketch below.)
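
A small sketch of the difference; 'job' is assumed to be an already-configured org.apache.hadoop.mapreduce.Job :

import org.apache.hadoop.mapreduce.Job;

public class SubmitVsWait {

    // Fire-and-forget : returns immediately after submitting the job
    static void fireAndForget(Job job) throws Exception {
        job.submit();
    }

    // Blocking : submits, then polls progress every second and prints it (true = verbose);
    // returns true only if the job completed successfully
    static void submitAndWait(Job job) throws Exception {
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}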


What is the use of Partitioner ?


It makes sure that all the values of a single key go to the same reducer.

It routes the mapper output to the reducers by determining which reducer is responsible for a particular key.

How does the JobTracker work ?


When a client application submits a job to the JobTracker :

  • The JobTracker talks to the NameNode to find the location of the data
  • It locates TaskTracker nodes with available slots for the data and assigns the work to the chosen TaskTracker nodes
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do.
  • The JobTracker may resubmit the task on another node, or it may mark that task as one to avoid.

Friday, 20 May 2016

How many instances of a JobTracker run on Hadoop cluster ?


Only one JobTracker process runs on any Hadoop cluster.
The JobTracker runs within its own JVM process.

How to write a custom partitioner for a Hadoop job ?


1. Create a new class that extends the Partitioner class
2. Override the getPartition() method
3. Add the custom partitioner to the job programmatically using the setPartitionerClass() method (as sketched below)
    or
    add the custom partitioner to the job through a config file
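
A minimal sketch, assuming Text keys and IntWritable values (the class name is hypothetical) :

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// 1. Extend Partitioner
public class CountryPartitioner extends Partitioner<Text, IntWritable> {

    // 2. Override getPartition() : decide which reducer receives this key
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // simple hash-based routing, kept non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// 3. In the driver :
//    job.setPartitionerClass(CountryPartitioner.class);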

Wednesday, 18 May 2016

What is the use of Combiners ?


Combiners 

  • used to increase the efficiency of a MapReduce program
  • reduce the amount of data that has to be transferred to the Reducers
  • a Reducer can be used as-is as a Combiner if the operation performed is commutative and associative (see the snippet below)
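
For example, in a word-count style job where the reduce operation is a plain sum, the same Reducer class can be registered as the Combiner in the driver (the class names are hypothetical; 'job' is an org.apache.hadoop.mapreduce.Job) :

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // Reducer reused as a Combiner (sum is commutative and associative)
job.setReducerClass(WordCountReducer.class);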

What is the use of chain-mapper and chain-reducer ?


Chain mapper : enables multiple mappers to execute before the reducer
Chain reducer : enables multiple mappers to execute after the reducer
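
A hedged sketch of how such a chain [MAP1 -> MAP2 -> REDUCE -> MAP3] might be wired in the driver using the new-API ChainMapper/ChainReducer classes; FirstMapper, SecondMapper, MyReducer, PostMapper and the Text key/value types are hypothetical placeholders :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// inside the driver
Job job = Job.getInstance(new Configuration(), "chain-example");

// mappers that run before the reducer
ChainMapper.addMapper(job, FirstMapper.class,  Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainMapper.addMapper(job, SecondMapper.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));

// the single reducer, followed by a mapper that runs after it
ChainReducer.setReducer(job, MyReducer.class,  Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, PostMapper.class,  Text.class, Text.class, Text.class, Text.class, new Configuration(false));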

What are the main configuration parameters needed to run a Mapreduce Job ?


  • Input path
  • Output path
  • Input format (Optional)
  • Output format (Optional)
  • Mapper class : Class containing the map function
  • Reducer class : Class containing the reduce function
  • JAR : containing the Mapper, Reducer and Driver classes (a driver sketch wiring these parameters together is shown below)
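
A minimal driver sketch; the Mapper/Reducer class names and the paths are hypothetical, and the input/output format calls are shown even though they are optional :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");

        job.setJarByClass(WordCountDriver.class);           // JAR containing Mapper/Reducer/Driver

        job.setMapperClass(WordCountMapper.class);           // Mapper class (hypothetical)
        job.setReducerClass(WordCountReducer.class);         // Reducer class (hypothetical)

        job.setInputFormatClass(TextInputFormat.class);      // optional : text is the default
        job.setOutputFormatClass(TextOutputFormat.class);    // optional : text is the default

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}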


What are the methods of the Mapper class ?


1. protected void setup
2. protected void cleanup
3. protected void map
4. void run
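
The typical shape of a Mapper subclass in the new API; the word-count logic is only illustrative :

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // called once per task before any map() call, e.g. to read configuration
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);      // emit (word, 1)
        }
    }

    @Override
    protected void cleanup(Context context) {
        // called once per task after the last map() call, e.g. to release resources
    }

    // run() drives setup() -> map() per record -> cleanup() and is rarely overridden
}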

What protocol is used in shuffle & sort in MapReduce ?


HTTP

What is the utility of the heartbeat ?


The heartbeat from the TaskTracker to the JobTracker gives information about the status of tasks.

The TaskTracker sends a heartbeat to the JobTracker at regular intervals, and also indicates whether it can take new tasks for execution.
The JobTracker then consults the Scheduler to assign tasks to the TaskTracker and sends back the list of tasks as the heartbeat response to the TaskTracker.


What will happen if JobTracker doesn't receive any heartbeat from TaskTracker ?


It assumes that the TaskTracker is down and resubmits the corresponding tasks to other nodes in the cluster.

What are the methods of the Reducer class ?


1. protected void setup
2. protected void cleanup
3. protected void reduce
4. void run

What is "Partitioning, Shuffle and sort" phase after finishing Map phase ?



  • Partitioning
    • determines which reducer instance will receive which intermediate keys and values.
    • It is necessary that for any key (regardless of which mapper instance generated it) the destination partition is the same.
  • Shuffle
    • Process of moving map outputs to the reducers.
    • As soon as the first map tasks have completed (the nodes may still be performing several more map tasks each), the nodes begin exchanging the intermediate outputs from the map tasks, moving them to where they are required by the reducers.
  • Sort
    • The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.


What is the usage of Context object ?


Context object

  • has configuration details for the job
  • allows mappers/reducers to interact with the rest of the Hadoop system
  • used for (see the snippet below) :
    • updating counters
    • reporting progress
    • providing application-level status updates
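
A few examples of these Context calls inside a map() method (the property, group and counter names are hypothetical) :

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextDemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // configuration details for the job (property name is hypothetical)
        String appName = context.getConfiguration().get("app.name", "myapp");

        // update a custom counter
        context.getCounter(appName, "RECORDS_SEEN").increment(1);

        // report progress so a long-running task is not treated as hung
        context.progress();

        // application-level status, visible in the job UI
        context.setStatus("processing offset " + key.get());

        context.write(value, new IntWritable(1));
    }
}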


What are the most commonly defined input formats in Hadoop ?


1. Text Input Format : Default input format ; Key=Line offset , Value=Line

2. Key Value Input Format : for plain text files where the lines are broken into key and value.

3. Sequence File Input Format : used for reading sequence files

Tuesday, 17 May 2016

Namenode vs. Backup node vs. Checkpoint Namenode



  • NameNode
    • manages the metadata, i.e. information about all the files present in HDFS on a Hadoop cluster
    • uses 2 files for the namespace :
      • FS image : keeps track of the latest checkpoint of the namespace
      • edit logs : A log of changes that have been made to the namespace since the last checkpoint.
  • Checkpoint Node
    • Same structure as Namenode
    • creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them
    • keeps track of the latest checkpoint
  • Backup node
    • Also provides checkpoint functionality as the Checkpoint node
    • Additionally, maintains up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.

How to check the file system in HDFS ?


Use the fsck command, which checks the health of files, block names and block locations.

# hdfs fsck /dir/hadoop-test -files -blocks -locations

Input split vs. Block


  • Block is physical division of data which doesn't consider logical boundaries of records
  • Input split is the logical division which considers logical boundaries of records as well.

How to use rank() function ?


Scenario : In Hive, rank the companies within each country based on their turnover.
Table columns : companyName, country, turnover

Solution
Hive>
SELECT companyName, country, turnover,
       rank() over (PARTITION BY country ORDER BY turnover DESC) as rank
FROM sales;

SORT BY vs. ORDER BY


  • ORDER BY performs a total ordering of the query result set
    • All the data is passed through a single reducer
    • takes long time to execute for larger data

  • SORT BY sorts the data within each Reducer - local ordering (see the example below)
    • Each reducer's output will be sorted
    • doesn't achieve a total ordering on the dataset
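
A small illustration on the sales table used in the rank() example above (the reducer-count property name varies by Hadoop/Hive version; mapred.reduce.tasks is assumed here) :

-- Total ordering : everything passes through a single reducer
SELECT companyName, turnover FROM sales ORDER BY turnover DESC;

-- Local ordering : each reducer's output is sorted, but the overall result is not
SET mapred.reduce.tasks = 4;
SELECT companyName, turnover FROM sales SORT BY turnover DESC;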


What is a Record Reader ?


  • gets the data from the input split
  • generates key/value pairs
    • which are sent one by one to the mapper

How to perform basic Elasticsearch operations ?


http://<IP>:<port>/<index-name>/_mapping
http://10.170.208.53:9200/index-2016.02.26/_mapping
Gives the description of fields in the message event, their data types and how they are configured.

http://<IP>:<port>/_template/<template_name>
http://10.170.208.53:9200/_template/temp
Gives the description of the specified template.

http://<IP>:<port>/_cat/indices
http://10.170.208.53:9200/_cat/indices
Gives the list of indices currently present in elasticsearch.

http://<IP>:<port>/<index-name>/_search
http://10.170.208.53:9200/index-2016.02.29/_search
Gives the data present under specified index.

How to start, check and stop Elasticsearch ?


Start
Switch to the user that installed ES.
Go to the installation location of Elasticsearch and start ES as a daemon :
./bin/elasticsearch -d

Check
Check if ES is running or not
curl 'localhost:9200'
This should output general information about your Elasticsearch installation.

Stop
ps -aef | grep elasticsearch
kill -9 <processId>

How to configure Elasticsearch ?


Modify the configuration file - elasticsearch.yml

Go to the installation location of Elasticsearch
cd /home/ELK/elasticsearch-2.3.2/config/
vim elasticsearch.yml

Modify these values in the opened file :
cluster.name: my-application
node.name: node-1
bootstrap.mlockall: true
network.host: 10.170.208.53


Note : localhost might not work if it is not present in /etc/hosts.
In that case you can use the complete IP address of the VM.

So, ensure there is a localhost entry :
#127.0.0.1   localhost localhost.localdomain localhost4
10.170.208.53 localhost
10.170.208.53 test localhost

How to configure Elasticsearch to automatically start during boot up ?


Applicable if ES was installed using yum

Run this command
ps -p 1

If output is something like this : 
1 ?        00:00:00 init

Then run following command : 
chkconfig --add elasticsearch

Otherwise run following command : 
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service

What are the elements defined in pom.xml ?


pom.xml elements

  • project : Root element
    • modelVersion : Set to 4.0.0
    • groupId : ID for project group
    • artifactId : Project / Artifact Id
    • version : Project version under given group
    • packaging : Packaging type - jar / war / ...
    • name : Name of maven project
    • url : Project URL
    • dependencies : Dependencies of the project
      • dependency : A dependency under dependencies element
        • groupId, artifactId, version
        • scope : Scope of the dependency :
                       compile / provided / runtime / system / test


Example
<project ...>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.shaan.project</groupId>
  <artifactId>myapp</artifactId>
  <version>1.0</version>

  <packaging>jar</packaging>
  <name>My sample App</name>
  <url>http://maven.apache.org</url>

  <dependencies>
     <dependency>
          <groupId>junit</groupId>  
          <artifactId>junit</artifactId>  
          <version>4.2.2</version>  
          <scope>test</scope>
     </dependency>
     ...
  </dependencies>  
  ...
</project>

Monday, 16 May 2016

What is pom.xml and how is it different from project.xml ?


pom.xml

  • POM = Project Object Model
  • contains project info and configuration - dependencies, source dir, test dir, build dir
  • Maven reads pom.xml and executes the target


pom.xml vs. project.xml

  • Before Maven 2, project.xml was used
  • pom.xml is used in Maven 2 and later versions


What is the fully qualified artifact name of Maven project ?


<groupId>:<artifactId>:<version>
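
Example (for the pom.xml shown earlier) : com.shaan.project:myapp:1.0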

How to check installed Maven version ?


Check maven version
mvn -version