Tuesday, 31 May 2016

How to run Logstash using a configuration file ?


Go to the installation location of Logstash.
> cd /home/osadmin/ELK/logstash-2.3.2
> ./bin/logstash -f  myapp-logstash.conf

If the configuration is valid, Logstash should start up.

Note : To verify, append some log lines to the log file specified in the configuration ("/var/log/test.log").
Log messages in JSON format should appear on the console.
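
The post does not show myapp-logstash.conf itself. As a rough sketch, a minimal configuration for the /var/log/test.log example mentioned above might look like the following (the grok pattern, ES host and codec are placeholders, not the actual file) :

# myapp-logstash.conf (hypothetical sketch)
input {
  file {
    path => "/var/log/test.log"        # log file to watch
    start_position => "beginning"      # read existing content on the first run
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}\|%{GREEDYDATA:data}" }
  }
}
output {
  elasticsearch { hosts => ["10.170.208.53:9200"] }   # push events to ES
  stdout { codec => rubydebug }                       # also print each event on the console
}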

How to install Logstash ?


Install Logstash

USING YUM
a. Import the Elasticsearch public GPG key
> rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
      b. Create a file named logstash.repo (with a .repo suffix) under the directory /etc/yum.repos.d/ and add the following to it :

[logstash-2.2]
name=Logstash repository for 2.2.x packages
baseurl=http://packages.elastic.co/logstash/2.2/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

c. Install
> yum install logstash

OR


USING ZIP
a. Download the ZIP and extract it.
[root@CSRToolTest1 logstash-2.3.2]# pwd
/home/osadmin/ELK/logstash-2.3.2

[root@CSRToolTest1 logstash-2.2.0]# ll
total 160
drwxrwxrwx 2 osadmin osadmin   4096 Feb 12 07:36 bin
-rwxrwxrwx 1 osadmin osadmin 100382 Feb 12 07:36 CHANGELOG.md
-rwxrwxrwx 1 osadmin osadmin   2249 Feb 12 07:36 CONTRIBUTORS
-rwxrwxrwx 1 osadmin osadmin   3844 Feb 12 07:36 Gemfile
-rwxrwxrwx 1 osadmin osadmin  22304 Feb 12 07:36 Gemfile.jruby-1.9.lock
drwxrwxrwx 4 osadmin osadmin   4096 Feb 12 07:36 lib
-rwxrwxrwx 1 osadmin osadmin    589 Feb 12 07:36 LICENSE
-rwxrwxrwx 1 osadmin osadmin    149 Feb 12 07:36 NOTICE.TXT
-rwxrwxrwx 1 osadmin osadmin     88 Feb 12 10:29 test-logstash-filter.conf
drwxrwxrwx 4 osadmin osadmin   4096 Feb 12 07:37 vendor

Monday, 30 May 2016

What will happen if ES is installed with a non-root user and you try to start it with the root user ?


Problem
Following error occurred :
[root@CSRToolTest1 bin]# Exception in thread "main" java.lang.RuntimeException: don't run elasticsearch as root.

Possible cause 
You installed ES with a non-root user and are trying to start it as root; Elasticsearch refuses to run as the root user.

Solution


Start ES with the non-root user that installed it.

What will happen if ES is installed with the root user and you try to start it with a non-root user ?



Problem
Following error occurred :
java.io.FileNotFoundException: /home/shaan/ELK/elasticsearch-2.2.0/logs/my-application-shaandev.log (Permission denied)

Possible cause 
ES was installed (and first run) using the root user, so the <cluster.name>.log file was created and is owned by root.

Now, when you try to start ES with the proper (non-root) user, it is not able to access this log file created by root.

Solution

Delete the file <cluster.name>.log (here, my-application-shaandev.log) or change its ownership/permissions, then start ES again.

How to create GROK patterns ?


Use online Grok Debugger : http://grokdebug.herokuapp.com/

Provide the input and the pattern; it will parse the input and show the extracted fields as a JSON string.
Always build the pattern in small steps so that you end up with the correct pattern.


Grok pattern examples

Data
2016-03-11 09:25:21,165|INFO |com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090

Pattern #1

%{GREEDYDATA:data}

Output
{
  "data": [
    [
      "2016-03-11 09:25:21,165|INFO |com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090"
    ]
  ]
}


Pattern #2
%{TIMESTAMP_ISO8601:timestamp}\|%{GREEDYDATA:data}

Output
{
  "timestamp": [
    [
      "2016-03-11 09:25:21,165"
    ]
  ],
  "YEAR": [
    [
      "2016"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],

  ...

  "data": [
    [
"INFO |com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090"

    ]
  ]
}


Pattern #3
%{TIMESTAMP_ISO8601:timestamp}\|%{DATA:logLevel}\|%{GREEDYDATA:data}

Output
{
  "timestamp": [
    [
      "2016-03-11 09:25:21,165"
    ]
  ],
  "YEAR": [
    [
      "2016"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],

  ...


  "logLevel": [
    [
      "INFO "
    ]
  ],
  "data": [
    [
"com.shaan.Logging [log]|India|143011|Admin|TECHNICAL|cashout|Cash Out|0(success)|34|MSISDN: 8883039090"
    ]
  ]
}



Pattern #4
%{TIMESTAMP_ISO8601:timestamp}\|%{DATA:logLevel}\|%{DATA:eventSource}\|%{DATA:country}\|%{DATA:userId}\|%{DATA:role}\|%{DATA:logType}\|%{DATA:operation}\|%{INT:resultCode}\(%{DATA:result}\)\|%{NUMBER:timeConsumed}\|%{GREEDYDATA:data}

Output
{
  "timestamp": [
    [
      "2016-03-11 09:25:21,165"
    ]
  ],
  "YEAR": [
    [
      "2016"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],

  ...

  "logLevel": [
    [
      "INFO "
    ]
  ],
  "eventSource": [
    [
      "com.shaan.Logging [log]"
    ]
  ],
  "country": [
    [
      "India"
    ]
  ],
  "userId": [
    [
      "143011"
    ]
  ],

  "role": [
    [
      "Admin"
    ]
  ],
  "logType": [
    [
      "TECHNICAL"
    ]
  ],
  "operation": [
    [
      "Cash Out"
    ]
  ],
  "resultCode": [
    [
      "0"
    ]
  ],
  "result": [
    [
      "success"
    ]
  ],
  "timeConsumed": [
    [
      "34"
    ]
  ],
   "data": [
    [
      "MSISDN: 8883039090"
    ]
  ]
}

Why is Logstash not pushing data to ES ?


Problem
Data is not pushed by Logstash (when Logstash is launched using the config file)

Possible issue
The data may already have been pushed to ES.

Solution

  • Remove the Logstash index that was created
  • Launch Logstash
    • If the problem is still there :
      • Add one line at the end of the log file so that Logstash can pick up the change
      • Launch Logstash again


How to delete an index ?


$ curl -XDELETE http://<ip>:<port>/<index>

Example
$ curl -XDELETE http://10.170.208.53:9200/logstash-2016.05.03

How to find current indices created in ES ?


$ curl http://<ip>:<port>/_cat/indices

Example
curl http://10.170.208.53:9200/_cat/indices

open logstash-2016.05.03 1 1 24 6 75.2mb 75.2mb

open .kibana 1 1 24 6 75.2kb 75.2kb

In the above example, we have 2 indices.

Monday, 23 May 2016

How Sqoop can be used in a Java program ?



  • The Sqoop JAR must be included in the classpath of the Java program.
  • After this, the Sqoop.runTool() method must be invoked.
  • The necessary parameters are passed to Sqoop programmatically, just like command-line arguments, as shown in the sketch below.
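
A minimal sketch (Sqoop 1 assumed; the connection URL, credentials, table name and target directory are hypothetical placeholders — the Hadoop configuration typically also needs to be on the classpath for the import to reach the cluster) :

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Build the same arguments you would pass on the command line
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/mydb",   // hypothetical database
            "--username", "dbuser",
            "--password", "dbpass",
            "--table", "employees",                         // hypothetical table
            "--target-dir", "/user/osadmin/employees"
        };

        // Sqoop.runTool() runs the tool and returns the exit code (0 = success)
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop finished with exit code " + exitCode);
    }
}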


Is it mandatory to set input and output formats in a MapReduce job ?


No, it is not mandatory to set the input and output formats in MapReduce.
By default, both the input and output formats are 'text'.

Job.submit() vs. Job.waitForCompletion()


  • Job.submit() internally creates a submitter instance and submits the job, returning immediately.
  • Job.waitForCompletion() polls the job's progress at a regular interval of one second. If the job executes successfully, it displays a success message on the console; otherwise it displays the relevant error message. (See the sketch below.)
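
A small sketch of the difference; 'job' is assumed to be an already-configured org.apache.hadoop.mapreduce.Job :

import org.apache.hadoop.mapreduce.Job;

public class SubmitVsWait {

    // Fire-and-forget : returns immediately after submitting the job
    static void fireAndForget(Job job) throws Exception {
        job.submit();
    }

    // Blocking : submits, then polls progress every second and prints it (true = verbose);
    // returns true only if the job completed successfully
    static void submitAndWait(Job job) throws Exception {
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}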


What is the use of Partitioner ?


It makes sure that all the values of a single key go to the same reducer.

It routes the mapper output to the reducers by determining which reducer is responsible for a particular key.

How does the JobTracker work ?


When a client application submits a job to the JobTracker :

  • The JobTracker talks to the NameNode to find the location of the data
  • It locates TaskTracker nodes with available slots for the data and assigns the work to the chosen TaskTracker nodes
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do.
  • The JobTracker may resubmit the task on another node, or it may mark that task as one to avoid.

Friday, 20 May 2016

How many instances of a JobTracker run on Hadoop cluster ?


Only one JobTracker process runs on any Hadoop cluster.
The JobTracker runs within its own JVM process.

How to write a custom partitioner for a Hadoop job ?


1. Create a new class that extends the Partitioner class
2. Override the getPartition() method
3. Add the custom partitioner to the job programmatically using the setPartitionerClass() method (as sketched below)
    or
    add the custom partitioner to the job through a config file
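
A minimal sketch, assuming Text keys and IntWritable values (the class name is hypothetical) :

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// 1. Extend Partitioner
public class CountryPartitioner extends Partitioner<Text, IntWritable> {

    // 2. Override getPartition() : decide which reducer receives this key
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // simple hash-based routing, kept non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// 3. In the driver :
//    job.setPartitionerClass(CountryPartitioner.class);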

Wednesday, 18 May 2016

What is the use of Combiners ?


Combiners 

  • used to increase the efficiency of a MapReduce program
  • reduce the amount of data that has to be transferred to the Reducers
  • a Reducer can be used as-is as a Combiner if the operation performed is commutative and associative (see the snippet below)
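
For example, in a word-count style job where the reduce operation is a plain sum, the same Reducer class can be registered as the Combiner in the driver (the class names are hypothetical; 'job' is an org.apache.hadoop.mapreduce.Job) :

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // Reducer reused as a Combiner (sum is commutative and associative)
job.setReducerClass(WordCountReducer.class);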

What is the use of chain-mapper and chain-reducer ?


Chain mapper : enables multiple mappers to execute before the reducer
Chain reducer : enables multiple mappers to execute after the reducer
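
A hedged sketch of how such a chain [MAP1 -> MAP2 -> REDUCE -> MAP3] might be wired in the driver using the new-API ChainMapper/ChainReducer classes; FirstMapper, SecondMapper, MyReducer, PostMapper and the Text key/value types are hypothetical placeholders :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// inside the driver
Job job = Job.getInstance(new Configuration(), "chain-example");

// mappers that run before the reducer
ChainMapper.addMapper(job, FirstMapper.class,  Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainMapper.addMapper(job, SecondMapper.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));

// the single reducer, followed by a mapper that runs after it
ChainReducer.setReducer(job, MyReducer.class,  Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, PostMapper.class,  Text.class, Text.class, Text.class, Text.class, new Configuration(false));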

What are the main configuration parameters needed to run a Mapreduce Job ?


  • Input path
  • Output path
  • Input format (Optional)
  • Output format (Optional)
  • Mapper class : Class containing the map function
  • Reducer class : Class containing the reduce function
  • JAR : containing the Mapper, Reducer and Driver classes (a driver sketch wiring these parameters together is shown below)
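
A minimal driver sketch; the Mapper/Reducer class names and the paths are hypothetical, and the input/output format calls are shown even though they are optional :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");

        job.setJarByClass(WordCountDriver.class);           // JAR containing Mapper/Reducer/Driver

        job.setMapperClass(WordCountMapper.class);           // Mapper class (hypothetical)
        job.setReducerClass(WordCountReducer.class);         // Reducer class (hypothetical)

        job.setInputFormatClass(TextInputFormat.class);      // optional : text is the default
        job.setOutputFormatClass(TextOutputFormat.class);    // optional : text is the default

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}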


What are the methods of the Mapper class ?


1. protected void setup
2. protected void cleanup
3. protected void map
4. void run
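
The typical shape of a Mapper subclass in the new API; the word-count logic is only illustrative :

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // called once per task before any map() call, e.g. to read configuration
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);      // emit (word, 1)
        }
    }

    @Override
    protected void cleanup(Context context) {
        // called once per task after the last map() call, e.g. to release resources
    }

    // run() drives setup() -> map() per record -> cleanup() and is rarely overridden
}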

What protocol is used in shuffle & sort in MapReduce ?


HTTP

What is the utility of the heartbeat ?


The heartbeat from the TaskTracker to the JobTracker gives information about the status of tasks.

The TaskTracker sends a heartbeat to the JobTracker at regular intervals, and also indicates whether it can take new tasks for execution.
The JobTracker then consults the Scheduler to assign tasks to the TaskTracker and sends back the list of tasks as the heartbeat response to the TaskTracker.


What will happen if JobTracker doesn't receive any heartbeat from TaskTracker ?


It assumes that the TaskTracker is down and resubmits the corresponding tasks to other nodes in the cluster.

What are the methods of the Reducer class ?


1. protected void setup
2. protected void cleanup
3. protected void reduce
4. void run

What is "Partitioning, Shuffle and sort" phase after finishing Map phase ?



  • Partitioning
    • determines which reducer instance will receive which intermediate keys and values.
    • It is necessary that for any key (regardless of which mapper instance generated it) the destination partition is the same.
  • Shuffle
    • Process of moving map outputs to the reducers.
    • As soon as the first map tasks have completed (the nodes may still be performing several more map tasks each), the nodes begin exchanging the intermediate outputs from the map tasks, moving them to where they are required by the reducers.
  • Sort
    • The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.


What is the usage of Context object ?


Context object

  • has configuration details for the job
  • allows mappers/reducers to interact with the rest of the Hadoop system
  • used for (see the snippet below) :
    • updating counters
    • reporting progress
    • providing application-level status updates
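
A few examples of these Context calls inside a map() method (the property, group and counter names are hypothetical) :

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextDemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // configuration details for the job (property name is hypothetical)
        String appName = context.getConfiguration().get("app.name", "myapp");

        // update a custom counter
        context.getCounter(appName, "RECORDS_SEEN").increment(1);

        // report progress so a long-running task is not treated as hung
        context.progress();

        // application-level status, visible in the job UI
        context.setStatus("processing offset " + key.get());

        context.write(value, new IntWritable(1));
    }
}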


What are the most commonly defined input formats in Hadoop ?


1. Text Input Format : Default input format ; Key=Line offset , Value=Line

2. Key Value Input Format : for plain text files where the lines are broken into key and value.

3. Sequence File Input Format : used for reading sequence files

Tuesday, 17 May 2016

Namenode vs. Backup node vs. Checkpoint Namenode



  • NameNode
    • manages the metadata, i.e. information about all the files present in HDFS on a Hadoop cluster
    • uses 2 files for the namespace :
      • FS image : keeps track of the latest checkpoint of the namespace
      • edit logs : A log of changes that have been made to the namespace since the last checkpoint.
  • Checkpoint Node
    • Same structure as Namenode
    • creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them
    • keeps track of the latest checkpoint
  • Backup node
    • Also provides checkpoint functionality as the Checkpoint node
    • Additionally, maintains up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.

How to check the file system in HDFS ?


Use the fsck command, which checks the health of files, block names and block locations.

# hdfs fsck /dir/hadoop-test -files -blocks -locations

Input split vs. Block


  • Block is physical division of data which doesn't consider logical boundaries of records
  • Input split is the logical division which considers logical boundaries of records as well.

How to use rank() function ?


Scenario : In Hive, rank the companies within each country based on their turnover.
Table columns : companyName, country, turnover

Solution
Hive>
SELECT companyName, country, turnover,
       rank() over (PARTITION BY country ORDER BY turnover DESC) as rank
FROM sales;

SORT BY vs. ORDER BY


  • ORDER BY performs a total ordering of the query result set
    • All the data is passed through a single reducer
    • takes long time to execute for larger data

  • SORT BY sorts the data within each Reducer - local ordering (see the example below)
    • Each reducer's output will be sorted
    • doesn't achieve a total ordering on the dataset
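
A small illustration on the sales table used in the rank() example above (the reducer-count property name varies by Hadoop/Hive version; mapred.reduce.tasks is assumed here) :

-- Total ordering : everything passes through a single reducer
SELECT companyName, turnover FROM sales ORDER BY turnover DESC;

-- Local ordering : each reducer's output is sorted, but the overall result is not
SET mapred.reduce.tasks = 4;
SELECT companyName, turnover FROM sales SORT BY turnover DESC;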


What is a Record Reader ?


  • gets the data from the input split
  • generates key/value pairs
    • which are sent one by one to the mapper

How to perform basic Elasticsearch operations ?


http://<IP>:<port>/<index-name>/_mapping
http://10.170.208.53:9200/index-2016.02.26/_mapping
Gives the description of fields in the message event, their data types and how they are configured.

http://<IP>:<port>/_template/<template_name>
http://10.170.208.53:9200/_template/temp
Gives the description of the specified template.

http://<IP>:<port>/_cat/indices
http://10.170.208.53:9200/_cat/indices
Gives the list of indices currently present in elasticsearch.

http://<IP>:<port>/<index-name>/_search
http://10.170.208.53:9200/index-2016.02.29/_search
Gives the data present under specified index.

How to start, check and stop Elasticsearch ?


Start
Switch to the user that installed ES.
Go to the installation location of Elasticsearch and start ES as a daemon :
./bin/elasticsearch -d

Check
Check if ES is running or not
curl 'localhost:9200'
This should output general information about your Elasticsearch installation.

Stop
ps -aef | grep elasticsearch
kill -9 <processId>

How to configure Elasticsearch ?


Modify the configuration file - elasticsearch.yml

Go to the installation location of Elasticsearch
cd /home/ELK/elasticsearch-2.3.2/config/
vim elasticsearch.yml

Modify these values in the opened file :
cluster.name: my-application
node.name: node-1
bootstrap.mlockall: true
network.host: 10.170.208.53


Note : localhost might not work if it is not present in /etc/hosts.
In that case you can use the complete IP address of the VM.

So, ensure there is a localhost entry :
#127.0.0.1   localhost localhost.localdomain localhost4
10.170.208.53 localhost
10.170.208.53 test localhost

How to configure Elasticsearch to automatically start during boot up ?


Applicable if ES was installed using yum

Run this command
ps -p 1

If output is something like this : 
1 ?        00:00:00 init

Then run following command : 
chkconfig --add elasticsearch

Otherwise run following command : 
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service

What are the elements defined in pom.xml ?


pom.xml elements

  • project : Root element
    • modelVersion : Set to 4.0.0
    • groupId : ID for project group
    • artifactId : Project / Artifact Id
    • version : Project version under given group
    • packaging : Packaging type - jar / war / ...
    • name : Name of maven project
    • url : Project URL
    • dependencies : Dependencies of the project
      • dependency : A dependency under dependencies element
        • groupId, artifactId, version
        • scope : Scope of the dependency :
                       compile / provided / runtime / system / test


Example
<project ...>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.shaan.project</groupId>
  <artifactId>myapp</artifactId>
  <version>1.0</version>

  <packaging>jar</packaging>
  <name>My sample App</name>
  <url>http://maven.apache.org</url>

  <dependencies>
     <dependency>
          <groupId>junit</groupId>  
          <artifactId>junit</artifactId>  
          <version>4.2.2</version>  
          <scope>test</scope>
     </dependency>
     ...
  </dependencies>  
  ...
</project>

Monday, 16 May 2016

What is pom.xml and how is it different from project.xml ?


pom.xml

  • POM = Project Object Model
  • contains project info and configuration - dependencies, source dir, test dir, build dir
  • Maven reads pom.xml and executes the target


pom.xml vs. project.xml

  • Before Maven 2, project.xml was used
  • pom.xml is used in Maven 2 and later versions


What is the fully qualified artifact name of Maven project ?


<groupId>:<artifactId>:<version>
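
Example (for the pom.xml shown earlier) : com.shaan.project:myapp:1.0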

How to check installed Maven version ?


Check maven version
mvn -version