Hadoop Kelvin

From Lawa
Jump to: navigation, search

What is Hadoop Kelvin ?

Hadoop Kelvin is a network monitoring system designed for the Hadoop Map-Reduce framework. It monitors data (not control) traffic between Hadoop nodes and provides the basis for multiple ways to store and access the stored monitoring data (the current implementation provides for log-based storage). It is designed to be easily extensible, flexible and to operate with a minimal effect on the running time of Hadoop jobs.

Purpose

The purpose of Hadoop Kelvin is to enable researches and cluster administrators to extract logs about the network usage of various jobs executed on a Hadoop cluster. It differs from external measurement tools because it reports context: It can differentiate between the different types of data traffic that occur during the flow of a Hadoop job. It also differs from the Hadoop Counter mechanism because it is both real-time (up to network latency and report frequency settings) and fine-grained: It, potentially, reports every single data transmission in the cluster, enabling the analysis of metrics such as the bandwidth available to the nodes in different stages of the computation.

The Method

Hadoop Kelvin implements a server-client architecture, with the clients collecting networking traffic reports from the various tasks and then reporting their data over HTTP to the statistic server. For a small cluster, a single statistic server can be used, and it can even be deployed on the same machine that hosts the NameNode. For large clusters the statistic server, depending on the frequency of the traffic and the reporting settings, may becomes overloaded. In such a case, the cluster should be partitioned and a statistic server deployed for each group of machines. In the simplest deployment scenario, where no central collection of the reports is required, each machine can support its own statistic server and all nodes configured to talk to 'localhost' when submitting their reports. The load of the (mostly idle) server process is minimal, and the effect on cluster performance is minor in such a configuration. All of these configurations are easily configured via an XML file added to the usual Hadoop configuration folder.

What is Collected?

Hadoop Kelvin collects data about the following data transfers:

• HDFS reads (regardless of who is performing the read).

• HDFS writes (regardless of who is the origin of the data).

• Data transfers between Mappers and Reducers during a Map-Reduce job execution.


The data collected about each transfer includes:

• Source machine.

• Destination machine.

• Starting timestamp.

• Duration of transfer in milliseconds.

• Size of the transferred data, in bytes.

• The type of transfer: HDFS Read, HDFS Write or Reducer Input.

Some Implementation Details

In order to extract this information, "hook-points" have been inserted into the Hadoop code at the locations where data flows occur. These hookpoints submit NetworkStatisticPacket packets to the HTTPStatisticClient, which periodically sends them over HTTP to a HTTPStatisticServer (We chose the Jetty HTTP Server Hadoop already uses as the basis for its implementation). The HTTPStatisticServer then inserts the packets to the data stores configured by the cluster administrator (currently, a Log4J logger is implemented, but it is possible to implement additional classes in order to store the data into databases, for example).

The list of locations where hook-points have been added can be found here:

A somewhat more detailed overview of the inner-workings of Kelcin can be found here: