= Kelvin Architecture =
== High-Level Overview ==
Hadoop Kelvin consists of two main parts: the Statistics Server and the Statistics Client.
The Statistics Server is a program which runs on a single machine in the cluster and serves as a sink for all the traffic reports arriving from the cluster nodes. Typically it runs on one of the master machines when a single Statistics Server is present; alternatively, a subset of the slave machines can be used when several such servers are required, or each machine can run its own server for the tasks that run on it. The server operates a set of user-configurable (via XML) data storers (which are write-only), data retrievers (which are read-only) and data manipulators (which provide read-and-write access), to which measurement data is written and from which queries about past measurement data are answered. Currently, Hadoop Kelvin provides a log-based information store which records all traffic reports in plain text via a Log4J logger. All Hadoop Kelvin traffic is carried over HTTP.
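The read/write split between these handler types can be pictured as a pair of small Java interfaces. The sketch below is purely illustrative; the type and method names are assumptions for the sake of the example and do not reflect the actual Kelvin class names.

{{{#!java
// Illustrative sketch only -- the names here are hypothetical, not the real Kelvin interfaces.

/** A traffic report sent by a cluster node to the Statistics Server. */
class TrafficReport {
    String sourceHost;
    long bytesTransferred;
    long timestampMillis;
}

/** Write-only view: a Data Storer only persists incoming reports. */
interface DataStorer {
    void store(TrafficReport report);
}

/** Read-only view: a Data Retriever only answers queries about past reports. */
interface DataRetriever {
    java.util.List<TrafficReport> query(long fromMillis, long toMillis);
}

/** Read-and-write view: a Data Manipulator combines both roles. */
interface DataManipulator extends DataStorer, DataRetriever {
}
}}}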
== Data Storers, Data Retrievers and Data Manipulators – Why Hadoop Kelvin is More than a Logger ==
As briefly described above, the system incorporates the notions of a Data Storer, a Data Retriever and a Data Manipulator; collectively, we refer to them as Data Handlers. The first two are defined as Java interfaces which can be implemented by anyone seeking to expand upon the functionality of Kelvin, while the latter is simply an entity implementing both interfaces at once. Adding extra handlers does not require recompiling Hadoop (they just need to be placed in a JAR file on the classpath and enabled in the XML configuration files), but it does require a restart of the Statistics Server(s).

The default Kelvin implementation supplies one Data Storer (LogStatisticStore) and one Data Manipulator (H2DBManipulator), which is therefore also a Storer. The LogStatisticStore logs all traffic reports to a Log4J log file. This is the simplest form of a Data Storer and should be used mainly for debugging or research purposes; the log files tend to grow very large rather quickly, so it is not suited for long-term, constant deployment in a production environment. The H2DBManipulator stores the traffic reports in an H2 (SQL) database. This database provides the basic building block for the future Hadoop scheduler, as it allows other code to access the traffic reports collected over a period of time. The existence of the SQL database and the extensibility of Kelvin set it apart from a simple logger interface (in fact, the default configuration contains just one such logger).
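As a rough illustration of this extension mechanism, a custom Data Storer might look like the following. It reuses the hypothetical DataStorer and TrafficReport types sketched earlier, so the real Kelvin interface will differ; such a class would be compiled into its own JAR, placed on the Statistics Server's classpath, enabled in the XML configuration files and picked up after a server restart.

{{{#!java
// Hypothetical example only -- it builds on the illustrative DataStorer/TrafficReport
// types sketched above, not on the actual Kelvin interfaces.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

/** A toy Data Storer that appends each traffic report as one CSV line. */
class CsvStatisticStore implements DataStorer {

    private final PrintWriter out;

    CsvStatisticStore(String path) throws IOException {
        // Open the CSV file in append mode so a server restart does not lose old data.
        this.out = new PrintWriter(new FileWriter(path, true), true);
    }

    @Override
    public void store(TrafficReport report) {
        // One line per report: timestamp, source host, bytes transferred.
        out.printf("%d,%s,%d%n",
                report.timestampMillis, report.sourceHost, report.bytesTransferred);
    }
}
}}}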