Experimental Methods in Computer Science – Exercise 5

Exercise 5 – Correlations in Data

Goals

Consider alternative ways to quantify correlations in data
Work with a largish dataset
Gain more experience in looking at data

Background

Measured data or workload data often come in records comprising several fields, representing different attributes of the measured event or workload item. In many cases, there is some form of correlation between different fields. Our goal is to learn how to uncover and maybe quantify such correlations.

The first step, as usual, is to look at the data. Here that means drawing a scatter plot. Given two fields, we use the X axis to denote the value of one field, and the Y axis to denote the value of the other field. Each record is then represented by a dot whose coordinates are the values of the two fields. We can then see whether any pattern emerges.

Another obvious thing to do is to calculate the covariance of the two fields. This measures the degree to which they deviate from the mean in a coordinated manner. Denoting the two values of record i by X_i and Y_i, the covariance is

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

where $\bar{X}$ is the average of the $X_i$ and $\bar{Y}$ is the average of the $Y_i$. The correlation coefficient is the covariance normalized to the range [-1, 1] by dividing by the product of the standard deviations of the $X_i$ and the $Y_i$.
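The two definitions above translate almost directly into code. The following is a minimal sketch in plain Python, not a prescribed implementation; the function names and the toy data are made up for illustration, and the standard deviations are the population (divide by n) versions, to match the covariance formula.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the two means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the population standard deviations."""
    sx = math.sqrt(covariance(xs, xs))  # the variance is the covariance of a field with itself
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

# Toy data, invented for illustration: ys is an exact linear function of xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covariance(xs, ys))
print(correlation(xs, ys))
```

For an exact increasing linear relationship like this toy data, the coefficient comes out at 1 up to floating-point rounding. Note that `correlation` divides by zero if either field is constant, so real data may need a guard for that case.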

In this exercise, we will investigate the possible correlation between the sizes and runtimes of parallel jobs. We will use a workload log from the SDSC Paragon machine from 1995. This log comes from the Parallel Workloads Archive, and its format is explained there. All you really need to know is:

This is a simple ASCII file with one line per job
Lines starting with ; are comment lines and can be ignored
Fields are separated by white space
Job runtimes are given in seconds in field 4 (counting from 1)
Job sizes (number of processors used) are given in field 5 (counting from 1)
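The format rules listed above are enough to extract the two fields of interest. A possible sketch follows; `parse_jobs` is a hypothetical helper name, and the sample lines are invented for illustration, not taken from the log. Whether some records carry placeholder values that should be filtered out is something to check against the format description in the archive; this sketch does no such filtering.

```python
def parse_jobs(lines):
    """Yield (runtime_seconds, size_processors) for each job line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        fields = line.split()       # fields are separated by white space
        runtime = float(fields[3])  # field 4, counting from 1
        size = int(fields[4])       # field 5, counting from 1
        yield runtime, size

# Made-up sample lines in the same layout (not real log entries).
sample = [
    "; an invented comment line",
    "1 0 10 120 8 0 0",
    "2 5 0 3600 64 0 0",
]
jobs = list(parse_jobs(sample))
print(jobs)
```

Because `parse_jobs` accepts any iterable of lines, it can also be given `sys.stdin` when the script sits at the end of a shell pipe.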

To save disk space, you can use the data directly from the course directory using

gzip -dc ~exp/www/SDSC-Par-1995-2.1-cln.swf.gz | ...

The compressed file size is under 800KB; the full size is nearly 5MB.
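If you prefer to read the compressed file from inside a script rather than through a shell pipe, Python's standard `gzip` module can do the same job. This is only a sketch: `open_log` is a hypothetical helper, and the path is the one quoted above, which exists only on the course machines.

```python
import gzip
import os

def open_log(path):
    """Open a gzip-compressed log as text, without unpacking it to disk."""
    # errors="replace" is a defensive assumption in case of stray non-ASCII bytes.
    return gzip.open(os.path.expanduser(path), mode="rt",
                     encoding="ascii", errors="replace")

COURSE_LOG = "~exp/www/SDSC-Par-1995-2.1-cln.swf.gz"  # path quoted in the text

# Only runs where the course directory is actually available.
if os.path.exists(os.path.expanduser(COURSE_LOG)):
    with open_log(COURSE_LOG) as log:
        n_jobs = sum(1 for line in log
                     if line.strip() and not line.startswith(";"))
    print(n_jobs, "job lines")
```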

Assignment

Analyze the correlation between the job sizes and runtimes in this log.

What you need to do is three things:

Draw a scatter plot of the data. Note that there are quite a lot of data points, so this might come out very crowded, and also lead to a very large image file. One possible way to avoid problems is to use random sampling and use only a small part of the data instead of all of it.
Calculate the correlation coefficient of the job sizes and runtimes.
Try to come up with some additional metric or method to see if there is any correlation between job sizes and runtimes. Hint: I think there is some correlation there. You are not expected to look for it in the same way that I would. What I want is a description of some idea you tried, and what came out of it. It's OK if the idea didn't work out so well as long as it makes sense in principle.
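The random sampling suggested for the scatter plot can be sketched as follows. `sample_records` is a hypothetical helper, the stand-in records are generated rather than read from the log, and the sample size of 100 is arbitrary; choose one that keeps your plot readable.

```python
import random

def sample_records(records, k, seed=0):
    """Return k records drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same plot can be regenerated
    records = list(records)
    if k >= len(records):
        return records         # nothing to thin out
    return rng.sample(records, k)

# Stand-in for parsed (runtime, size) pairs; invented, not real data.
records = [(float(i), i % 16 + 1) for i in range(1000)]
subset = sample_records(records, 100)
print(len(subset))
```

Sampling uniformly keeps the overall shape of the point cloud but may drop rare extreme jobs, so it is worth comparing a couple of different seeds before drawing conclusions from the plot.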

Submit

Submit a single pdf file that contains all the following information:

Your names, logins, and IDs.
The requested graph with your scatter plot. Don't forget to label the axes, etc.
The result of the calculation of the correlation coefficient (NOT the covariance).
A short explanation of what the results show, i.e. do you think there is any correlation there.
A short description of what else you did to try and find some correlation. Explain the rationale of why you thought this might be a good idea, and what came out of it. Specifically, did you find that there is some correlation there? Remember that the most important thing is to be truthful regarding your data, so don't be tempted to say you found something if your data doesn't show it.

The submission deadline is Tuesday morning, 22/3/11, because of Purim, but if you can submit earlier, please do.

Please do the exercise in pairs.
