Experimental Methods in Computer Science – Exercise 5

Exercise 5 – Correlations in Data

Goals

Background

Measured data or workload data often come in records comprising several fields, representing different attributes of the measured event or workload item. In many cases, there is some form of correlation between different fields. Our goal is to learn how to uncover and maybe quantify such correlations.

The first step, as usual, is to look at the data. In this case, that means drawing a scatter plot. Given two fields, we use the X axis for the value of one field and the Y axis for the value of the other. Each record is then represented by a dot whose coordinates are the values of the two fields. We can then see whether any pattern emerges.
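As a rough illustration of the scatter plot described above, here is a minimal Python sketch. It assumes that records are available as dictionaries and that matplotlib is installed. The names `scatter_points` and `draw_scatter`, the field names, and the output path are hypothetical and chosen only for this example.

```python
def scatter_points(records, x_field, y_field):
    """Return parallel lists of X and Y coordinates, one pair per record."""
    xs = [rec[x_field] for rec in records]
    ys = [rec[y_field] for rec in records]
    return xs, ys


def draw_scatter(xs, ys, x_label, y_label, out_path):
    """Draw one dot per record and save the figure to out_path."""
    # matplotlib is imported here so that scatter_points works without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys, s=4, alpha=0.3)  # small, translucent dots
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    fig.savefig(out_path)
    plt.close(fig)


# Example use with made-up records (not real measurements):
# records = [{"a": 1, "b": 10.0}, {"a": 2, "b": 7.5}]
# xs, ys = scatter_points(records, "a", "b")
# draw_scatter(xs, ys, "field a", "field b", "scatter.png")
```

Small translucent dots are only one possible choice; they may help when many records share the same coordinates, since overlapping dots then appear darker.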

Another obvious thing to do is to calculate the covariance of the two fields. This measures the degree to which they deviate from their means in a coordinated manner. Denoting the two values of record $i$ by $X_i$ and $Y_i$, the covariance is

\[ \mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \]

where $\bar{X}$ is the average of the $X_i$ and $\bar{Y}$ is the average of the $Y_i$. The correlation coefficient is the covariance normalized to the range [-1, 1] by dividing by the product of the standard deviations of the $X_i$ and the $Y_i$.
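The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, and the standard deviations use the same 1/n normalization as the covariance formula.

```python
import math


def covariance(xs, ys):
    """Covariance with 1/n normalization, as in the formula above."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return covariance(xs, ys) / (std_x * std_y)
```

For perfectly linear data such as `xs = [1, 2, 3, 4]` and `ys = [2, 4, 6, 8]`, the covariance is 2.5 and the correlation coefficient is 1 up to floating-point rounding. Reversing `ys` gives -1.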

In this exercise, we will investigate the possible correlation between the sizes and runtimes of parallel jobs. We will use a workload log from the SDSC Paragon machine from 1995. This log comes from the Parallel Workloads Archive, and its format is explained there. All you really need to know is

To save disk space, you can use the data directly from the course directory using
gzip -dc ~exp/www/SDSC-Par-1995-2.1-cln.swf.gz | ...
The compressed file size is under 800KB; the full size is nearly 5MB.
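The decompressed stream can be piped into a small script. The sketch below is one possible way to extract job sizes and runtimes in Python. It rests on assumptions that should be checked against the archive's format description: that the log follows the Standard Workload Format, with header lines starting with `;`, the runtime in seconds in field 4, the number of allocated processors in field 5, and -1 marking a missing value. The names `parse_swf` and `main` are hypothetical.

```python
import sys

# Zero-based positions of the fields, ASSUMED from the Standard Workload Format.
RUNTIME_FIELD = 3  # field 4: run time in seconds
SIZE_FIELD = 4     # field 5: number of allocated processors


def parse_swf(lines):
    """Return (sizes, runtimes) for all jobs where both values are known."""
    sizes, runtimes = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) <= SIZE_FIELD:
            continue  # skip malformed records
        runtime = float(fields[RUNTIME_FIELD])
        size = int(fields[SIZE_FIELD])
        if runtime < 0 or size <= 0:
            continue  # -1 is assumed to mark a missing value
        sizes.append(size)
        runtimes.append(runtime)
    return sizes, runtimes


def main():
    """Read the decompressed log from standard input and report the job count."""
    sizes, runtimes = parse_swf(sys.stdin)
    print(f"{len(sizes)} jobs with known size and runtime")
```

With a call to `main()` added at the end and the script saved under a hypothetical name such as `read_swf.py`, it could be used as `gzip -dc ~exp/www/SDSC-Par-1995-2.1-cln.swf.gz | python3 read_swf.py`.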

Assignment

Analyze the correlation between the job sizes and runtimes in this log.

What you need to do is three things:

Submit

Submit a single pdf file that contains all the following information:

  1. Your names, logins, and IDs.
  2. The requested graph with your scatter plot. Don't forget to label the axes, etc.
  3. The result of the calculation of the correlation coefficient (NOT the covariance).
  4. A short explanation of what the results show, i.e. do you think there is any correlation there.
  5. A short description of what else you did to try and find some correlation. Explain the rationale of why you thought this might be a good idea, and what came out of it. Specifically, did you find that there is some correlation there? Remember that the most important thing is to be truthful regarding your data, so don't be tempted to say you found something if your data doesn't show it.
The submission deadline is Tuesday morning, 22/3/11, because of Purim, but if you can submit earlier, please do.

Please do the exercise in pairs.
