Experimental Methods in Computer Science
Exercise 5 – Correlations in Data
Goals
- Consider alternative ways to quantify correlations in data
- Work with a largish dataset
- Gain more experience in looking at data
Background
Measured data or workload data often come in records comprising several fields,
representing different attributes of the measured event or workload item.
In many cases, there is some form of correlation between different fields.
Our goal is to learn how to uncover and maybe quantify such correlations.
The first step, as usual, is to look at the data.
In this case, this means drawing a scatter plot.
Given two fields, we use the X axis to denote the value of one field, and the Y
axis to denote the value of the other field.
Each record is then represented by a dot whose coordinates are the values of
the two fields.
We can then see whether any pattern emerges.
Another obvious thing to do is to calculate the covariance of the two fields.
This measures the degree to which they deviate from the mean in a coordinated
manner.
Denoting the two values of record $i$ by $X_i$ and $Y_i$, the covariance is
$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
where $\bar{X}$ is the average of the $X_i$ and $\bar{Y}$ is the average of the $Y_i$.
The correlation coefficient is the covariance normalized to the range [-1, 1]
by dividing by the product of the standard deviations of the X values and the
Y values.
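The two definitions above can be written out directly, as in the following minimal sketch.
The function names `covariance` and `correlation` are illustrative choices, not part of the exercise, and the sketch uses the 1/n (population) form of both the covariance and the standard deviations, matching the formula given above.

```python
import math


def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x - mean_x) * (y - mean_y))."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    # The variance of a field is its covariance with itself.
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined when a field is constant")
    return covariance(xs, ys) / (std_x * std_y)
```

For perfectly linear data such as `xs = [1, 2, 3, 4]` and `ys = [2, 4, 6, 8]`, the covariance is 2.5 and the correlation coefficient is 1 up to floating-point rounding.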
In this exercise, we will investigate the possible correlation between the
sizes and runtimes of parallel jobs.
We will use a workload log from the SDSC Paragon machine from 1995.
This log comes from the Parallel Workloads Archive, and its format is explained there.
All you really need to know is:
- This is a simple ASCII file with one line per job
- Lines starting with ; are comment lines and can be ignored
- Fields are separated by white space
- Job runtimes are given in seconds in field 4 (counting from 1)
- Job sizes (number of processors used) are given in field 5 (counting from 1)
To save disk space, you can use the data directly from the course directory
using
gzip -dc ~exp/www/SDSC-Par-1995-2.1-cln.swf.gz | ...
The compressed file size is under 800KB; the full size is nearly 5MB.
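Reading the log according to the rules listed above might look like the following sketch.
The names `parse_swf_lines` and `load_swf` are hypothetical helpers.
Skipping records with a negative runtime or size is an assumption (that such values mark missing data), so check the format description in the archive before relying on it.

```python
import gzip


def parse_swf_lines(lines):
    """Return (runtimes, sizes) extracted from an iterable of log lines."""
    runtimes, sizes = [], []
    for line in lines:
        line = line.strip()
        # Lines starting with ';' are comments; blank lines carry no data.
        if not line or line.startswith(";"):
            continue
        fields = line.split()  # fields are separated by white space
        if len(fields) < 5:
            continue  # too short to hold both fields of interest
        runtime = float(fields[3])  # field 4, counting from 1
        size = int(fields[4])       # field 5, counting from 1
        # Assumption: negative values mark missing data and are dropped.
        if runtime < 0 or size < 0:
            continue
        runtimes.append(runtime)
        sizes.append(size)
    return runtimes, sizes


def load_swf(path):
    """Read a gzip-compressed log directly, without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as f:
        return parse_swf_lines(f)
```

From Python, the course copy could then be read with `load_swf(os.path.expanduser("~exp/www/SDSC-Par-1995-2.1-cln.swf.gz"))`, which plays the same role as the `gzip -dc` pipeline above.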
Assignment
Analyze the correlation between the job sizes and runtimes in this log.
You need to do three things:
- Draw a scatter plot of the data.
Note that there are quite a lot of data points, so this might come out very
crowded, and also lead to a very large image file.
One possible way to avoid problems is to use random sampling and use only a
small part of the data instead of all of it.
- Calculate the correlation coefficient of the job sizes and runtimes.
- Try to come up with some additional metric or method to see if there is
any correlation between job sizes and runtimes.
Hint: I think there is some correlation there.
You are not expected to look for it in the same way that I would.
What I want is a description of some idea you tried, and what came out of it.
It's OK if the idea didn't work out so well as long as it makes sense in
principle.
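The random sampling suggested for the scatter plot could be sketched as follows.
The helper names, the choice of job size for the X axis, the optional log scaling, and the output file name are all assumptions made for illustration.
The plotting part relies on matplotlib, which is imported only when `plot_scatter` is called, and any other plotting tool would do equally well.

```python
import random


def sample_pairs(sizes, runtimes, k, seed=None):
    """Pick k (size, runtime) records uniformly at random, without replacement."""
    pairs = list(zip(sizes, runtimes))
    if k >= len(pairs):
        return pairs  # nothing to thin out
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(pairs, k)


def plot_scatter(pairs, out_path="scatter.png", log_scale=False):
    """Draw one dot per sampled record and save the figure to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([p[0] for p in pairs], [p[1] for p in pairs], s=2)
    if log_scale:
        # Optional: log axes spread out values spanning several orders
        # of magnitude; non-positive values cannot be shown on them.
        ax.set_xscale("log")
        ax.set_yscale("log")
    ax.set_xlabel("Job size (processors)")
    ax.set_ylabel("Runtime (seconds)")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Sampling a few thousand records, for example `sample_pairs(sizes, runtimes, 5000, seed=1)`, keeps the image small, though the sample size that works best is worth trying out on the actual data.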
Submit
Submit a single PDF file that contains all of the following information:
- Your names, logins, and IDs.
- The requested graph with your scatter plot.
Don't forget to label the axes, etc.
- The result of the calculation of the correlation coefficient (NOT the covariance).
- A short explanation of what the results show, i.e., whether you think
there is any correlation there.
- A short description of what else you did to try and find some
correlation.
Explain the rationale of why you thought this might be a good idea, and
what came out of it.
Specifically, did you find that there is some correlation there?
Remember that the most important thing is to be truthful regarding your
data, so don't be tempted to say you found something if your data doesn't
show it.
The submission deadline is Tuesday morning, 22/3/11 (because of Purim),
but if you can submit earlier, please do.
Please do the exercise in pairs.