Experimental Methods in Computer Science – Exercise 7

Exercise 7 – Data Cleaning

Goals

Gain experience with assessing the quality of real-world data. This exercise can be done largely based on intuition.

Background

Data comes from various sources, e.g. system logs. Regrettably, some of it can be of dubious merit due to various logging errors.

We will specifically look at the access log of the France World Cup 1998 web site. This site was up for some 3 months, and received about 1.3 billion hits. This is a lot of data (about 8 GB compressed), so we will focus on a very short excerpt of 1000 seconds from one morning, with only 315,694 hits.

The data is available on-line. It is about 2 MB compressed, and about 10 MB uncompressed. To avoid wasting disk space, you can analyze it directly from where it resides in the course directory by using

gzip -dc ~exp/www/ex7-day66.gz | ...

The format of the data is not the usual format used by HTTP servers, but a reduced format with 5 fields per line:
  1. Timestamp (Unix time, i.e. seconds since 1/1/70).
  2. Object ID (identifies the file that was requested).
  3. Object size (bytes).
  4. HTTP return code (e.g. 200 OK, 304 Not Modified, 404 Not Found).
  5. User ID (identifies where the request came from, but coded to preserve privacy).
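For concreteness, lines in this format can be parsed with a short Python helper. This is just a sketch: the field names are my own labels (the trace carries no header), and it assumes all five fields are whitespace-separated integers, which you should confirm against a few raw lines first.

```python
import gzip
from collections import namedtuple

# Assumed labels for the 5 fields of the reduced log format.
Record = namedtuple("Record", "timestamp object_id size status user_id")

def parse_line(line):
    """Parse one log line, assuming 5 whitespace-separated integer fields."""
    ts, obj, size, status, user = line.split()
    return Record(int(ts), int(obj), int(size), int(status), int(user))

def read_log(path):
    """Stream parsed records straight from the gzipped trace,
    without unpacking it to disk."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield parse_line(line)
```

If some lines turn out not to match this assumption, that is itself worth noting — malformed lines are exactly the kind of thing this exercise is about.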

Assignment

Look at the data, and try to find what is suspicious about it. Obviously this is too much data to look at manually. So you need to write some scripts that tabulate the distributions of various fields, the correlations between them, or whatever you can think of that might tell you something about the data.
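As a starting point, tabulating the distribution of a single field takes only a few lines of Python. This is an illustrative sketch: the field positions follow the 5-field format above, and the sample lines are invented for demonstration.

```python
from collections import Counter

def field_distribution(lines, field):
    """Count how often each value of one whitespace-separated field occurs.
    Field index: 0=timestamp, 1=object ID, 2=size, 3=return code, 4=user ID."""
    return Counter(line.split()[field] for line in lines)

# Made-up sample lines in the 5-field format, just to show the idea.
sample = [
    "894139000 12 1024 200 42",
    "894139001 12 1024 304 42",
    "894139001 99 2048 200 7",
]
codes = field_distribution(sample, 3)   # distribution of return codes
users = field_distribution(sample, 4)   # hits per user
```

Running the same function over each field of the full trace, and then eyeballing (or plotting) the resulting distributions, is a reasonable first pass before looking at correlations between fields.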

Hint: I know of one problem with this data that I consider potentially serious, and there are several other interesting things there, so you have something to look for. On the other hand, don't just compile a long list of workload features that you don't like.

Submit

Submit a single pdf file that contains the following information:

  1. Your names, logins, and IDs
  2. A short explanation of how you approached this assignment, i.e. what you looked at and how you analyzed the data.
  3. If you did not find anything suspicious, simply describe the data you saw in detail using graphs as appropriate.
  4. If you found something, provide a short explanation of what you found. Specifically, for each suspicious thing you saw, answer the following questions:
    1. What exactly was suspicious?
    2. Do you have a speculation how this data got into the log?
    3. What would you suggest to do about it when using the log? Consider the following contexts: modeling the arrivals of requests, modeling the file system on web servers, and modeling the level of activity of different users.
  5. Include any graphs or data you think are relevant. Don't forget labels and scales on the axes, a legend, etc.
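For the arrival-modeling context in particular, one simple sanity check is to bucket hits by second and look for seconds with no activity at all, or with far more activity than their neighbors. A minimal sketch (the sample data is invented):

```python
from collections import Counter

def hits_per_second(lines):
    """Bucket hits by their integer timestamp (field 0)."""
    return Counter(int(line.split()[0]) for line in lines)

def silent_seconds(counts):
    """Seconds within the trace's span that recorded no hits at all."""
    lo, hi = min(counts), max(counts)
    return [t for t in range(lo, hi + 1) if t not in counts]
```

In a trace averaging hundreds of hits per second, a completely silent second, or one with orders of magnitude more hits than its neighbors, deserves a closer look.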
Submission deadline is Monday morning, 4/4/11, so I can give feedback in class on Tuesday.

Please do the exercise in pairs.
