This exercise can be done largely based on intuition.
Data comes from various sources, e.g. system logs. Regrettably, some of it is of dubious merit due to various logging errors.
We will specifically look at the access log of the France World Cup 1998 web site. This site was up for some 3 months, and received about 1.3 billion hits. This is a lot of data (about 8 GB compressed), so we will focus on a very short excerpt of 1000 seconds from one morning, with only 315,694 hits.
The data is available on-line. It is about 2 MB compressed, and about 10 MB when uncompressed. To avoid wasting disk space, you can analyze it directly from where it resides in the course directory by using
gzip -dc ~exp/www/ex7-day66.gz | ...

The format of the data is not the usual format used by HTTP servers, but a reduced format with 5 fields per line:
Look at the data, and try to find what is suspicious about it. Obviously this is too much data to look at manually, so you need to write some scripts that tabulate the distributions of various fields, the correlations between them, or whatever else you can think of that might tell you something about the data.
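For example, here is a minimal sketch of such a script in Python (the name tabulate.py is just a placeholder, and the fields are assumed to be whitespace-separated). It reads the log from stdin, so it can be fed directly from the gzip pipe shown above, and simply counts how often each value appears in each field:

#!/usr/bin/env python3
# Minimal sketch: tabulate the value distribution of each whitespace-separated
# field read from stdin. Usage (tabulate.py is a placeholder name):
#   gzip -dc ~exp/www/ex7-day66.gz | python3 tabulate.py
# No meaning is assumed for the fields; the script just counts how often each
# value appears in each column and prints the most common ones.
import sys
from collections import Counter, defaultdict

counts = defaultdict(Counter)   # field index -> value -> number of occurrences
records = 0

for line in sys.stdin:
    fields = line.split()
    records += 1
    for i, value in enumerate(fields):
        counts[i][value] += 1

print(f"{records} records")
for i in sorted(counts):
    c = counts[i]
    print(f"field {i}: {len(c)} distinct values; 5 most common:")
    for value, n in c.most_common(5):
        print(f"    {value:20}  {n}")

Joint distributions of pairs of fields can be tabulated in the same way, by counting tuples of field values rather than single values.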
Hint: I know of one problem with this data that I consider potentially serious, and there are several other interesting things there. So there is something to look for. On the other hand, don't just make a long list of workload features that you don't like.
Submit a single PDF file that contains the following information:
Please do the exercise in pairs.