In this exercise we will characterize the workload on a busy web server.
Web servers maintain activity logs with information about all the requests they serve. This can be used to learn about their workload. Luckily, there is a standard format for these logs. Each request is represented by one line of text, with the following fields:
One problem is of course getting data. Many years ago, some log files were made public for research, and we'll use one of them: part of a log from the web site of the 1998 Soccer World Cup in France (France beat Brazil 3-0 in the final). It is available from the Internet Traffic Archive, but it's big: there were about 1.35 billion requests over some 3 months. So we'll use only 1000 seconds' worth of data, which is available here. It is about 2 MB compressed, and about 10 MB uncompressed. To avoid wasting disk space, you can analyze it directly from where it resides in the course directory by using
gzip -dc ~perf/www/ex7-day66.gz | ...
on any Linux station at the university.
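If you would rather process the log in a script than in a shell pipeline, the same idea of decompressing on the fly can be sketched in Python. This is only a sketch: the function name `iter_log_lines` is made up for illustration, and it assumes the path given above is readable from where the script runs.

```python
import gzip
import os


def iter_log_lines(path):
    """Yield the lines of a gzip-compressed text log, one at a time.

    The file is decompressed on the fly, so no uncompressed copy
    is ever written to disk.
    """
    # expanduser turns a leading ~ or ~user into the matching home directory.
    full_path = os.path.expanduser(path)
    with gzip.open(full_path, "rt", encoding="ascii", errors="replace") as f:
        for line in f:
            yield line.rstrip("\n")
```

For example, `for line in iter_log_lines("~perf/www/ex7-day66.gz"): ...` would visit every request in the log without unpacking the file.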
The log is an ASCII file with one line per request, but it uses a reduced and sanitized format with 5 fields rather than the one described above:
Hint: I know of at least one serious problem, so you have something to look for. On the other hand, don't just make a long list of workload features that you don't like.
Many things can be learnt from the log, like request rate per time of day, distribution of unique file sizes, distribution of request sizes, HTTP return codes, etc. We'll focus on the distribution of popularity of different files.
For this you first need to count how many times each file is requested. But should all the data in the log be used as is? Consider whether you should be selective in using the data.
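The counting step might look roughly like the sketch below. Several things here are assumptions rather than facts about the log: the fields are taken to be separated by whitespace, `FILE_FIELD` is a placeholder column index that you must set from the field list, and `keep` is a hypothetical hook for whatever selection rule you decide on. The sketch does not tell you what that rule should be.

```python
from collections import Counter

# ASSUMPTION: placeholder column index of the file identifier.
# Set it according to the field list of the log format.
FILE_FIELD = 0


def count_requests(lines, file_field=FILE_FIELD, keep=lambda fields: True):
    """Count how many times each file appears in an iterable of log lines.

    keep is a predicate on the split fields of one line; only lines
    for which it returns True are counted.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= file_field:
            continue  # skip blank or malformed lines
        if keep(fields):
            counts[fields[file_field]] += 1
    return counts
```

Passing a different `keep` function lets you compare the popularity counts with and without the requests you chose to exclude.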
Given data you are happy with, display the popularity data using the Zipf count-rank plot: sort the files from the most popular to the least popular, and plot the count of how many times the file was accessed as a function of its rank in this ordered list. Use log-log axes.
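One possible way to produce such a plot is sketched below, assuming `counts` is a mapping from file identifier to the number of requests for it. The function names and the output file name `zipf.png` are made up for illustration, and the plotting part assumes the third-party matplotlib package is installed.

```python
def count_rank(counts):
    """Return (ranks, sorted_counts), most popular file first.

    Ranks start at 1, so the most popular file has rank 1.
    """
    sorted_counts = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(sorted_counts) + 1))
    return ranks, sorted_counts


def plot_count_rank(counts, out_file="zipf.png"):
    """Save a log-log plot of request count against popularity rank."""
    # matplotlib is imported here so that count_rank works without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    ranks, freqs = count_rank(counts)
    fig, ax = plt.subplots()
    ax.loglog(ranks, freqs, marker=".", linestyle="none")
    ax.set_xlabel("rank")
    ax.set_ylabel("number of requests")
    fig.savefig(out_file)
    plt.close(fig)
```

Files with equal counts get consecutive ranks here, which shows up as horizontal steps at the low-count end of the plot.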
Use Moodle to submit a report on your work, in PDF format, with the following data.