Topics in Performance Evaluation – Exercise 7

Exercise 7 – Workload on a Web Server

In this exercise we will characterize the workload on a busy web server.

Background

Web servers maintain activity logs that record every request they serve. These logs can be used to learn about the server's workload. Luckily, there is a standard format for them. Each request is represented by one line of text, with the following fields:

IP address of requesting client
Client identity (unreliable, typically replaced by "-")
User ID (also typically "-")
Timestamp with format [day/month/year:hour:minute:second zone]
The quoted request string, including the requested action, the path, and the protocol
The HTTP return code (200 is OK, 404 is page not found, etc.)
The size of the returned data, in bytes
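As an illustration of the seven fields listed above, here is a minimal Python sketch that splits one such line with a regular expression. It assumes the month is written as a three-letter English abbreviation (as in the widely used Common Log Format) and that a size of "-" means no data was returned. The names `CLF_PATTERN` and `parse_clf_line` and the example line are made up for this sketch and are not taken from the World Cup log.

```python
import re
from datetime import datetime

# One log line: ip, client identity, user id, [timestamp], "request", code, size.
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict with the seven fields, or None if the line does not match."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Assumed timestamp layout: day/MonthAbbrev/year:hour:minute:second zone
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # Assumption: "-" in the size field is treated as 0 bytes.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

# Made-up example line for demonstration only.
example = ('192.0.2.1 - - [30/Apr/1998:21:30:17 +0000] '
           '"GET /images/logo.gif HTTP/1.0" 200 1504')
rec = parse_clf_line(example)
print(rec["ip"], rec["request"], rec["status"], rec["size"])
```

Lines that do not match return None instead of raising an error, so a script can count them rather than stop at the first oddity.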

One problem is of course getting data. Many years ago, some log files were made public for research, and we'll use one of them: part of a log from the web site of the 1998 Soccer World Cup in France (France beat Brazil 3-0 in the final). It is available from the Internet Traffic Archive, but it's big: there were about 1.35 billion requests over some 3 months. So we'll use only 1000 seconds' worth of data, which is available here. It is about 2 MB compressed, and about 10 MB uncompressed. To avoid wasting disk space, you can analyze it directly from where it resides in the course directory by using

gzip -dc ~perf/www/ex7-day66.gz | ...

on any Linux station at the university.
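If you prefer to do the decompression inside a script rather than in the shell pipeline, the following Python sketch is one possible way. The helper names `open_log` and `count_lines` are hypothetical, and the demo writes a tiny made-up file to a temporary directory instead of touching the course file.

```python
import gzip
import os
import sys
import tempfile

def open_log(path=None):
    """Return an iterable of text lines.

    With no path, read the already decompressed stream from standard input
    (the `gzip -dc file.gz | script` style). With a path ending in .gz,
    decompress on the fly without writing the expanded file to disk.
    """
    if path is None:
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, mode="rt", encoding="ascii", errors="replace")
    return open(path, encoding="ascii", errors="replace")

def count_lines(lines):
    """Count lines without loading the whole file into memory."""
    return sum(1 for _ in lines)

# Self-contained demo on a small made-up compressed file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gz")
    with gzip.open(demo_path, "wt", encoding="ascii") as f:
        f.write("line one\nline two\nline three\n")
    with open_log(demo_path) as lines:
        demo_count = count_lines(lines)
print(demo_count)
```

Either way the data is streamed line by line, so the uncompressed log never needs to be stored.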

Assignment

Parse and check the data.
The log is an ASCII file with one line per request, but it uses a reduced and sanitized format with 5 fields rather than the 7 described above:
1. Timestamp (Unix time, i.e. seconds since 1/1/70).
2. Object ID (identifies the file that was requested).
3. Object size (bytes).
4. HTTP return code (see here for explanations).
5. User ID (identifies where the request came from, but coded to preserve privacy).
Look for problems with the data. Is there anything suspicious about it? Obviously this is too much data to look at manually, so you need to write some scripts that tabulate the distributions of various fields, the correlations between them, or whatever else you can think of that might tell you something about the data.
Hint: I know of at least one serious problem, so you have something to look for. On the other hand, don't just make a long list of workload features that you don't like.
Tabulate a selected characteristic of the workload.
Many things can be learnt from the log, like request rate per time of day, distribution of unique file sizes, distribution of request sizes, HTTP return codes, etc. We'll focus on the distribution of popularity of different files.
For this you first need to count how many times each file is requested. But should all the data in the log be used as is? Consider whether you should be selective in using the data.
Given data you are happy with, display the popularity data using the Zipf count-rank plot: sort the files from the most popular to the least popular, and plot the count of how many times the file was accessed as a function of its rank in this ordered list. Use log-log axes.
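The two steps above (tabulating a field, then building the count-rank data) can be sketched in Python as follows. This is only a starting point under stated assumptions: the five fields are assumed to be separated by whitespace, object and user IDs are kept as strings, and the `keep` argument is a hypothetical hook for whatever selection you decide on (by default it keeps everything). The sample records are made up.

```python
from collections import Counter

def parse_record(line):
    """Split one line into the five fields; return None if it is malformed."""
    parts = line.split()  # assumption: whitespace-separated fields
    if len(parts) != 5:
        return None
    try:
        return {
            "timestamp": int(parts[0]),
            "object_id": parts[1],
            "size": int(parts[2]),
            "status": int(parts[3]),
            "user_id": parts[4],
        }
    except ValueError:
        return None

def tabulate(records, field):
    """Count how often each value of one field occurs."""
    return Counter(rec[field] for rec in records)

def count_rank(records, keep=lambda rec: True):
    """Return (rank, count) pairs, most popular object first.

    `keep` is a placeholder predicate for your own data selection.
    """
    counts = Counter(rec["object_id"] for rec in records if keep(rec))
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

# Made-up sample records, for demonstration only.
sample = """\
893971817 17 1024 200 5
893971817 17 1024 200 9
893971818 42 2048 200 5
893971818 17 1024 304 7
893971819 99 512 404 5
893971819 42 2048 200 3
this line is malformed
"""
parsed = [parse_record(line) for line in sample.splitlines()]
records = [r for r in parsed if r is not None]
malformed = len(parsed) - len(records)  # count bad lines, do not hide them

status_counts = tabulate(records, "status")
ranked = count_rank(records)
print(malformed, sorted(status_counts.items()))
print(ranked)
```

The (rank, count) pairs can then be passed to any plotting tool with both axes set to a logarithmic scale, for example matplotlib's `loglog`. Keeping the selection in a separate predicate makes it easy to report how much data was removed.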

Submit

Use Moodle to submit a report on your work, in PDF format, containing the following:

Your names, logins, and IDs
Any basic data quality problems you found with the original log. What exactly did you find? Why is this a problem? What if anything can be done about it?
Any additional selection you decided to perform as part of the analysis. What data did you decide not to use? Why? How much data was removed?
The count-rank graph you produced, and what you learned about the file popularity. What is the shape of the graph? What can you learn from it about the distribution?

The submission deadline is Monday, 12 May 2014, so that I can give feedback in class on Tuesday.