Experimental Methods in Computer Science

Exercise 8 – Burstiness vs. Poisson Modeling

Goals

The goal of this exercise is to appreciate the difference between the Poisson model of arrivals and the burstiness observed in real workloads, and to gain experience with aggregating and plotting arrival data at different time scales.

Background

The pattern in which work arrives at a system is very important. Is it more or less uniform over time? Are there times with much higher loads than others? In this exercise we will study this issue.

It has traditionally been common to model work arrival as a Poisson process. This simply means that the distribution of interarrival times (the time from one arrival to the next) is exponential, or alternatively, that arrivals are spread uniformly at random over time. For example, packets flowing through a network were modeled by generating exponential random variates and using them as the delays between one packet and the next.

A major experimental finding, however, is that in many cases the Poisson model does not reflect reality. In particular, when a Poisson process is aggregated, it tends to average out. Thus if we look at the number of arrivals in successive 10-second intervals, we will see less variability than in the arrivals in successive 1-second intervals. If we aggregate the data into 100-second intervals, we will get even less variability. But aggregating real arrivals does not always display such a reduction in variability. Instead, burstiness is observed at many different time scales.
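
To see what "averaging out" means concretely, here is a small illustrative sketch (in Python, purely for intuition; it is not part of the assignment, and all the names and parameters are my own choices). It generates a Poisson process and shows how the relative variability of the counts shrinks as the bins get larger:

    # Illustrative only: counts of a Poisson process become less variable
    # as the aggregation interval grows.
    import math
    import random
    import statistics

    def poisson_arrivals(rate, duration):
        """Arrival times of a Poisson process with the given rate (arrivals per second)."""
        t, arrivals = 0.0, []
        while True:
            t += -math.log(1.0 - random.random()) / rate   # exponential interarrival
            if t >= duration:
                return arrivals
            arrivals.append(t)

    def counts_per_bin(arrivals, duration, bin_size):
        """Number of arrivals in each successive interval of bin_size seconds."""
        counts = [0] * int(duration // bin_size)
        for t in arrivals:
            counts[int(t // bin_size)] += 1
        return counts

    duration = 100_000                                     # seconds of simulated time
    arrivals = poisson_arrivals(rate=2.0, duration=duration)
    for bin_size in (1, 10, 100):
        counts = counts_per_bin(arrivals, duration, bin_size)
        cv = statistics.stdev(counts) / statistics.mean(counts)
        print(f"{bin_size:4}-second bins: coefficient of variation = {cv:.3f}")

The coefficient of variation of the counts should drop markedly with each factor-of-10 increase in bin size; real, bursty arrival data often does not show this drop.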

In this exercise, we will try to see whether this holds for different computer workloads, and at what time scales.

Assignment

  1. Collect data.

    You need two data sets: real arrival data, and arrivals conforming to a Poisson model. For the real data, use the Unix lastcomm command. This gives a list of all processes executed on the system, how long they ran, and when they terminated. As the timestamps are given at minute resolution, you need data from at least several days in order to be able to aggregate it.

    Note that lastcomm is often configured to return data only from the last day, and the data from each day is maintained in a separate file. To get a longer series, you therefore need to invoke lastcomm on different log files. You can see the available files under /var/log/account/ on Linux systems (specifically, on inferno these files have been made accessible to all; you can get to them by doing "rsh pita"). As the files are compressed, you need to run something like

      gzip -dc /var/log/account/pacct.X.gz | lastcomm -f- 
    
    where X is the file number (note that the numbering is in reverse: 1 is yesterday, 2 is the day before that, etc.). Don't forget to first look at the data and form an opinion on whether all of it should be used.

    Given a large set of arrivals, count how many arrivals there were and compute the average interarrival time for this data. Now create a synthetic set of arrivals with the same general characteristics, i.e. the same total number of arrivals and the same average interarrival time, but with exponentially distributed interarrival times. Exponentially distributed random variates with mean m can be created using the formula t = -m log(u), where log is the natural logarithm and u is a uniform random number in the range (0,1], which you can get from the library function random() (scaled to the unit interval if necessary).
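
    For instance, a minimal sketch of generating the synthetic arrivals (in Python; the function name, parameters, and the use of Python's random module are illustrative choices, not requirements of the exercise):

      # Illustrative sketch: synthetic Poisson arrivals matching the real data's
      # number of arrivals and mean interarrival time.
      import math
      import random

      def synthetic_arrivals(n_arrivals, mean_interarrival):
          """Return n_arrivals arrival times with exponential interarrival times."""
          t, arrivals = 0.0, []
          for _ in range(n_arrivals):
              u = 1.0 - random.random()               # uniform in (0,1], avoids log(0)
              t += -mean_interarrival * math.log(u)   # t = -m log(u), as in the text
              arrivals.append(t)
          return arrivals

      # Hypothetical example: 5000 real arrivals with a mean interarrival of 1.7 minutes
      synthetic = synthetic_arrivals(5000, 1.7)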

  2. Aggregate.

    For each data set, create aggregated time series at different granularities: arrivals per minute, arrivals per 10 minutes, arrivals per 100 minutes, and if you have lots of data, maybe also arrivals per 1000 minutes. Plot these time series for the two data sets (a total of 6-8 plots, on one page). Plotting a time series means showing successive values; for example, when using minute granularity, you plot the arrivals in the first minute, the second minute, the third minute, and so on. A sketch of one way to compute such aggregated series appears after this item.

    When performing aggregation, you are actually "zooming out" from the data. Thus the number of points in each plot should be the same. The coarse plots, say at 100-minute resolution, will show all the data. The medium plots, at 10-minute resolution, will only show a selected section that is 1/10 of the whole. The high-resolution plots will only show 1/100 of the whole.

    Don't forget to label the plots and the axes.
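
    As mentioned above, here is a minimal sketch of the aggregation step (in Python; it assumes a list named arrivals holding arrival times in minutes measured from the start of the log, which is my own convention, not something the exercise dictates):

      # Illustrative sketch: turn a list of arrival times (in minutes from the start
      # of the log) into counts per bin at several granularities, for plotting.
      def aggregate(arrival_times, bin_minutes):
          """Number of arrivals in each successive bin of bin_minutes minutes."""
          counts = [0] * (int(max(arrival_times) // bin_minutes) + 1)
          for t in arrival_times:
              counts[int(t // bin_minutes)] += 1
          return counts

      per_1   = aggregate(arrivals, 1)
      per_10  = aggregate(arrivals, 10)
      per_100 = aggregate(arrivals, 100)

      # Each plot should show the same number of points, so the finer series are
      # truncated to a selected 1/10 or 1/100 section of the data.
      n = len(per_100)
      plot_100 = per_100           # all the data, at coarse resolution
      plot_10  = per_10[:n]        # a 1/10 section, at medium resolution
      plot_1   = per_1[:n]         # a 1/100 section, at high resolution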

As an alternative to lastcomm data, you may use any other data set based on system logs. Many such logs are available under /var/log/. You may use any one that is long enough, even if you do not know exactly what is being logged; it is enough that each record has a timestamp. However, note the resolution of the timestamps: they may be in seconds rather than minutes.
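
If you go this route, you will need to extract the timestamp from each record yourself. The exact format differs from log to log, so check your file first; as an illustration only, a syslog-style line that starts with a month, day, and time could be parsed roughly as follows (the file name and the assumed year are hypothetical):

    # Illustrative only: parse the leading "Mon DD HH:MM:SS" field of syslog-style
    # lines such as "Jun  1 08:15:32 inferno sshd[1234]: ...".
    from datetime import datetime

    def parse_syslog_time(line, year=2009):
        """Return a datetime for the line's leading timestamp, or None on failure."""
        try:
            stamp = " ".join(line.split()[:3])       # e.g. "Jun 1 08:15:32"
            return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except (ValueError, IndexError):
            return None

    with open("/var/log/messages") as f:             # hypothetical choice of log
        times = [t for t in (parse_syslog_time(l) for l in f) if t is not None]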

Submit

Submit a single PDF file that contains all of the following information:

  1. Your names, logins, and IDs.
  2. Your data: from which machine did it come, what period of time did it span, how many records were there, and whether you used all of it (and if not, why not).
  3. Your conclusion: what is a good model for job arrivals to the local machine? Are the arrivals (1) bursty at different time scales, (2) consistent with a Poisson process, or (3) affected by some other effect, and if so, what is it? Or is it perhaps some combination of these three options?
  4. The output plot showing both datasets at the different scales.
  5. The programs/scripts you used to analyze the lastcomm data and create the synthetic data.

Submission deadline is Monday morning, 1/6/09, so that I can give feedback in class on Tuesday.

Please do the exercise in pairs.
