
Experimental Methods in Computer Science

Exercise 8 – Burstiness vs. Poisson Modeling

Goals

Study the pattern in which work arrives at a computer system: is it reasonably modeled by a Poisson process, or is it bursty at many different time scales? Along the way, practice aggregating and plotting time series at different resolutions.

Background

The pattern in which work arrives at a system is important: is the load more or less uniform over time, or are there periods with much higher load than others? In this exercise we will study this issue.

It has traditionally been common to model work arrival as a Poisson process. This simply means that the distribution of interarrival times (the time from one arrival to the next) is exponential, or equivalently, that arrivals occur independently and uniformly over time. For example, packets flowing through a network were modeled by generating exponential random variates and using them as the delays between successive packets.

A major experimental finding, however, is that in many cases the Poisson model does not reflect reality. When a Poisson process is aggregated, it tends to average out: the count in an interval has a mean proportional to the interval length but a standard deviation proportional only to its square root, so the relative variability shrinks as the intervals grow. Thus the counts in successive 10-second intervals show less relative variability than the counts in 1-second intervals, and counts in 100-second intervals show even less. Aggregating real arrival data, however, does not always display as much reduction in relative variability; as a result, burstiness is observed at many different time scales.

In this exercise, we will try to see whether this holds for different computer workloads, and at what time scales.
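
To illustrate the averaging-out effect, here is a minimal sketch (in Python with numpy, an arbitrary choice; the rate, sample size, and bin sizes are made up for the illustration) that simulates a Poisson process by summing exponential interarrival times and reports the relative variability of the per-bin counts at several bin sizes.

      import numpy as np

      rng = np.random.default_rng(0)
      rate = 5.0                    # arrivals per second (arbitrary)
      n = 200000                    # number of arrivals to simulate

      # A Poisson process: arrival times are cumulative sums of exponential interarrivals.
      arrival_times = np.cumsum(rng.exponential(1.0 / rate, n))

      for bin_size in (1, 10, 100):                       # seconds per bin
          edges = np.arange(0, arrival_times[-1], bin_size)
          counts, _ = np.histogram(arrival_times, bins=edges)
          cv = counts.std() / counts.mean()               # relative variability
          print("bin = %4d s   mean count = %8.1f   CV = %.3f" % (bin_size, counts.mean(), cv))

For the simulated Poisson process the CV should drop by roughly a factor of sqrt(10) with each tenfold increase in bin size; the exercise checks whether real arrival data behaves the same way.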

Assignment

  1. Collect data.

    You need two data sets: real arrival data, and arrivals conforming to a Poisson model. For the real data, use the Unix lastcomm command, which lists the processes executed on the system, how long they ran, and when they terminated. As the timestamps have a resolution of minutes, you need data spanning at least several days in order to be able to aggregate it.

    Note that lastcomm is often configured to return data only from the last day, with each day's data maintained in a separate file. To get a longer series, you therefore need to invoke lastcomm on the different log files. You can see the available files under /var/log/account/ on Linux systems (specifically, on inferno these files have been made accessible to all -- you can get to them by doing "rsh inferno"). As the files are compressed, you need to run something like

      gzip -dc /var/log/account/pacct.X.gz | lastcomm -f- 
    
    where X is the file number (note that the numbering runs backwards: 1 is yesterday, 2 is the day before that, etc.). Don't forget to first look at the data and form an opinion on whether all of it should be used at all. (In case of emergency, one day's lastcomm output is available at ~exp/lc-inferno.gz.)

    Given a large set of arrivals, count how many arrivals there were and compute the average interarrival time for this data. Now create a synthetic set of arrivals with the same general characteristics, i.e. the same total number of arrivals and the same average interarrival time, but with exponentially distributed interarrival times. Exponentially distributed random variates with a mean m can be created using the formula t = -m log(u), where log is the natural logarithm and u is a uniform random number in the range (0,1] (for example, by scaling the output of the library function random(); just make sure u is never 0). A sketch of such a generator appears after the assignment.

  2. Aggregate.

    For each data set, create aggregated time series at different granularities: arrivals per minute, arrivals per 10 minutes, arrivals per 100 minutes, and, if you have lots of data, maybe also arrivals per 1000 minutes. Plot these time series for the two data sets (a total of 6-8 plots, all on one page). Plotting a time series means showing successive values; for example, at minute granularity you plot the arrivals in the first minute, the second minute, the third minute, and so on.

    When performing aggregation you are in effect "zooming out" from the data, so the number of points in each plot should be the same. The coarse plots, say at 100-minute resolution, will show all the data; the medium plots, at 10-minute resolution, will show only a selected section covering 1/10 of the whole; and the high-resolution (1-minute) plots will show only 1/100 of it.

    Note that what we are interested in is the relative variability. Therefore you should also scale the Y axis to best show it: instead of a scale from 0 to some maximal value, use a scale that reflects the observed variability, approximately from the lowest observed value to the highest. Try to find how the range (largest - smallest) changes as a function of the aggregation level; a sketch of the binning and range computation also appears after the assignment.

    Don't forget to label the plots and the axes.
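
For concreteness, here is a minimal sketch of the two steps in Python (an arbitrary choice; any scripting language will do). It assumes the real arrival times have already been extracted from the lastcomm output into a file of minutes since the start of the log; the file name real_arrivals.txt, the function names, and the use of numpy are all illustrative, not required.

      import math
      import random
      import numpy as np

      def synthetic_arrivals(n, mean_interarrival):
          """Generate n arrival times whose interarrival times are exponential with the given mean."""
          t = 0.0
          arrivals = []
          for _ in range(n):
              u = 1.0 - random.random()            # uniform in (0,1], so log() is safe
              t += -mean_interarrival * math.log(u)
              arrivals.append(t)
          return arrivals

      def aggregate(arrival_times, bin_size):
          """Count arrivals in successive bins of length bin_size (same time unit as the data)."""
          edges = np.arange(0, max(arrival_times) + bin_size, bin_size)
          counts, _ = np.histogram(arrival_times, bins=edges)
          return counts

      # Arrival times in minutes since the start of the log, extracted separately (hypothetical file).
      real_minutes = sorted(np.loadtxt("real_arrivals.txt"))
      mean_gap = (real_minutes[-1] - real_minutes[0]) / (len(real_minutes) - 1)
      poisson_minutes = synthetic_arrivals(len(real_minutes), mean_gap)

      for bin_size in (1, 10, 100, 1000):
          for name, data in (("real", real_minutes), ("Poisson", poisson_minutes)):
              counts = aggregate(data, bin_size)
              print("%-8s bins of %5d min:  range = %d" % (name, bin_size, counts.max() - counts.min()))

The per-bin counts are also exactly what you need to plot as the time series; remember to plot only a 1/10 or 1/100 section at the finer granularities so that all plots have the same number of points.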

As an alternative to lastcomm data, you may use any other data set based on system logs. Many different logs are available under /var/log/. You may use any log that is long enough, even if you do not know exactly what is being logged; it is enough that each record has a timestamp. However, note the resolution of the timestamps: they may be in seconds rather than in minutes.
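
If you go this route, the only extra work is extracting the timestamps. As a rough sketch, the following assumes syslog-style lines that start with a "Mon DD HH:MM:SS" timestamp (the exact format differs between logs, so adjust the parsing accordingly); it converts each timestamp to seconds since the first record.

      import sys
      import time

      def timestamps_in_seconds(lines):
          """Extract syslog-style 'Mon DD HH:MM:SS' prefixes as seconds since the first record."""
          stamps = []
          for line in lines:
              try:
                  t = time.strptime(line[:15], "%b %d %H:%M:%S")
              except ValueError:
                  continue                 # skip lines without a recognizable timestamp
              # The year is missing; this is good enough as long as the log does not cross a year boundary.
              stamps.append(((t.tm_yday * 24 + t.tm_hour) * 60 + t.tm_min) * 60 + t.tm_sec)
          return [s - stamps[0] for s in stamps]

      if __name__ == "__main__":
          seconds = timestamps_in_seconds(sys.stdin)
          print("%d records spanning %d seconds" % (len(seconds), seconds[-1]))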

Submit

Submit a single pdf file that contains all the following information:

  1. Your names, logins, and IDs.
  2. Your data: from which machine did it come, what period of time did it span, how many records were there, and whether you used all of it (and if not, why not).
  3. The plot described above, showing both datasets at the different scales.
  4. Your results: how does the relative variability change with aggregation? Show this in a table, giving the level of aggregation, and the (largest - smallest) range for the real data and for the Poisson data. Also plot this (variability range as a function of aggregation, for both datasets) on log-log axes. Is the variability reduced according to a power law?
  5. Answer the question: apart from burstiness, is there some other effect at play?
  6. The programs/scripts you used to analyze the lastcomm data and create the synthetic data.

Submission deadline is Monday morning, 11/4/11, so I can give feedback in class on Tuesday.

Please do the exercise in pairs.
