Experimental Methods in Computer Science – Exercise 5

Exercise 5 – The Distribution of Patience and Waiting

Goals

Background

How long are users willing to wait for the computer? This is obviously not a single number but a distribution: some users are willing to wait longer than others. We can get empirical data on how long users have waited and when they decided to abandon the system in frustration. But what if most of the users actually got service? We don't know what their threshold of frustration is, because they did not reach it. But we do know that the threshold is above a certain value: it is larger than the time they actually waited.

Such partial data is called censored data, because some samples are "cut short". Censored data can be used when estimating an empirical distribution using the Kaplan-Meier formula. This goes as follows.

Denote the sampled values by $x_i$. Let $d_i$ represent the number of real samples with magnitude $x_i$ (that is, excluding any censored data items). Let $n_i$ represent the number of samples with magnitude larger than or equal to $x_i$ (both real and censored). The hazard, or risk of surviving for time $x_i$ and then dying, is then $d_i / n_i$. Hence the probability of surviving beyond $x_i$ is $1 - d_i/n_i$. The probability of surviving an arbitrary value $x$ is then the product of the probabilities of surviving all smaller values:

$$S(x) = \prod_{x_i < x} \left[ 1 - \frac{d_i}{n_i} \right]$$

The empirical distribution function is then F(x) = 1 - S(x).
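The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference solution: the function name `kaplan_meier`, the `(value, censored)` input format, and the toy samples are all assumptions made for the example and do not come from the call center data.

```python
from bisect import bisect_left
from collections import Counter


def kaplan_meier(samples):
    """Return a list of (x_i, S) pairs, one per distinct uncensored value.

    samples: iterable of (value, censored) pairs (a hypothetical format);
             censored=True means the true value is only known to exceed `value`.
    S is the running product of [1 - d_i/n_i] including the factor for x_i,
    i.e. the survival probability for any x just beyond x_i.
    """
    samples = list(samples)
    # d_i: number of real (uncensored) samples with value exactly x_i
    d = Counter(value for value, censored in samples if not censored)
    # All values, real and censored, sorted so that n_i can be found by bisection
    all_values = sorted(value for value, _ in samples)
    total = len(all_values)

    curve = []
    survival = 1.0
    for x in sorted(d):
        # n_i: number of samples (real and censored) with value >= x_i
        n = total - bisect_left(all_values, x)
        survival *= 1.0 - d[x] / n
        curve.append((x, survival))
    return curve


# Made-up toy samples: True marks a censored item
toy = [(1, False), (2, True), (3, False), (3, False), (5, True)]
for x, s in kaplan_meier(toy):
    print(f"x_i = {x}: S = {s:.4f}, F = {1 - s:.4f}")
```

In the toy input, the censored sample at 2 contributes no factor of its own, but it is still counted in $n_i$ for $x_i = 1$; this is how censored items influence the estimate.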

Note that the above derivation relies on the assumption that the censoring is done at random. If this is not the case, the Kaplan-Meier formula is not applicable. One of your assignments in this exercise is to hazard a guess as to whether this assumption applies to our data.

We don't have good data of this kind for a computer system, but we do have data from a large bank's call center. Specifically, we'll use data from December 1999 (about 1.1 MB compressed, or 4.5 MB uncompressed). Rather than copying the file, you can read it directly from where it is stored using

gzip -dc ~exp/www/ex5-dec.gz | ...

The data comes from a course on service engineering at the Technion. Each service request is represented by a line with 17 space-separated fields; the ones that interest us are

Assignment

Analyze the bank call center data. Note that obtaining service (an outcome of AGENT) is regarded as a censored data point, because the user was served before giving up: we only know that their patience exceeds the time they actually waited.
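As a rough illustration of this rule, the sketch below marks each record as censored or not according to its outcome. It assumes the waiting time and outcome have already been extracted from each line into `(wait_seconds, outcome)` tuples. The function name `mark_censored`, the tuple layout, the `"HANG"` label, and the sample records are all hypothetical; only the `AGENT` outcome is taken from the text above.

```python
def mark_censored(records):
    """Convert (wait_seconds, outcome) tuples into (value, censored) pairs.

    A record is censored when the outcome is "AGENT": the user got service,
    so their patience is only known to be larger than the time they waited.
    Treating every other outcome as an uncensored sample is an assumption
    of this sketch.
    """
    return [(wait, outcome == "AGENT") for wait, outcome in records]


# Made-up records for illustration only (not taken from the data file);
# "HANG" is a placeholder label for a user who gave up.
toy_records = [(12, "AGENT"), (45, "HANG"), (3, "AGENT"), (60, "HANG")]
pairs = mark_censored(toy_records)
print(pairs)
print("censored:", sum(c for _, c in pairs), "uncensored:", sum(not c for _, c in pairs))
```

Counting the two classes this way is also a quick check on how much of the data is censored before drawing any CDF.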

What you need to do is simply to draw and compare the CDFs of three distributions:

Hint: also look at how many samples fall in each of the two classes, censored or not, and how many of them either get service or hang up immediately. How should those be handled?

Submit

Submit a single pdf file that contains all the following information:

  1. Your names, logins, and IDs
  2. The requested graph with your results (all three in the same plot). Don't forget to label the axes, etc.
  3. A short explanation of what the results show, including answers to the following questions:
    1. Which distributions dominate which other distributions? To dominate means that the CDF tends to stay lower and to the right, implying that values tend to be bigger.
    2. Does the censoring appear to be random, or does the data suggest that users who hang up are actually a separate group of users with less patience?

Submission deadline is Monday morning, 11/5/09, so I can give feedback in class on Tuesday.

Please do the exercise in pairs.
