Experimental Methods in Computer Science – Exercise 5

Exercise 5 – The Distribution of Patience and Waiting

Goals

Use the Kaplan-Meier method to obtain a distribution from censored data
Understand the applicability and limitations of the method

Background

How long are users willing to wait for the computer? The answer is not a single number but a distribution: some users are willing to wait longer than others. We can collect empirical data on how long users waited before they abandoned the system in frustration. But what if most of the users actually got service? We don't know what their threshold of frustration is, because they did not reach it. We do know, however, that the threshold is above a certain value: it is larger than the time they actually waited.

Such partial data is called censored data, because some samples are "cut short". Censored data can be used when estimating an empirical distribution using the Kaplan-Meier formula. This goes as follows.

Denote the sampled values by x_i. Let d_i represent the number of real samples with magnitude x_i (that is, excluding any censored data items). Let n_i represent the number of samples with magnitude larger than or equal to x_i (both real and censored). The hazard, or risk of surviving for time x_i and then dying, is then d_i / n_i. Hence the probability of surviving beyond x_i is 1 - d_i/n_i. The probability of surviving an arbitrary value x is then the product of the probabilities of surviving all smaller values:

S(x) = prod_{x_i < x} [ 1 - d_i/n_i ]

The empirical distribution function is then F(x) = 1 - S(x).
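
The formula above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a reference implementation: the function name kaplan_meier, the representation of each sample as a (value, observed) pair, and the toy data are all made up for the example. The function returns, for each distinct uncensored value x_i, the value that S(x) takes for x just above x_i.

```python
from collections import Counter


def kaplan_meier(samples):
    """Kaplan-Meier survival estimate.

    samples: iterable of (value, observed) pairs (an assumed representation);
             observed is True for a real sample and False for a censored one.
    Returns a list of (x_i, S) pairs, where S is the value of S(x) for x
    just above x_i, i.e. the product of (1 - d_j/n_j) over all x_j <= x_i.
    """
    samples = list(samples)
    real = Counter(x for x, observed in samples if observed)  # d_i
    every = Counter(x for x, _ in samples)                    # real + censored
    n_at_risk = len(samples)  # n_i: samples with magnitude >= current x_i
    survival = 1.0
    steps = []
    for x in sorted(every):
        d = real.get(x, 0)
        if d > 0:
            survival *= 1.0 - d / n_at_risk
            steps.append((x, survival))
        # Samples with magnitude x (real or censored) are no longer counted
        # in n_i for any larger value.
        n_at_risk -= every[x]
    return steps


# Toy data (invented for illustration): values 2 and 5 are censored.
toy = [(1, True), (2, False), (3, True), (3, True), (5, False)]
for x, s in kaplan_meier(toy):
    print(f"x > {x}: S = {s:.3f}, F = {1 - s:.3f}")
```

For the toy data, n_1 = 5 and d_1 = 1, so S drops to 0.8 after x = 1. At x = 3 only three samples remain (the censored value 2 has left the count) and two of them are real, so S becomes 0.8 * (1 - 2/3), about 0.267, and F is about 0.733.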

Note that the above derivation relies on the assumption that the censoring is done at random. If this is not the case, the Kaplan-Meier formula is not applicable. One of your assignments in this exercise is to hazard a guess as to whether this assumption applies to our data.

We don't have good data of this kind for a computer system, but we do have data from a large bank's call center. Specifically, we'll use data from December 1999 (about 1.1 MB compressed, or 4.5 MB uncompressed). Rather than copying the file, you can read it directly from where it is stored using

gzip -dc ~exp/www/ex5-dec.gz | ...

The data comes from a course on service engineering at the Technion. Each service request is represented by a line with 17 space-separated fields; the ones that interest us are

field 12: the time spent in the queue.
field 13: the outcome, which is AGENT if the call was answered, and HANG if the caller gave up.
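
Extracting these two fields could look roughly like the following Python sketch. Several things here are assumptions rather than facts about the file: that fields are counted from 1 (so field 12 is index 11), that the queue time is a plain number, and that any line that does not fit (for example a header line or a different outcome) can simply be skipped. The names parse_call and read_calls are hypothetical.

```python
def parse_call(line):
    """Return (wait, hung_up) for one line, or None if the line is unusable.

    Assumes fields are numbered from 1, so field 12 is fields[11] and
    field 13 is fields[12]; assumes the queue time parses as a number.
    """
    fields = line.split()
    if len(fields) != 17:
        return None
    try:
        wait = float(fields[11])      # field 12: time spent in the queue
    except ValueError:
        return None                   # e.g. a header line (assumed possible)
    outcome = fields[12]              # field 13: AGENT or HANG
    if outcome not in ("AGENT", "HANG"):
        return None
    return wait, outcome == "HANG"


def read_calls(lines):
    """Yield (wait, hung_up) for every usable line."""
    for line in lines:
        parsed = parse_call(line)
        if parsed is not None:
            yield parsed


# A made-up line with 17 fields, only to show the call; not real data.
fake = " ".join(["x"] * 11 + ["37", "HANG"] + ["x"] * 4)
print(parse_call(fake))
```

To combine this with the gzip pipeline above, the script would pass sys.stdin to read_calls, since a file object iterates over its lines.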

Assignment

Analyze the bank call center data. Note that obtaining service (an outcome of AGENT) is regarded as a censored data point, because we didn't wait till the user gave up.

What you need to do is simply to draw and compare the CDFs of three distributions:

The empirical distribution of time till service (using only AGENT lines)
The empirical distribution of time till hanging up (using only HANG lines)
The Kaplan-Meier estimate of the distribution of patience, i.e. the distribution of time till hanging up, but also taking into account the censored data from calls that were answered.
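
For the first two items, a plain empirical CDF is enough. A minimal sketch, assuming the waiting times of one class have already been collected into a list (the function name empirical_cdf and the toy values are invented for illustration):

```python
def empirical_cdf(values):
    """Return (xs, fs): the sorted distinct values and, for each one,
    the fraction of samples that are less than or equal to it."""
    ordered = sorted(values)
    n = len(ordered)
    xs, fs = [], []
    for rank, v in enumerate(ordered, start=1):
        if xs and xs[-1] == v:
            fs[-1] = rank / n      # repeated value: extend the same step
        else:
            xs.append(v)
            fs.append(rank / n)
    return xs, fs


# Toy values, not taken from the call center data.
xs, fs = empirical_cdf([2, 1, 2, 5])
print(xs)  # [1, 2, 5]
print(fs)  # [0.25, 0.75, 1.0]
```

The result is a step function, so it is best drawn as steps rather than as a line connecting the points; with matplotlib, for instance, plt.step(xs, fs, where="post") does this.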

Hint: also look at how many samples fall in each of the two classes, censored or not, and how many of them either get service or hang up immediately. How should those be handled?
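
The counts mentioned in the hint can be gathered with a short sketch like the one below. It assumes each call is available as a (wait, hung_up) pair and, as a further assumption, treats "immediately" as a recorded waiting time of exactly 0; whether that is the right reading for this data is part of what you should decide. The function name summarize and the toy data are hypothetical.

```python
from collections import Counter


def summarize(samples):
    """Count calls per class, and how many in each class have wait == 0.

    samples: iterable of (wait, hung_up) pairs (an assumed representation).
    """
    counts = Counter()
    for wait, hung_up in samples:
        kind = "HANG" if hung_up else "AGENT"
        counts[kind] += 1
        if wait == 0:                       # assumed meaning of "immediately"
            counts[kind + " at time 0"] += 1
    return counts


# Toy data, invented for illustration only.
toy = [(0, False), (12, False), (0, True), (30, True), (45, False)]
for key, value in sorted(summarize(toy).items()):
    print(key, value)
```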

Submit

Submit a single pdf file that contains all the following information:

Your names, logins, and IDs
The requested graph with your results (all three in the same plot). Don't forget to label the axes, etc.
A short explanation of what the results show, including answers to the following questions:
1. Which distributions dominate which other distributions? To dominate means that the CDF tends to stay lower and to the right, implying that values tend to be bigger.
2. Does it seem true that the censoring is random, or does the data seem to imply that users who hang up are actually a separate group of users with less patience?
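
The notion of dominance in question 1 can also be checked numerically, as a complement to looking at the plot. The sketch below assumes each CDF is given as a pair (xs, fs) of step positions and the CDF values reached at those positions; the names cdf_at and fraction_not_above are hypothetical. It reports on what fraction of the step positions one CDF is no higher than the other; a value near 1 suggests that the first distribution dominates the second.

```python
from bisect import bisect_right


def cdf_at(xs, fs, x):
    """Value at x of the step CDF with steps at xs and heights fs."""
    i = bisect_right(xs, x)          # number of steps at or before x
    return fs[i - 1] if i > 0 else 0.0


def fraction_not_above(cdf_a, cdf_b):
    """Fraction of step positions where CDF a is <= CDF b."""
    xs_a, fs_a = cdf_a
    xs_b, fs_b = cdf_b
    grid = sorted(set(xs_a) | set(xs_b))
    if not grid:
        raise ValueError("both CDFs are empty")
    hits = sum(cdf_at(xs_a, fs_a, x) <= cdf_at(xs_b, fs_b, x) for x in grid)
    return hits / len(grid)


# Toy CDFs, invented for illustration: a has larger values than b.
a = ([2, 4, 6], [1 / 3, 2 / 3, 1.0])
b = ([1, 2, 3], [1 / 3, 2 / 3, 1.0])
print(fraction_not_above(a, b))  # 1.0: a stays at or below b everywhere
print(fraction_not_above(b, a))  # 0.2: b matches a only at the last point
```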

The submission deadline is Monday morning, 11/5/09, so that I can give feedback in class on Tuesday.

Please do the exercise in pairs.
