Topics in Performance Evaluation – Exercise 6

Exercise 6 – Fitting a Distribution with Q-Q Plots

In this exercise we will compare two distributions using quantile-quantile plots.

Background

A good way to characterize a distribution is to show that it is similar to a model with a simple mathematical definition.

The most primitive way to compare distributions is to compare their statistics: the mean, the standard deviation, or the coefficient of variation (the standard deviation divided by the mean). A better way is to look at the whole distribution using quantile-quantile (Q-Q) plots. This means that we list the values at the different quantiles (e.g. at 10% of the distribution, at 20%, 30%, etc.) for both distributions, and create a plot in which each point represents the same quantile in both: one coordinate is the value from the first distribution, and the other is the value from the second. A good match should result in a straight line with slope 1.
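
As an illustration of the construction just described, here is a minimal Python sketch that turns two samples into Q-Q points. The function names `quantile` and `qq_points`, the linear interpolation between order statistics, and the grid of 99 quantiles are all assumptions made for this example; they are not prescribed by the exercise. Plotting is left to whatever tool you prefer.

```python
import random

def quantile(sorted_values, q):
    """Value at quantile q (0 <= q <= 1) of an already sorted list,
    interpolating linearly between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def qq_points(sample_x, sample_y, quantiles):
    """Return one (x, y) pair per quantile: the value of that quantile
    in sample_x and the value of the same quantile in sample_y."""
    xs = sorted(sample_x)
    ys = sorted(sample_y)
    return [(quantile(xs, q), quantile(ys, q)) for q in quantiles]

# Illustrative choice of quantiles: 1%, 2%, ..., 99%.
QUANTILES = [i / 100 for i in range(1, 100)]

# Demo with two samples drawn from the same distribution, so the
# points should lie close to the line y = x.
rng = random.Random(1)
first = [rng.expovariate(1.0) for _ in range(10000)]
second = [rng.expovariate(1.0) for _ in range(10000)]
for x, y in qq_points(first, second, QUANTILES)[9::10]:
    print(f"{x:8.3f} {y:8.3f}")
```

The two samples do not need to have the same size, because each one is reduced to its values at the chosen quantiles before the points are paired.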

We will use this method to try to decide which distribution provides a good model for the "cspar" dataset.

Assignment

  1. Get the data.

    The data is the same file of 100,000 numbers (ASCII format, one number per line) which we looked at in Ex5. (You can also access it directly from any Linux station in the university at ~perf/www/cspar.)

  2. Check the fit to an exponential model.

    Do this by calculating the mean, standard deviation, and coefficient of variation (CV) of your data (the CV is the quotient of the standard deviation divided by the mean), and by creating a quantile-quantile plot. Use an exponential distribution with the correct mean as the reference.

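    You can either use the formula for the quantiles of the exponential distribution (the quantile at fraction q is -m ln(1-q)), or sample from an exponential distribution by computing -m ln(u), where m is the desired mean, ln is the natural logarithm, and u is a uniformly distributed random variate on (0,1]. (The value u = 0 must be excluded, because ln(0) is undefined.)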

    You need to decide how many samples to generate, and what quantiles to use in the comparison.
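
    The following Python sketch shows one possible way to do the calculations for this step. The names `summary`, `exp_sample`, and `exp_quantile` are made up for this example, the mean of 5.0 is only a placeholder for the mean of your data, and the use of the population form of the standard deviation (dividing by n) is an assumption; with 100,000 values the difference from dividing by n-1 is negligible.

```python
import math
import random

def summary(values):
    """Return (mean, standard deviation, CV) of a list of numbers.
    Uses the population standard deviation (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return mean, std, std / mean

def exp_sample(m, rng):
    """One exponential variate with mean m, computed as -m ln(u)."""
    u = 1.0 - rng.random()  # random() is in [0,1), so u is in (0,1]
    return -m * math.log(u)

def exp_quantile(m, q):
    """Quantile at fraction q (0 <= q < 1) of an exponential
    distribution with mean m, from inverting its CDF."""
    return -m * math.log(1.0 - q)

rng = random.Random(42)
m = 5.0  # placeholder: replace with the mean of the data
samples = [exp_sample(m, rng) for _ in range(100000)]
mean, std, cv = summary(samples)
print(f"mean={mean:.3f} std={std:.3f} CV={cv:.3f}")
print(f"median from formula: {exp_quantile(m, 0.5):.3f}")
```

    Using `exp_quantile` gives the reference quantiles exactly, while sampling gives values that vary from run to run, especially in the tail.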

  3. Check the fit to a hyperexponential distribution.

    You can create samples from a hyperexponential distribution with the correct mean and standard deviation as follows.

    1. Calculate the CV squared: $c^2 = (s/m)^2$, where s is the standard deviation and m is the mean.
    2. Calculate $p = \frac{1}{2}\left[ 1 - \sqrt{\frac{c^2-1}{c^2+1}} \right]$. (This requires $c^2 \ge 1$.)
    3. With probability p generate an exponential variate with mean $m/(2p)$. Otherwise generate an exponential variate with mean $m/(2(1-p))$.

    Again, compare the data with the model using the mean, standard deviation, CV, and quantile-quantile plot.
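
    The three steps above can be sketched in Python as follows. The function names `hyperexp_params` and `hyperexp_sample` are made up for this example, and the values m = 10 and s = 30 are placeholders, not properties of the data. The check that the squared CV is at least 1 is an addition: for smaller values the square root in step 2 has a negative argument.

```python
import math
import random

def hyperexp_params(m, s):
    """Return (p, mean1, mean2) for a two-branch hyperexponential
    distribution with mean m and standard deviation s."""
    c2 = (s / m) ** 2  # step 1: squared CV
    if c2 < 1.0:
        raise ValueError("this construction needs a CV of at least 1")
    p = 0.5 * (1.0 - math.sqrt((c2 - 1.0) / (c2 + 1.0)))  # step 2
    return p, m / (2.0 * p), m / (2.0 * (1.0 - p))

def hyperexp_sample(p, mean1, mean2, rng):
    """Step 3: with probability p use mean1, otherwise mean2, and
    draw an exponential variate with the chosen mean."""
    mean = mean1 if rng.random() < p else mean2
    u = 1.0 - rng.random()  # u is in (0,1], so log(u) is defined
    return -mean * math.log(u)

rng = random.Random(7)
p, mean1, mean2 = hyperexp_params(10.0, 30.0)  # placeholder m and s
samples = [hyperexp_sample(p, mean1, mean2, rng) for _ in range(100000)]
print(f"p={p:.4f} mean1={mean1:.2f} mean2={mean2:.2f}")
print(f"sample mean={sum(samples) / len(samples):.2f}")
```

    Each branch contributes m/2 to the overall mean, since p times m/(2p) and (1-p) times m/(2(1-p)) are both m/2, which is why the mixture has mean m.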

  4. Check the fit to a Pareto distribution.

    The Pareto distribution is characterized by a shape parameter a, which is positive. Given a set of n observations $x_i$, the parameter a can be estimated as $a = 1 / \left[ \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right]$. To create samples from this distribution, compute $1 / u^{1/a}$, where u is again a uniformly distributed random variate on (0,1]. Both formulas are for a Pareto distribution whose minimum value is 1.

    And yet again, do the comparison of the data with the model using the mean, standard deviation, CV, and quantile-quantile plot.
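
    The estimate and the sampling formula can be sketched in Python as follows. The names `pareto_shape`, `pareto_sample`, and `pareto_quantile` are made up for this example, and the shape value 2.0 in the demo is a placeholder. The sketch assumes that all observations are at least 1 and not all equal to 1; how to treat data that violate this is left to you. The quantile function comes from inverting the CDF $1 - x^{-a}$ and is an addition to the text above.

```python
import math
import random

def pareto_shape(values):
    """Estimate the shape a as 1 / ((1/n) * sum of ln x_i).
    Assumes every value is at least 1 (minimum value 1)."""
    return len(values) / sum(math.log(x) for x in values)

def pareto_sample(a, rng):
    """One Pareto variate with shape a and minimum 1: 1 / u^(1/a)."""
    u = 1.0 - rng.random()  # u is in (0,1], which avoids division by 0
    return 1.0 / u ** (1.0 / a)

def pareto_quantile(a, q):
    """Quantile at fraction q (0 <= q < 1), from the CDF 1 - x^(-a)."""
    return (1.0 - q) ** (-1.0 / a)

rng = random.Random(11)
true_a = 2.0  # placeholder shape used only for this demo
samples = [pareto_sample(true_a, rng) for _ in range(100000)]
print(f"estimated a = {pareto_shape(samples):.3f}")
print(f"median from formula = {pareto_quantile(true_a, 0.5):.3f}")
```

    Estimating a from samples generated with a known shape, as in the demo, is a quick way to check that the estimator and the generator agree with each other before applying them to the data.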

Submit

Use Moodle to submit a report on your work, in pdf format, with the following data.

  1. Your names, logins, and IDs
  2. A table with the mean, standard deviation, and CV for the original data and for the same number of samples generated from each of the models (or calculated from the model). Do these results indicate that the data fit any or all of the models?
  3. The three Q-Q plots. Also give your judgement regarding what you see in the Q-Q plots.
  4. Your summary: do you have an opinion regarding which model is better, and which comparison method is better?
Submission deadline is Monday morning, 28 April 2014, so I can give feedback in class on Tuesday.

Please do the exercise in pairs.
