Topics in Performance Evaluation – Exercise 5

Topics in Performance Evaluation

Exercise 5 – Random Variate Generation

In this exercise we will try to generate values for a random variable from a given distribution.

Background

There are several ways to create random variables from a given distribution. The most common is to start with a uniformly distributed random variate, and then transform it into the desired distribution.

Practically all computer systems include a function to create a pseudo-random uniformly distributed random variable. In Unix or Linux this function is simply called random(). Such functions work by starting from a certain seed, and then applying some formula to get the next number. For example, the formula used by the obsolete rand() function was x_i+1 = [ x_i * 1103515245 + 12345 ] mod 32768. Other generators use other formulas, with the goal of creating ``more random'' numbers and a longer cycle.

To create a random variate with a distribution characterized by the CDF F(x), one simply inverts this function. In other words, given a uniform random variate u (in the range [0,1]), we calculate x = F^-1(u). Doing this for many us creates a set of xs from the desired distribution.

Assignment

Get to know the distributions we will use in this exercise.
- The exponential distribution, specified by the CDF F(x) = 1 - e^{- a x}. This has a mean of 1 / a. Based on the inversion technique, the formula used to create an exponential variate x from a uniform variate u is x = -(1/a) ln(u). Check that this formula is convincing.
- The Pareto distribution, specified by the CDF F(x) = 1 - x^-a. For a = 1, it does not have a mean! (Try to understand why. What does it mean for a distribution not to have a mean?). However, it is still possible to generate variates from this distribution using the formula x = 1 / u^1/a.
Generate random variates.
The assignment is to generate a large number of random variates from each of the two distributions, using both the random() and drand48() random number generators. (These are available in C, not sure about Java.) Use a = 1 in all cases. Moreover, for each combination repeat this for different quantities of random variates: 10, 1000, 100,000, and 10,000,000. Finally, repeat the whole thing 5 times with different seeds. Thus the total number of data sets you should produce is 2 distributions X 2 generators X 4 sizes X 5 seeds = 80.
Check the results.
As a simple check of what you got, compute the average and standard deviation of the random variates you created in each of the cases (note that you can save disk space by doing this on the fly). Is this OK? Are there any problems with the results? Is this related to the seeds used? To the number of variates generated?
Look at real data.
A file with 100,000 numbers in it (ASCII format, one number per line) is available here (or you can access it directly from any Linux station in the university at ~perf/www/cspar). Create graphs of the distribution of these numbers, and also do this for one of your 100,000 exponential variates sets and one of your 100,000 Pareto variates sets.
What can you see in the graphs? Are the distributions similar in any way? Can you tell whether cspar is more similar to an exponential distribution or a Pareto distribution?

Submit

Use Moodle to submit a report on your work, in pdf format, with the following data.

Your names, logins, and IDs
Your results: the averages and standard deviations of the random variates you generated in the different cases. Present this in an orderly table.
Can you explain these results? Is everything OK? Specifically,
1. What is the effect of using more and more variates?
2. What is the difference (if any) between the exponential distribution and the Pareto distribution?
3. What is the difference (if any) between using random() and using drand48()?
Would showing the results graphically add anything? If so, please add an appropriate graph (or graphs), and explain what it shows.
Graphs of the 3 distributions of 100,000 values (exponential, Pareto, and cspar). If you think cspar is more similar to either the exponential or Pareto distribution, explain why.

Submission deadline is Monday, 31 March 2014, so I can give feedback in class on Tuesday.

To the course home page