In this exercise we will try to generate values for a random variable from a given distribution.
There are several ways to create random variables from a given distribution. The most common is to start with a uniformly distributed random variate, and then transform it into the desired distribution.
Practically all computer systems include a function to create a pseudo-random uniformly distributed random variable. In Unix or Linux this function is simply called random(). Such functions work by starting from a certain seed, and then applying some formula to get the next number. For example, the formula used by the obsolete rand() function was xi+1 = [ xi * 1103515245 + 12345 ] mod 32768. Other generators use other formulas, with the goal of creating ``more random'' numbers and a longer cycle.
To create a random variate with a distribution characterized by the CDF F(x), one simply inverts this function. In other words, given a uniform random variate u (in the range [0,1]), we calculate x = F-1(u). Doing this for many us creates a set of xs from the desired distribution.
The assignment is to generate a large number of random variates from each of the two distributions, using both the random() and drand48() random number generators. (These are available in C, not sure about Java.) Use a = 1 in all cases. Moreover, for each combination repeat this for different quantities of random variates: 10, 1000, 100,000, and 10,000,000. Finally, repeat the whole thing 5 times with different seeds. Thus the total number of data sets you should produce is 2 distributions X 2 generators X 4 sizes X 5 seeds = 80.
As a simple check of what you got, compute the average and standard deviation of the random variates you created in each of the cases (note that you can save disk space by doing this on the fly). Is this OK? Are there any problems with the results? Is this related to the seeds used? To the number of variates generated?
A file with 100,000 numbers in it (ASCII format, one number per line) is available here (or you can access it directly from any Linux station in the university at ~perf/www/cspar). Create graphs of the distribution of these numbers, and also do this for one of your 100,000 exponential variates sets and one of your 100,000 Pareto variates sets.
What can you see in the graphs? Are the distributions similar in any way? Can you tell whether cspar is more similar to an exponential distribution or a Pareto distribution?
Use Moodle to submit a report on your work, in pdf format, with the following data.