Experimental Methods in Computer Science

Exercise 9 – On the Graph of Phone Calls

Background

As an example of working with networks, we will be using data from the call detail records (CDRs) of 7 billion phone calls.

This data was reduced in stages, starting with raw data records that looked like:

dmQCmFFUTps VCWyZXuAAud 050801 110345 4
NgmvJIJzrIm jdJBQCOdlpB 050801 110346 209
GNgrHXnwnRv oThSAWCeJdg 050801 110346 76
GNgrCtaAJTk GNgrrJAjMnD 050801 110347 286

Each line consists of two hashed IDs, which are for the most part telephone numbers; then the year, month, and day of the call; then the hour, minute, and second at which the call was placed; and finally the duration of the call in seconds. Prefixes such as GNgr and VCWy identify metropolitan regions. The first and third examples above are calls between one of these metro regions and some other, more distant number. The fourth line reports a call within the GNgr region. The region PnLa was the most frequently seen; calls in which both source and destination were in the PnLa region defined 3M distinct source-destination pairs. We consider such connections to be links in a graph, so for PnLa we have 3M links.
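To make the record layout concrete, here is a minimal parsing sketch in Python. The field names are purely illustrative; they are not taken from the actual reduction pipeline.

    # Hypothetical parsing of one raw CDR line; field names are our own.
    from collections import namedtuple

    Record = namedtuple("Record", "src dst date time duration")

    def parse_line(line):
        src, dst, date, time, dur = line.split()
        return Record(src, dst, date, time, int(dur))

    rec = parse_line("GNgrCtaAJTk GNgrrJAjMnD 050801 110347 286")
    print(rec.src[:4], rec.duration)   # region prefix 'GNgr', 286 seconds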

The full data set was reduced to a list of about 1.5B links, and some parallel-processing magic was used to give each hashed ID a unique identifier: a number between 1 and about 116,600,000. In the reduced data set, we distinguish a "work" period from a "leisure" period: "work" calls were placed between 8 AM and 6 PM, Monday to Friday; "leisure" calls are all the others. For each link, identified by a specific source-destination pair, the aggregated data includes the total number of calls placed and the total duration of those calls, separated into four categories: source to destination during the work period, destination to source during the work period, and the same two directions for the leisure period.

For this exercise, we performed an additional level of aggregation, removing all the details and leaving just the graph structure. The end result is two data files describing all links found within the PnLa area, giving only the two IDs for each link. One file contains links found during work time, the other during leisure time. Calls in either direction are lumped together, so each pair of IDs occurs once, not twice. The IDs are numbers between 1 and 3M.
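If each line of these files holds two whitespace-separated integer IDs (an assumption; check the actual layout before relying on it), the degrees needed below can be computed in a single streaming pass, without ever holding the links in memory. A minimal sketch:

    # Sketch: stream one edge-list file and count site degrees.
    # ASSUMPTION: each line is "<id1> <id2>" with integer IDs in 1..3M.
    import numpy as np

    N = 3_000_000                              # approximate number of IDs in PnLa

    def degrees(path, n=N):
        deg = np.zeros(n + 1, dtype=np.int32)  # index 0 unused; IDs start at 1
        with open(path) as f:
            for line in f:
                a, b = line.split()
                deg[int(a)] += 1               # each undirected link adds one
                deg[int(b)] += 1               # to the degree of both endpoints
        return deg

    # deg_work = degrees("/cs/par/users/kirk/<work-file>")  # file name hypothetical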

The data files are available locally at /cs/par/users/kirk. They are big. Do not copy them to other places; just use them directly from there.

Assignment

This exercise is more open-ended than usual. The goal is not to make life hard for you, but to allow you to explore the data and see if it interests you. Only the first two items are mandatory, and they are not difficult.

  1. Determine the degree of each site in the PnLa network. Plot the degree distribution and characterize it, comparing it to the long-tailed distributions that you have seen in this course. In particular, consider the Pareto distribution (where the survival function drops off as a power law) and the lognormal distribution (where the density has the characteristic bell shape in log space). A minimal plotting sketch appears after this list.

    Where do you think the boundary between clearly human activity and clearly robotic calling should be placed? If you want information other than just the links to answer this question (and others), just ask before Pesach, and Scott will extract it.

    In this and all subsequent questions, compare the work period results with the leisure period results and comment on the differences.

  2. Extract the degree-degree correlations between neighbors in the PnLa network. Use a visualization such as a scatter plot to show whether the network is assortative (nodes tend to connect to other nodes with similar characteristics), disassortative (low-degree nodes tend to connect to high-degree nodes and vice versa), or a little of each.

    Compare the scatter plot of this information with a simpler plot where you just show the average degree of neighboring nodes, to see whether the distributions are highly skewed. Comment on the two methods. (A streaming sketch for the neighbor-degree computation appears after this list.)

    Note that here you need to do much more work than in the previous question, and keep more data. You can either build an adjacency data structure in the programming environment of your choice, or keep some lower-resolution data that will suffice for your needs. Think about this before you start.

  3. Extra credit -- read Newman's article to see where he carefully subtracts off the degree of correlation to be expected simply as a result of the degree distribution. When you use his formula, zero is the boundary between assortative and disassortative. What do you get? (A sketch of the computation appears after this list.)

  4. More extra credit: Using your adjacency structures, find what fraction of a site's neighbors are themselves neighbors of one another, and plot this as a function of site degree. Calculate a "cluster coefficient" for each site, defined as the ratio of the number of triangles with the site at one corner to z(z-1)/2, where z is the site degree. You could also take the fraction of three-step paths starting at a site that return to that site. Is that the same as the first quantity? Plot the distribution of cluster coefficients against site degree. What is the average cluster coefficient for the whole network? Comment. (A sketch appears after this list.)

  5. Extra extra credit: either of the following:

    1. Specify a reasonable traffic model for betweenness; for example, every site with sufficiently low degree (or a sampling of such sites) sends a message to every other site in the set along a shortest path, selected at random. Count the number of messages that pass through each site in the network and characterize the sites with the highest betweenness. Are they the ones with the highest degree? Plot and comment. (A sketch appears after this list.)
    2. Find the size distribution of the k-shells in the PnLa network by performing the pruning process I described. Where is the nucleus? Do you find one nucleus or more than one? (A pruning sketch appears after this list.)
  6. Extra extra extra credit: how does betweenness depend on k-shell index in the PnLa network?
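For item 1, a minimal plotting sketch that builds on the degrees() function above. On log-log axes a Pareto tail appears as a straight line, while a lognormal shows steadily increasing downward curvature.

    # Sketch for item 1: empirical survival function of the degree
    # distribution, plotted on log-log axes.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_survival(deg):
        d = np.sort(deg[deg > 0])                # drop unused/isolated IDs
        surv = 1.0 - np.arange(len(d)) / len(d)  # P(D >= d) at each sorted value
        plt.loglog(d, surv, ".", markersize=2)
        plt.xlabel("degree d")
        plt.ylabel("P(D >= d)")
        plt.show()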
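For item 2, the average degree of a site's neighbors can be accumulated in a second streaming pass over the edge list, reusing the deg array, so no adjacency structure is needed yet. A sketch:

    # Sketch for item 2: average neighbor degree per site.
    import numpy as np

    def avg_neighbor_degree(path, deg):
        nbr_sum = np.zeros(len(deg), dtype=np.int64)
        with open(path) as f:
            for line in f:
                a, b = (int(x) for x in line.split())
                nbr_sum[a] += deg[b]             # b contributes its degree to a
                nbr_sum[b] += deg[a]             # and vice versa
        with np.errstate(divide="ignore", invalid="ignore"):
            return nbr_sum / deg                 # NaN where deg == 0

Averaging this quantity over all sites of equal degree gives the familiar k_nn(k) curve: rising means assortative, falling means disassortative.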
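For item 3, Newman's coefficient can be accumulated in the same kind of streaming pass. This sketch follows the edge-based formula in Newman's "Assortative mixing in networks" (2002), with j and k the degrees at the two ends of each link:

    # Sketch for item 3: Newman's assortativity coefficient r.
    # r > 0 means assortative, r < 0 disassortative, r = 0 neutral.
    def assortativity(path, deg):
        m = s_jk = s_half = s_sq = 0
        with open(path) as f:
            for line in f:
                a, b = (int(x) for x in line.split())
                j, k = int(deg[a]), int(deg[b])  # plain ints avoid overflow
                m += 1
                s_jk += j * k
                s_half += 0.5 * (j + k)
                s_sq += 0.5 * (j * j + k * k)
        mean = s_half / m
        return (s_jk / m - mean ** 2) / (s_sq / m - mean ** 2)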
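For item 4 you do need real adjacency sets. A sketch, assuming the roughly 3M links fit in memory as Python sets (try a sample first); note that the neighbor-link count is quadratic in degree, so consider skipping or sampling the largest hubs:

    # Sketch for item 4: per-site cluster coefficient from adjacency sets.
    from collections import defaultdict

    def load_adj(path):
        adj = defaultdict(set)
        with open(path) as f:
            for line in f:
                a, b = (int(x) for x in line.split())
                adj[a].add(b)
                adj[b].add(a)
        return adj

    def cluster_coefficients(adj):
        c = {}
        for v, nbrs in adj.items():
            z = len(nbrs)
            if z < 2:
                continue                         # undefined for degree < 2
            # links among v's neighbors; u < w avoids double counting
            links = sum(1 for u in nbrs for w in adj[u] if w in nbrs and u < w)
            c[v] = links / (z * (z - 1) / 2)
        return c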
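For item 5.1, one concrete reading of the traffic model: run a BFS from each sampled source, then deliver one message to every reachable site along a random shortest path, chosen by walking back through randomly selected predecessors. A sketch (each BFS touches the whole component, so sample sources sparingly):

    # Sketch for item 5.1: count messages relayed through each site.
    import random
    from collections import deque, defaultdict

    def message_counts(adj, sources):
        counts = defaultdict(int)
        for s in sources:
            dist, pred = {s: 0}, {}
            q = deque([s])
            while q:                             # standard BFS from s
                v = q.popleft()
                for u in adj[v]:
                    if u not in dist:
                        dist[u] = dist[v] + 1
                        pred[u] = [v]
                        q.append(u)
                    elif dist[u] == dist[v] + 1:
                        pred[u].append(v)        # another shortest predecessor
            for t in pred:                       # every site reachable from s
                v = t
                while True:                      # walk one random shortest path
                    v = random.choice(pred[v])
                    if v == s:
                        break
                    counts[v] += 1               # message relayed through v
        return counts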
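For item 5.2, a sketch of the pruning process, reusing the adjacency sets from item 4: repeatedly remove every site whose remaining degree is at most k, recording k as its shell index; the sites removed at the highest k are the nucleus.

    # Sketch for item 5.2: k-shell decomposition by repeated pruning.
    def k_shells(adj):
        deg = {v: len(n) for v, n in adj.items()}
        shell, k = {}, 0
        while deg:
            k = max(k, min(deg.values()))        # next pruning threshold
            queue = [v for v, d in deg.items() if d <= k]
            while queue:
                v = queue.pop()
                if v in shell:
                    continue                     # already pruned
                shell[v] = k
                del deg[v]
                for u in adj[v]:
                    if u in deg:
                        deg[u] -= 1
                        if deg[u] <= k:          # cascade within this shell
                            queue.append(u)
        return shell                             # max shell index = the nucleus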

This is an active area of research. Some of the questions asked above haven't been answered, and none have been published beyond a conference poster or two. Scott would be interested in your answers.

Submit

Submit a single pdf file that contains all the following information:

  1. Your names, logins, and IDs.
  2. The results of the analysis as described above.
  3. Any relevant output plots you may have generated.
Submission deadline is Sunday morning, 1/5/11.

Please do the exercise in pairs.
