Sunday, June 21, 2009

random sampling

A fundamental statistical operation over a network is to sample a node uniformly at random. For large networks like the internet or the Facebook social network, this is the only principled way to gain information about properties of the network like: node degree distributions, fraction of nodes with a given property, clustering coefficient, etc.

For the most part, all techniques for sampling a node uniformly at random are currently just heuristics. Maybe you take a random walk around the graph for a while and then throw in some tricks to try to correct for bias in the stationary distribution. However, I think there is the possibility of a great and important theory problem here. Increasingly, many interesting networks are huge and not completely known. Randomly sampling nodes is a great way to measure properties of such networks, but bias in the sampling can lead to all kinds of problems [1]. Here's a stab at a problem: Given an unknown graph and an arbitrary node (or two or more) in that graph, devise an algorithm to sample (close to) uniformly from the set of all nodes, when the only operation available is to move along edges of the graph. Of course, the graph would have to have "sufficiently nice" properties to do this, e.g. ergodic, good expansion, good degree distribution, etc. Coupling theory would be a good candidate tool to use for this problem.

[1] For example, see this paper for an example of how bias due to traceroutes can lead to erroneous prediction of power-law degree distribution where it does not actually exist.

P.S. A few years ago, Valerie King and I wrote a paper on choosing a random node in a peer-to-peer network. This only worked for highly structured networks and so did not solve the problem described above. However, we came up with the following algorithmic tool that might be useful for the harder problem. Imagine we can broadcast a message from some node to all other nodes in the network once for free and we want to minimize the number of nodes that need to reply to this broadcast. What is the minimum number of responses that need to be sent in order to select a node uniformly at random? A naive approach is for every node to choose a number uniformly at random between 0 and 1; have the broadcaster send out a random number between 0 and 1; have nodes respond that have random numbers closest to that of the broadcaster (mod 1); and choose the single responder that is closest. Surprisingly, this scheme has significant bias. So how then can you solve the problem? Check out the paper for the answer!

No comments:

Post a Comment