Sunday, June 21, 2009

random sampling

A fundamental statistical operation over a network is to sample a node uniformly at random. For large networks like the Internet or the Facebook social network, this is the only principled way to estimate properties of the network such as the degree distribution, the fraction of nodes with a given property, the clustering coefficient, etc.

For the most part, current techniques for sampling a node uniformly at random are just heuristics. Maybe you take a random walk around the graph for a while and then throw in some tricks to try to correct for bias in the stationary distribution (one standard correction is sketched below). However, I think there is the possibility of a great and important theory problem here. Increasingly, many interesting networks are huge and not completely known. Randomly sampling nodes is a great way to measure properties of such networks, but bias in the sampling can lead to all kinds of problems [1]. Here's a stab at a problem: given an unknown graph and an arbitrary node (or two or more) in that graph, devise an algorithm to sample (close to) uniformly from the set of all nodes, when the only operation available is to move along edges of the graph. Of course, the graph would have to have "sufficiently nice" properties for this to work, e.g. the random walk on it is ergodic, it has good expansion, a reasonable degree distribution, etc. Coupling theory would be a good candidate tool for this problem.
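To make the random-walk idea concrete, here is a minimal sketch (my own illustration, not taken from any of the papers mentioned here) of one standard bias correction, a Metropolis-Hastings walk. A plain random walk on an undirected graph visits a node with probability proportional to its degree; rejecting moves toward higher-degree neighbors with the right probability cancels that bias, so the stationary distribution becomes uniform. The adjacency-dict representation and the choice of steps are assumptions of the sketch; how large steps must be (the mixing time) is exactly where the hard theory lives.

```python
import random

def mh_uniform_sample(graph, start, steps, rng=random):
    """Metropolis-Hastings walk on an undirected, connected graph whose
    stationary distribution is uniform over the nodes.

    A plain random walk lands on a node with probability proportional to
    its degree; the acceptance test below rejects moves toward
    higher-degree neighbors just often enough to cancel that bias.
    """
    v = start
    for _ in range(steps):
        w = rng.choice(graph[v])
        # Accept the move with probability min(1, deg(v)/deg(w)).
        if rng.random() < len(graph[v]) / len(graph[w]):
            v = w
    return v

# Toy usage: on a star graph, a plain walk would sit on the hub half the
# time; the corrected walk returns each of the 5 nodes about 20% of the time.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
samples = [mh_uniform_sample(star, start=0, steps=200) for _ in range(5000)]
print([round(samples.count(v) / 5000, 2) for v in star])
```

Note that the walk uses only the operation the problem statement allows, moving along edges, plus knowledge of the degrees of the current node and the proposed neighbor.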

[1] See this paper for an example of how bias in traceroute sampling can lead to erroneous inference of a power-law degree distribution where none actually exists.

P.S. A few years ago, Valerie King and I wrote a paper on choosing a random node in a peer-to-peer network. It only worked for highly structured networks and so did not solve the problem described above. However, we came up with the following algorithmic tool that might be useful for the harder problem. Imagine we can broadcast a message from some node to all other nodes in the network once for free, and we want to minimize the number of nodes that reply to this broadcast. What is the minimum number of responses needed to select a node uniformly at random? A naive approach: every node chooses a number uniformly at random between 0 and 1; the broadcaster sends out its own random number between 0 and 1; the nodes whose numbers are closest to the broadcaster's (mod 1) respond; and the single closest responder is selected. Surprisingly, this scheme has significant bias. So how then can you solve the problem? Check out the paper for the answer!
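If you want to see the bias for yourself, here is a small simulation (my own sketch, not from the paper). The key assumption is that each node's random value is fixed once, as it would be for long-lived nodes in a peer-to-peer network: the chance a given node is selected is then the length of the arc of the unit circle that its value owns, and those arc lengths vary widely.

```python
import random

def simulate_naive_selection(n=100, trials=50_000, seed=1):
    """Empirically measure the bias of the naive closest-value scheme.

    Each node fixes a random point on the unit circle once. In each trial
    the broadcaster picks a random r, and the node whose point is closest
    to r (mod 1) is selected. A node wins exactly when r falls in the arc
    it owns, and those arcs are far from equal in length.
    """
    rng = random.Random(seed)
    points = [rng.random() for _ in range(n)]
    wins = [0] * n
    for _ in range(trials):
        r = rng.random()
        # Select the node at minimum circular distance from r.
        winner = min(range(n),
                     key=lambda i: min(abs(points[i] - r),
                                       1 - abs(points[i] - r)))
        wins[winner] += 1
    probs = sorted(w / trials for w in wins)
    print(f"min P(selected) = {probs[0]:.4f}, max = {probs[-1]:.4f}, "
          f"uniform would be {1 / n:.4f}")

simulate_naive_selection()
```

With n = 100, the spread between the most- and least-likely node is typically an order of magnitude or more, which is the "significant bias" mentioned above.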

Bertinoro

Last week, I attended the Algorithms and Data Structures (ADS) workshop in Bertinoro. Bertinoro is a small medieval town near Bologna, Italy, that is perfect for a workshop: a hilltop castle offers well-protected meeting space for the talks, a nearby nunnery provides living accommodations for the participants, and the restaurants in town are uniformly excellent.

The workshop had a great collection of talks, including several excellent talks on data structures, an area I don't normally keep up on. Some that stood out for me:
  • Monika Henzinger, formerly head of research at Google and now at EPFL, surveyed recent results on the problem of choosing a random web page. This problem is rendered non-trivial by the fact that nobody knows the entire network of web pages and links, so the only tool available for sampling is to walk from one page to another. More on this topic in a later post.
  • My friend Amos Fiat, at Tel Aviv University, gave a great talk on a connection between differential privacy and computational geometry. A coreset is a small subset of points that can be used to compute an approximate solution to a problem over a larger point set. Amos introduced the concept of a private coreset, which essentially has low sensitivity to the replacement of a single point in the original point set.
  • Andrew Goldberg introduced the concept of the highway dimension of a graph to explain why certain heuristics for speeding up shortest-path computations work so well on road networks. Essentially, this is a number, like the treewidth of a graph, that measures the structural simplicity of the graph; smaller highway dimension means a more algorithmically tractable graph. I really like results like these that develop analytic techniques moving beyond worst-case analysis.
After the conference, Amos and I traveled to Rome, where we worked with Stefano Leonardi for several days on a problem related to auctions with budgets.