Friday, December 17, 2010

Unintended Truth or Consequences

If you look back, you realize that the most surprising result of the Netflix contest is not the fact that thousands of researchers spent a year improving an algorithm by 10% (some of them using matrix completion techniques devised by some of you), but rather that what we thought we had (privacy) was pretty much an illusion. Talk about a surprise. As I was searching for a specific blog entry on the Netflix Prize in my Google Reader feeds, I found this entry on Government 2.0, written a month ago by Tom Slee. I think he aptly describes privacy as an arms race.
....Data anonymization and aggregation are often cited as ways of handling these issues, but recent developments in “re-identification” have shown that such efforts may be doomed. The identification of individuals within the Netflix Prize data set (paper) led to the cancellation of the second Netflix contest, and has cast a chill on crowdsourcing contests. Legal scholar Paul Ohm’s recent long but highly readable paper on Broken Promises of Privacy suggests that we need to re-assess the basis for many of our data privacy laws because information previously thought to be anonymous has now got to be treated as potentially privacy-damaging.
This arms race took a new turn this week with tools such as nonadaptive queries that can duplicate entire databases (Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case Studies by Arthur Asuncion and Michael Goodrich). I note that the paper does not seem to be aware of recent work in group testing and compressed sensing (see references below), and one really wonders how some of the bounds and techniques devised in CS could improve the cloning of, or the extraction of information from, databases we think are anonymized. I also wonder to what extent the bounds given by the Donoho-Tanner phase transition could provide some sort of relief in that respect. And while it may just be a simple matter of better governance through Government 2.0, in the last few days we've had three examples of potential large privacy breaches outside the whole Gov 2.0 and Google-Facebook fight for the social graph:
In all these cases, little thought seems to have been given to the nature of the anonymization process in view of the very recent tools developed in group testing and compressive sensing.

  1. Boolean Compressed Sensing and Noisy Group Testing by George Atia and Venkatesh Saligrama
  2. Note on Noisy Group Testing: Asymptotic Bounds and Belief Propagation Reconstruction by Dino Sejdinovic and Oliver Johnson
  3. Superselectors: Efficient Constructions and Applications by Ferdinando Cicalese and Ugo Vaccaro
  4. Derandomization and Group Testing by Mahdi Cheraghchi
  5. Improved Constructions for Non-adaptive Threshold Group Testing by Mahdi Cheraghchi
  6. Noisy Group Testing and Boolean Compressed Sensing by Venkatesh Saligrama
  7. Group Testing and Sparse Recovery by Mark Iwen
  8. Compressed Sensing with Probabilistic Measurements: A Group Testing Solution by Mahdi Cheraghchi, Ali Hormati, Amin Karbasi and Martin Vetterli
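To make the group-testing angle concrete, here is a minimal sketch of nonadaptive (noiseless) group testing with the simple COMP decoder: a fixed random pooling design is issued all at once, which is exactly the kind of nonadaptive querying discussed above. The pooling density and test count below are illustrative choices of mine, not parameters taken from any of the papers listed.

```python
import numpy as np

def comp_decode(pools, results):
    """COMP decoder: any item appearing in a negative test is non-defective;
    everything that survives is declared defective."""
    n = pools.shape[1]
    candidates = np.ones(n, dtype=bool)
    for row, res in zip(pools, results):
        if not res:                     # negative test clears all its items
            candidates[row] = False
    return np.flatnonzero(candidates)

rng = np.random.default_rng(0)
n, k, T = 50, 3, 100                    # items, defectives, nonadaptive tests
defectives = rng.choice(n, size=k, replace=False)
x = np.zeros(n, dtype=bool)
x[defectives] = True

# Bernoulli pooling design, fixed before any result is seen (nonadaptive)
pools = rng.random((T, n)) < 0.25
results = (pools & x).any(axis=1)       # each test is the OR of its pooled items

estimate = comp_decode(pools, results)
```

In the noiseless setting COMP never misses a true defective (defectives cannot appear in a negative test), and with enough tests the false positives vanish as well, which is why roughly O(k log n) fixed queries suffice to pin down a k-sparse secret.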
Opportunity snapped this image of Santa Maria crater while 20 meters from its rim on sol 2450 (December 15, 2010). Credit: NASA / JPL


Kevembuangga said...

Privacy is gone already, just get used to it...

Igor said...

It may be gone for savvy users of Google technology like you, but not everybody uses the interwebs in such fashion. Furthermore, what I am pointing out is that there are now several academic disciplines whose output can clearly be useful both to the private sector and to governments, which are there to provide a level playing field (or rules/laws). The example of HIPAA in light of the Netflix identification exercise is probably the most eloquent one. Some of these anonymization processes need to be tried and tested: whitening a name or a date of birth is just not sufficient anymore. It's an arms race; both parties ought to have similar tools.