CSCE 470 Lecture 35

Wednesday, November 20, 2013


Privacy

Caverlee grew up before the internet age, but adopted the internet once it arrived.

Our data wake is massive nowadays: everyone is tracking everything we do...

Thank heavens our data is spread across many separately protected databases (not all stored together).

A lot of good can come from analyzing the data, but how can we release data without violating users' privacy?

Some systems allow individual control of privacy settings (others don't)

Some databases have public counterparts that can be linked:


Massachusetts Governor

The governor of Massachusetts was identified by linking the following two datasets; the same combination of attributes can identify 87% of the US population:


Medical:

  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge
  • Zip
  • Birth Date
  • Sex

Voter Records:

  • Name
  • Address
  • Date registered
  • Party affiliation
  • Date last voted
  • Zip
  • Birth Date
  • Sex

The attributes shared by both datasets (Zip, Birth Date, Sex) are called the "quasi-identifier": none of them identifies a person alone, but together they often do.
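
The linkage can be sketched in a few lines of Python. This is only a minimal illustration, assuming the medical records were released with Name and SSN stripped. All records, names, and the function name link_records are invented for the example.

```python
from collections import defaultdict

# Attributes shared by both datasets (the quasi-identifier).
QUASI_IDENTIFIER = ("zip", "birth_date", "sex")

# Toy "de-identified" medical records: every value is invented.
medical = [
    {"diagnosis": "hypertension", "zip": "02139", "birth_date": "1950-01-15", "sex": "M"},
    {"diagnosis": "asthma", "zip": "02139", "birth_date": "1982-06-03", "sex": "F"},
    {"diagnosis": "diabetes", "zip": "02140", "birth_date": "1975-11-20", "sex": "F"},
]

# Toy public voter records: every value is invented.
voters = [
    {"name": "Pat Example", "zip": "02139", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Sam Sample", "zip": "02139", "birth_date": "1982-06-03", "sex": "F"},
    {"name": "Lee Sample", "zip": "02139", "birth_date": "1982-06-03", "sex": "F"},
    {"name": "Kim Placeholder", "zip": "02141", "birth_date": "1975-11-20", "sex": "F"},
]


def link_records(medical, voters, keys=QUASI_IDENTIFIER):
    """Return (name, diagnosis) pairs where the quasi-identifier matches exactly one voter."""
    # Index voters by their quasi-identifier values.
    index = defaultdict(list)
    for voter in voters:
        index[tuple(voter[k] for k in keys)].append(voter)

    matches = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique match re-identifies the patient; ties stay ambiguous.
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


reidentified = link_records(medical, voters)
print(reidentified)
```

In this toy data only the first medical record is re-identified: the second matches two voters, and the third matches none.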

AOL Query Log Release

In August 2006, AOL released a query log containing 21 M search queries for 650 K users

No names or user identities included

Some users have been identified based just on their query histories

Even just linking search history with a person can be embarrassing


TrackMeNot: Firefox extension that issues randomized queries, hiding the user's real searches among decoys
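
The decoy idea can be sketched as follows. This is not TrackMeNot's actual implementation. The vocabulary, the function name with_decoys, and the number of decoys per real query are all assumptions made for illustration.

```python
import random

# Hypothetical vocabulary for building decoy queries.
DECOY_VOCAB = ["weather", "recipes", "news", "football", "movies", "traffic"]


def with_decoys(real_queries, decoys_per_query=2, rng=None):
    """Mix each real query with random decoys.

    Returns (query, is_real) pairs. The is_real flag is kept only so the
    example can be inspected; an observer of the stream would not see it.
    """
    rng = rng or random.Random()
    stream = []
    for query in real_queries:
        batch = [(query, True)]
        for _ in range(decoys_per_query):
            # A decoy is two distinct random words from the vocabulary.
            decoy = " ".join(rng.sample(DECOY_VOCAB, 2))
            batch.append((decoy, False))
        rng.shuffle(batch)  # hide the real query's position within its batch
        stream.extend(batch)
    return stream


stream = with_decoys(["flu symptoms", "cheap flights"], rng=random.Random(0))
for query, is_real in stream:
    print(query, "(real)" if is_real else "(decoy)")
```

Purely random decoys like these may be easy to tell apart from real queries, so this shows the mechanism only, not a privacy guarantee.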

Some information is public, but not indexed. Once the information is indexed and searchable, it suddenly becomes frightening.


IR Application

We could run a clustering algorithm and release only aggregates over clusters of similar records, rather than the individual records.
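
A minimal sketch of that idea, assuming each user is represented by the set of terms in their queries. The lecture does not name a clustering algorithm, so a simple single-pass "leader" clustering with Jaccard similarity is swapped in here for brevity. The threshold, the minimum cluster size, and all data are invented, and suppressing small clusters is an added assumption, not something from the lecture.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(user_terms, threshold=0.3):
    """Single-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_terms, [user_ids])
    for user, terms in user_terms.items():
        for leader_terms, members in clusters:
            if jaccard(terms, leader_terms) >= threshold:
                members.append(user)
                break
        else:
            # No similar leader found: this user starts a new cluster.
            clusters.append((terms, [user]))
    return [members for _, members in clusters]


def release_aggregates(user_terms, clusters, min_size=2):
    """Release per-cluster term counts; clusters smaller than min_size are suppressed."""
    released = []
    for members in clusters:
        if len(members) < min_size:
            continue  # too few users to hide any one of them
        counts = Counter()
        for user in members:
            counts.update(user_terms[user])
        released.append({"size": len(members), "term_counts": dict(counts)})
    return released


# Toy data: user ids and query terms are invented.
user_terms = {
    "u1": {"flu", "symptoms", "fever"},
    "u2": {"flu", "fever", "remedy"},
    "u3": {"cheap", "flights", "boston"},
    "u4": {"flights", "boston", "hotel"},
    "u5": {"rare", "disease", "name"},
}

clusters = leader_cluster(user_terms)
released = release_aggregates(user_terms, clusters)
print(clusters)
print(released)
```

The released output contains only cluster sizes and term counts, with no user ids. The singleton cluster for u5 is dropped, since an aggregate over a single user is just that user's record.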