CSCE 470 Lecture 35
Wednesday, November 20, 2013
Privacy
Caverlee grew up before the internet age, but adopted the internet later.
Our data wake is massive nowadays: everyone is tracking everything we do...
Thank heavens all of our data is locked up across many protected databases (not all together)
A lot of good can come from analyzing the data, but how can we release data without violating the privacy of users?
Some systems allow individual control of privacy settings (others don't)
Some databases have public counterparts that can be linked:
Massachusetts Governor
The governor of Massachusetts was re-identified by linking the two datasets below. The fields they share are enough to uniquely identify about 87% of the US population:
Medical:
- Name (removed before release)
- SSN (removed before release)
- Visit Date
- Diagnosis
- Procedure
- Medication
- Total Charge
- Zip
- Birth Date
- Sex
Voter Records:
- Name
- Address
- Date registered
- Party affiliation
- Date last voted
- Zip
- Birth Date
- Sex
The fields shared by both datasets (Zip, Birth Date, Sex) are called the "quasi-identifier": no single field names a person, but together they can single one out.
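The linking step can be sketched in a few lines of Python. This is only an illustration: the records are made up, and the field names (`zip`, `birth_date`, `sex`) and the helper `link` are assumptions chosen for the example, not taken from the real datasets.

```python
# Toy, made-up records for illustration only (not real data).
medical = [
    {"diagnosis": "flu", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"diagnosis": "asthma", "zip": "02139", "birth_date": "1962-03-14", "sex": "F"},
]
voters = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1962-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_date": "1971-07-30", "sex": "F"},
]

# Fields present in both datasets (hypothetical field names).
QUASI_IDENTIFIER = ("zip", "birth_date", "sex")


def link(medical, voters, keys=QUASI_IDENTIFIER):
    """Join the two datasets on the quasi-identifier.

    Returns (name, diagnosis) pairs for medical records that match
    exactly one voter, i.e. records that are re-identified.
    """
    # Index voters by their quasi-identifier values.
    index = {}
    for v in voters:
        index.setdefault(tuple(v[k] for k in keys), []).append(v)

    linked = []
    for m in medical:
        matches = index.get(tuple(m[k] for k in keys), [])
        if len(matches) == 1:  # a unique match pins down one person
            linked.append((matches[0]["name"], m["diagnosis"]))
    return linked


print(link(medical, voters))
```

Neither dataset alone ties a name to a diagnosis; the join does, whenever the quasi-identifier values are unique to one voter.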
AOL Query Log Release
In August 2006, AOL released a query log containing 21 million search queries from 650,000 users
No names or user identities included
Some users have been identified based just on their query histories
Even just linking search history with a person can be embarrassing
TrackMeNot: Firefox extension that issues randomized queries
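The idea of hiding real queries among randomized ones can be sketched as follows. This is not TrackMeNot's actual implementation; the vocabulary `DECOY_VOCAB` and the functions `decoy_query` and `obfuscate` are hypothetical names made up for this example.

```python
import random

# Hypothetical vocabulary used to build decoy queries.
DECOY_VOCAB = ["weather", "recipes", "football", "movies", "news", "music"]


def decoy_query(rng, vocab=DECOY_VOCAB, max_terms=3):
    """Build one random decoy query from 1 to max_terms distinct vocabulary words."""
    n = rng.randint(1, max_terms)
    return " ".join(rng.sample(vocab, n))


def obfuscate(real_queries, decoys_per_query, rng):
    """Mix each real query with random decoys and shuffle the resulting stream."""
    stream = []
    for q in real_queries:
        stream.append(q)
        stream.extend(decoy_query(rng) for _ in range(decoys_per_query))
    rng.shuffle(stream)
    return stream


rng = random.Random(0)  # seeded only to make the example repeatable
real = ["symptoms of flu", "cheap flights boston"]
stream = obfuscate(real, decoys_per_query=3, rng=rng)
for q in stream:
    print(q)
```

An observer of the query log sees eight queries, only two of which reflect the user's real interests. Purely random decoys may still be easy to tell apart from real queries, so this is a sketch of the idea rather than a privacy guarantee.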
Some information is public, but not indexed. Once the information is indexed and searchable, it suddenly becomes frightening.
IR Application
We could run a clustering algorithm over the users and release only aggregate data for clusters of similar users, rather than individual records.
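A minimal sketch of this idea, under stated assumptions: the notes do not name a clustering algorithm, so the example uses a simple greedy pass over Jaccard similarity of query terms. The `min_size` cutoff that suppresses small clusters is an added assumption, as are the toy users and all function names.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_users(user_terms, threshold=0.3):
    """Greedy single-pass clustering: a user joins the first cluster
    whose seed user is similar enough, otherwise starts a new cluster."""
    clusters = []  # list of (seed_terms, [user_ids])
    for user, terms in user_terms.items():
        for seed, members in clusters:
            if jaccard(seed, terms) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((terms, [user]))
    return [members for _, members in clusters]


def release_aggregates(user_terms, min_size=2, threshold=0.3):
    """Release per-cluster term counts only; drop clusters that are too small."""
    released = []
    for members in cluster_users(user_terms, threshold):
        if len(members) < min_size:
            continue  # a tiny cluster would expose an individual
        counts = Counter(t for u in members for t in user_terms[u])
        released.append({"size": len(members), "term_counts": dict(counts)})
    return released


# Toy, made-up query terms per user (illustration only).
user_terms = {
    "u1": {"flu", "symptoms", "fever"},
    "u2": {"flu", "fever", "remedy"},
    "u3": {"car", "insurance", "quotes"},
    "u4": {"car", "insurance", "rates"},
    "u5": {"rare", "disease", "name"},
}
released = release_aggregates(user_terms)
print(released)
```

The released data contains cluster sizes and term counts but no user IDs or per-user histories, and the lone user `u5` is left out entirely because a cluster of one is just an individual record.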