Richard Chow, Ian Oberst and Jessica Staddon
Sensitive documents often need to be shared, but doing so must not compromise privacy. A typical solution is to redact the sensitive parts, but the redacted information can often be recovered. Another approach is to sanitize the data by replacing specific terms with more general ones that hide the underlying data without destroying utility.
The researchers created a tool that helps the user discover privacy risks in a document. It scans the document's terms and checks each one's prevalence on the web and its linkability to known sensitive terms. The risky terms are highlighted to guide the user in sanitizing the document. As the user makes changes, the document is continuously rescored so the user can evaluate the effectiveness of those changes. The tool also suggests replacement terms that improve privacy.
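The scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the hit counts are made-up stand-ins for web search results, the sensitive term is assumed to be the subject's name, and the function names, the ratio used as a linkability score, and the 0.1 threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical page counts per term (made-up numbers, not real web data).
TERM_HITS: Dict[str, int] = {
    "carpenter": 1_000_000,
    "actor": 50_000_000,
    "indiana jones": 2_000_000,
    "adventure film": 5_000_000,
}

# Hypothetical counts of pages containing both the term and the
# sensitive term (assumed here to be the subject's name).
JOINT_HITS: Dict[str, int] = {
    "carpenter": 20_000,
    "actor": 500_000,
    "indiana jones": 1_500_000,
    "adventure film": 100_000,
}


def linkability(term: str) -> float:
    """Fraction of pages mentioning `term` that also mention the sensitive term."""
    hits = TERM_HITS.get(term, 0)
    if hits == 0:
        return 0.0  # unknown terms are treated as unlinked in this sketch
    return JOINT_HITS.get(term, 0) / hits


def risky_terms(terms: List[str], threshold: float = 0.1) -> List[Tuple[str, float]]:
    """Return the terms to highlight, most strongly linked first."""
    flagged = [(t, linkability(t)) for t in terms if linkability(t) >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


def document_risk(terms: List[str]) -> float:
    """Score a document by its single most strongly linked term."""
    return max((linkability(t) for t in terms), default=0.0)


# Rescoring after an edit: swap a specific term for a more general one.
original = ["actor", "carpenter", "indiana jones"]
sanitized = ["actor", "carpenter", "adventure film"]
print(risky_terms(original))
print(document_risk(original), document_risk(sanitized))
```

With these invented counts, only "indiana jones" is flagged in the original term list, and replacing it with the more general "adventure film" lowers the document's score. A real tool would need live prevalence data and a more careful scoring model than a single ratio.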
The study included twelve users who were instructed to sanitize two short biographies, one of Harrison Ford and one of Steve Buscemi. Some users behaved differently when they were using the tool than when they weren’t. It seemed that when users were dealing with unfamiliar topics, they relied on the tool’s judgement more than their own. The privacy achieved was measured by employing Amazon’s Mechanical Turk service to see whether the actors could be identified from the sanitized biographies. The study focused on preserving privacy, not on the resulting utility of the biographies.