dandanlen edited this page Jan 7, 2020 · 6 revisions

Motivation

Anonymized data from Diffix-protected datasets is inherently restricted. The analyst needs to be familiar with the imposed limitations and know the possible workarounds. The aim of this project is to build a system that automatically extracts a high-level picture of the shape of a given dataset whilst intelligently navigating the restrictions imposed by Diffix.

The most fundamental limitation of a Diffix-protected database is that you can't query any data that would uniquely (or even almost uniquely) identify a person in the database. As a result, the main way of extracting information about a given dataset is through aggregates. On their own, aggregate functions such as min, max, count and avg return only coarse-grained statistics of limited usefulness. However, by using techniques such as calculating aggregates over sub-ranges of the data, we can extract richer statistics, such as histograms, to aid the analyst in exploring the dataset.
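
To make the sub-range idea concrete, here is a minimal sketch in Python. It does not talk to Diffix and is not this project's implementation: it runs on a small in-memory list, and the `bucketed_counts` helper, the `min_count` threshold and the sample `ages` values are all hypothetical, chosen only for illustration. The threshold is a crude stand-in for the way near-unique results are withheld; a real Diffix-protected database applies its own rules on the server side.

```python
from collections import Counter


def bucketed_counts(values, width, min_count=5):
    """Count values per fixed-width sub-range, dropping sparse buckets.

    `min_count` is an illustrative assumption standing in for the
    suppression of results that cover too few individuals.
    """
    # Map each value to the lower bound of its bucket, e.g. 27 -> 20 for width 10.
    counts = Counter((v // width) * width for v in values)
    # Keep only buckets with enough entries, ordered by lower bound.
    return {lo: n for lo, n in sorted(counts.items()) if n >= min_count}


# Made-up sample data, for illustration only.
ages = [21, 22, 22, 23, 24, 25, 27, 28, 29, 29,
        31, 33, 34, 35, 36, 38, 39, 71]

histogram = bucketed_counts(ages, width=10, min_count=5)
for lower, count in histogram.items():
    print(f"{lower}-{lower + 9}: {count}")
```

The single value in the 70-79 range is dropped because it would single out one person, while the well-populated buckets survive and together give a rough histogram of the column.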
