B – Simple sample data

Closed Dec 7, 2020 95% complete

From our goals discussion:

Simple sample data

To get a good sense of a data set, an analyst is likely to want to take a peek at what it looks like.
In normal SQL, the equivalent query would be something like:

SELECT * FROM table LIMIT 10

With an anonymizing system such as Aircloak Insights, a query like the one above yields little of value: every row inevitably ends up being identifying, so the data gets anonymized away.

From our first goal of providing statistics over the individual columns, we already know a lot about what the data in the data set looks like. With some additional effort we can go beyond the statistical properties and start producing values that visually resemble the original data.

For example, we might capture properties such as:

A numerical column always contains two decimal places and represents monetary amounts. It might also be the case that cent values such as 00, 49, 50, 75, and 99 occur with above-average frequency.
A categorical string column follows a certain pattern, such as that of a social security number.
A categorical column contains email addresses.
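
As a rough illustration of how such properties might be detected, the sketch below reduces each string to a character-class "shape" and counts cent values. It is not taken from the project: the function names `shape`, `dominant_shape` and `cent_frequencies` and the 90% threshold are assumptions. It also assumes that a list of candidate values is already available, which in an anonymizing system would have to come from anonymized aggregate queries rather than raw rows.

```python
from collections import Counter


def shape(value: str) -> str:
    """Map digits to '9' and letters to 'A'; keep all other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def dominant_shape(values, min_share=0.9):
    """Return the most common shape if it covers at least min_share of the values.

    min_share is an arbitrary illustrative threshold, not a project setting.
    """
    if not values:
        return None
    counts = Counter(shape(v) for v in values)
    pattern, hits = counts.most_common(1)[0]
    return pattern if hits / len(values) >= min_share else None


def cent_frequencies(amounts):
    """Count the fractional part of amounts given as strings such as '12.99'."""
    return Counter(a.rsplit(".", 1)[1] for a in amounts if "." in a)
```

A column whose values all share the shape `999-99-9999` could then be flagged as following a social-security-number-like pattern, and a skewed result from `cent_frequencies` would hint at price-style amounts.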

The goal of this phase is to be able to produce small data sets of around 10 rows, with values that are within range and of the correct type for the corresponding columns. Furthermore, we want the data to superficially resemble the underlying data and, where possible, to capture simple column dependencies. For example, if we have one column for the car make and another for the model name, it should be possible to capture that Tesla and Cybertruck belong together, as do Ford and F100, and to avoid generating pairings such as Tesla and F100.
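
One simple way to keep dependent columns consistent is to sample them jointly, as pairs, instead of sampling each column on its own. The following is a minimal sketch under stated assumptions, not the project's implementation: the function `sample_rows`, the `price` column and the inputs `pair_counts` and `price_range` are hypothetical, and the pair counts and range are assumed to have been obtained from anonymized queries.

```python
import random


def sample_rows(pair_counts, price_range, n=10, seed=None):
    """Generate n sample rows with consistent (make, model) pairs.

    pair_counts: dict mapping (make, model) tuples to (anonymized) counts.
    price_range: (low, high) bounds for a hypothetical monetary column.
    """
    rng = random.Random(seed)
    pairs = list(pair_counts)
    weights = [pair_counts[p] for p in pairs]
    low, high = price_range
    rows = []
    # Drawing whole pairs means a make is never combined with a foreign model.
    for make, model in rng.choices(pairs, weights=weights, k=n):
        price = round(rng.uniform(low, high), 2)  # two decimal places
        rows.append({"make": make, "model": model, "price": price})
    return rows


if __name__ == "__main__":
    counts = {("Tesla", "Cybertruck"): 5, ("Ford", "F100"): 3}
    for row in sample_rows(counts, (1000.0, 90000.0), seed=1):
        print(row)
```

Because only observed pairs are ever drawn, a pairing such as Tesla and F100 cannot appear. The trade-off is that joint sampling needs counts per combination, which are more likely to be suppressed by the anonymizer than per-column counts.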
