Skip to content

Bucketing

dandanlen edited this page Jan 7, 2020 · 1 revision

Buckets are worth a page of their own as they are quite central to how numerical aggregates are collected and combined. Sorting a data set into buckets of a defined range and calculating aggregates for each bucket allow us to obtain fine-grained statistics while navigating the anonymity protections imposed by Diffix. These fine-grained statistic can be recombined to obtain global metrics such as the data distribution.

For example, counts of data points that fall into sub-ranges (buckets) can be combined to form a histogram:

SELECT 
    bucket(amount by 500),
    count(*)
FROM 
    loans
GROUP BY 1

Note the use of a custom bucket function to truncate the column value to the lower bound of its bucket

This query groups the amount column into buckets of size 500 and counts the number of values in each. If the original amount column contained values between 0 and 50_000, this query would return around at most 100 rows. There may be less than 100 rows returned because Diffix suppresses buckets with too few values.

In other words, there is a trade-off: if the bucket size is too small, the data will be suppressed. If the bucket size is too large, statistics are of limited use. One of the key challenges here is to find the optimal bucket size - sufficient precision without sacrificing too much accuracy.

Bucket resolutions

One way of finding the appropriate resolution is to break data up into ever smaller buckets until data contained in bucket is no longer accurate.

Eg. for numeric data we might want to query for multiple bucket sizes to obtain data at different resolutions:

 min: 0                                                                                     max: 100
|________________________________________bucket size: 100___________________________________________| <-- Global stats, limited usefulness
|______________________50________________________|_____________________50___________________________|
|_________20*_______|_________20________|_________20________|_________20________|_________20________|
|                   |___10*___|____10___|___10____|____10___|___10____|____10___|___10____|____10*__|
|                             |_5__|__5_|_5__|__5_|_5__|__5_|_5__|__5_|_5__|__5_|_5__|__5_|           

* Suppressed: Smaller bucket counts are not returned in the query but can be estimated from larger buckets.

Note that different-sized buckets can be returned from same query. For example, the following returns a count of the amount column sorted into buckets of 500, 200 and 100:

SELECT 
    bucket(amount by 500),
    bucket(amount by 200),
    bucket(amount by 100),
    count(*),
    grouping_id(
        bucket(amount by 500),
        bucket(amount by 200),
        bucket(amount by 100)
    )
FROM 
    loans
GROUP BY GROUPING SETS (1,2,3)

Note the use of the Aircloak-specific grouping_id function. This generates a bitmask that can used to identify the group that a particular row belongs to. See the docs.

This approach applies most obviously to numerical data, however the same principle can used for other types of data. For example, Text data columns can be sorted into buckets based on prefixes / postfixes / substrings. Date ranges can be bucketed similarly to numbers.

Bucket interpolation

Small buckets are more likely to be suppressed than larger ones. If some small buckets are suppressed, we can use the larger buckets to estimate the count for the suppressed buckets. For example, if a bucket of size 100 has a count of 50 and we break it into 5 buckets of size 20, the resulting smaller buckets may have counts of */*/12/12/16. In this case we know that 10 suppressed values belong to the two * buckets. We can simply apportion this evenly, or maybe according to some expected distribution, in order to estimate the size of the remaining buckets.

Bucket hierarchies

Diffix restricts the bucket sizes to multiples of i * (10 ^ n) where i is 1, 2 or 5. So valid bucket sizes are: 0.0001, 0.0002, 0.0005, 0.001, [...], 1, 2, 5, 10, 20, [...], 1_000, 2_000, 5_000, 10_000, ...etc

To help with interpolation it can be useful to define parent-child relationships between buckets. For numerical buckets, this is complicated by the fact that base-2 buckets do not divide nicely into base-5 buckets. For example, you can't divide buckets of size 50 into buckets of size 20, which is the next smaller bucket size.

One approach is to simply skip a generation for these combinations. So, the parent of a base-2 bucket is the next-larger base-1 bucket, and the child of a base-5 bucket is the next base-1 bucket.

A brief note about terminology

Bucket count: The number of values contained in the bucket

Bucket size: The range of the bucket

Bucket base: 1, 2 or 5

Lower bound: The lower bound of the bucket

Upper bound: The upper bound of the bucket

Clone this wiki locally