Update the documentation on small number suppression #1039

Open
wjchulme opened this issue Nov 23, 2022 · 1 comment

wjchulme commented Nov 23, 2022

I think the redaction / rounding section of the OpenSAFELY documentation could do with a bit of a refresh, both in terms of how we explain disclosure control / suppression and what we recommend.

We say here that the general principle is that any statistic describing 5 or fewer patients, either directly or indirectly, should be redacted. I wrote this (or maybe it's since been tweaked), but I now think it's misleading.

The key principle is the suppression of information about groups of size 5 or fewer. We shouldn't know the size of these groups (i.e., we should only know that the group size is 5 or fewer), and we shouldn't know any details beyond what we already know from how the group is defined. So for example, if we know there are "at most five people in [study population] with [attributes a,b,c]", we are not allowed to know anything more about them (e.g., average age, proportion who died). I think this principle is best expressed, in general terms, as something like:

All information about groups of size 5 or fewer should be suppressed. This includes the size of the group and any other demographic or clinical information such as the average age, proportion with a particular disease, or number who have died. This principle applies for both primary and secondary disclosure, so if any additional information about such a group can be inferred from information released elsewhere then further suppression is required.

I think this wording is better as it doesn't imply that redaction is necessary for suppression, and it doesn't use the term "statistic" which is probably unhelpful.

A few examples then would help explain these principles in practice, using redaction and/or rounding as the means of suppression. The existing examples are probably good enough for this but we should review to make sure they focus on our primary recommendations for disclosure control.
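
One possible illustration of the principle, as a rough sketch only: the rows, field names and the `suppress_small_groups` helper below are made up for this example and are not part of any OpenSAFELY tooling. It covers primary suppression only; secondary disclosure (e.g. via released totals) would need separate handling.

```python
THRESHOLD = 5  # groups of this size or smaller are suppressed


def suppress_small_groups(rows, threshold=THRESHOLD):
    """Blank out every released value for groups at or below the threshold."""
    released = []
    for row in rows:
        if row["n"] <= threshold:
            # Keep only the group definition; hide its size and all other details.
            released.append(
                {"group": row["group"], "n": None, "mean_age": None, "deaths": None}
            )
        else:
            released.append(dict(row))
    return released


# Hypothetical summary table, for illustration only.
rows = [
    {"group": "a", "n": 240, "mean_age": 61.2, "deaths": 12},
    {"group": "b", "n": 4, "mean_age": 83.5, "deaths": 1},
]
released = suppress_small_groups(rows)
```

Here group "b" keeps its definition but loses its size, average age and death count, so a reader learns only that the group has 5 or fewer members.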

@wjchulme wjchulme self-assigned this Nov 23, 2022
@wjchulme

In terms of what we want to recommend for suppression, I think we focus too much on redaction of counts and should focus instead on rounding of counts. As discussed here, there is some confusion about different approaches to rounding, so we also need a summary and worked examples of rounding, with some recommendations for the most appropriate approach in any given scenario.

Repeating what I said in the thread, which needs to be refined:

We need to suppress all counts of 5 or fewer, including counts that can be recovered through differencing.

Redaction is vulnerable to differencing, which is why we recommend rounding.
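
A minimal sketch of the differencing problem, using made-up counts (the `redact` helper and the table are hypothetical and only for illustration):

```python
THRESHOLD = 5


def redact(count, threshold=THRESHOLD):
    """Return None (standing in for "[REDACTED]") for counts at or below the threshold."""
    return None if count <= threshold else count


# Hypothetical counts, for illustration only.
counts = {"group_a": 120, "group_b": 3}
total = sum(counts.values())  # released alongside the table

released = {name: redact(n) for name, n in counts.items()}

# Differencing: subtract the visible cells from the total to recover the redacted one.
recovered_b = total - sum(n for n in released.values() if n is not None)
```

With a single redacted cell and an unredacted total, `recovered_b` equals the true count for `group_b`, so the redaction has suppressed nothing.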

Rounding to the nearest multiple of r is unbiased. But for non-negative values (like counts), the binwidth is r everywhere except for the lowest bin, where the binwidth is ceiling(r/2). E.g., rounding to r=6, the lowest bin is [1,2,3] -> 0, so we break the suppression rule. We also cannot distinguish true zeros from rounded-down-to-zeros.
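
As a small sketch of the lowest-bin issue (the `round_nearest` helper is hypothetical; it relies on Python's built-in `round()`, which sends exact ties to the even multiple, so other tie-breaking conventions will place 3 differently):

```python
def round_nearest(count, r=6):
    """Round to the nearest multiple of r, using Python's round() for ties."""
    return r * round(count / r)


rounded = {n: round_nearest(n) for n in range(0, 10)}

# Every true count that is released as 0.
zero_bin = [n for n, value in rounded.items() if value == 0]
```

Under this convention `zero_bin` holds 0, 1, 2 and 3, so a released 0 could be a true zero or a small non-zero count, and it pins the count down more tightly than "5 or fewer".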

Alternatives:

  • Round counts to a higher number, like 10. Then we lose some precision, and small non-zero counts can still be rounded to zero.
  • Use rounding as usual, but combine the lowest 2 bins. This introduces a tiny bit of bias.
  • Round up, using a ceiling function. So [1,2,3,4,5,6] -> 6 and [7,8,9,10,11,12] -> 12 etc. Zeros are zeros, and we can say "a rounded count of x must really be one of [x-5, x-4, ..., x]". But now it's biased: mean(X_rounded) is higher than mean(X), by roughly r/2.
  • Fix the bias in the previous option by deducting r/2. So now [1,2,3,4,5,6] -> 3 and [7,8,9,10,11,12] -> 9 etc. Now zeros are zeros and X_rounded is close to unbiased.
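
The last two options could be sketched roughly as follows. The function names are hypothetical, counts are assumed to be non-negative integers, and `r // 2` is an assumption made here so that results stay whole numbers when r is odd:

```python
def round_up(count, r=6):
    """Ceiling rounding to a multiple of r; zero stays zero."""
    return (count + r - 1) // r * r


def round_midpoint(count, r=6):
    """Ceiling rounding, then deduct r/2 (integer division); zero stays zero."""
    if count == 0:
        return 0
    return round_up(count, r) - r // 2


def possible_true_counts(rounded, r=6):
    """True counts consistent with a non-zero ceiling-rounded value."""
    return list(range(rounded - r + 1, rounded + 1))


up = [round_up(n) for n in range(0, 13)]
mid = [round_midpoint(n) for n in range(0, 13)]
```

With r=6, `up` maps 1 to 6 onto 6 and 7 to 12 onto 12, `mid` maps the same bins onto 3 and 9, and both leave 0 as 0.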

The bias is only really an issue in certain scenarios, such as estimating mortality rates. If you're just counting things, and you know that a 6-rounded value of x actually means the underlying value is one of [x-5, x-4, ..., x], then it's not a big deal. But for consistency within an entire project, it's helpful to choose one method and stick to it, rather than having to explain different methods for different quantities.
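
To get a feel for the size of the bias, here is a quick check under the assumption that true counts are spread evenly over 1 to 12 (the helper names are hypothetical and the even spread is an assumption for illustration, not a claim about real data):

```python
def round_up(count, r=6):
    """Ceiling rounding to a multiple of r."""
    return (count + r - 1) // r * r


def round_up_minus_half(count, r=6):
    """Ceiling rounding with r/2 deducted (integer division); zero stays zero."""
    return round_up(count, r) - r // 2 if count > 0 else 0


def mean_bias(rounder, counts):
    """Average of (rounded - true) over the supplied counts."""
    counts = list(counts)
    return sum(rounder(c) - c for c in counts) / len(counts)


counts = range(1, 13)  # two full bins for r=6
bias_up = mean_bias(round_up, counts)
bias_mid = mean_bias(round_up_minus_half, counts)
```

Under that assumption the ceiling version overshoots by 2.5 on average, i.e. (r-1)/2 for integer counts, and the deducted version undershoots by 0.5, so the remaining bias is small but not exactly zero.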

@StevenMaude StevenMaude added the improve content Improve existing documentation content label Dec 1, 2022