I think the redaction / rounding section of the opensafely documentation could do with a bit of a refresh, both in terms of how we explain disclosure control / suppression and what we recommend.
We say here that the general principle is that any statistic describing 5 or fewer patients, either directly or indirectly, should be redacted. I wrote this (or maybe it's since been tweaked), but I now think it's misleading.
The key principle is the suppression of information about groups of size 5 or fewer. We shouldn't know the size of these groups (ie, we should only know if the group-size is 5 or fewer), and we shouldn't know any details in addition to what we already know about how that group is defined. So for example, if we know there are "at most five people in [study population] with [attributes a,b,c]" we are not allowed to know anything more about them (eg, average age, proportion who died...). I think this principle is best expressed, in general terms, as something like:
All information about groups of size 5 or fewer should be suppressed. This includes the size of the group and any other demographic or clinical information such as the average age, proportion with a particular disease, or number who have died. This principle applies for both primary and secondary disclosure, so if any additional information about such a group can be inferred from information released elsewhere then further suppression is required.
I think this wording is better as it doesn't imply that redaction is necessary for suppression, and it doesn't use the term "statistic" which is probably unhelpful.
A few examples then would help explain these principles in practice, using redaction and/or rounding as the means of suppression. The existing examples are probably good enough for this but we should review to make sure they focus on our primary recommendations for disclosure control.
In terms of what we want to recommend for suppression, I think we focus too much on redaction of counts and should focus instead on rounding of counts. As discussed here, there is some confusion about different approaches to rounding, so we also need a summary and worked examples of rounding, with some recommendations for the most appropriate approach in any given scenario.
Repeating what I said in the thread, which needs to be refined:
We need to suppress all counts of 5 or fewer, including counts that can be recovered through differencing.
Redaction is vulnerable to differencing, which is why we recommend rounding.
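To make the differencing problem concrete, here is a minimal sketch. The group names, counts, and the `redact` / `recover_by_differencing` helpers are all made up for illustration and are not part of any OpenSAFELY tooling; it also assumes the overall total is released unredacted.

```python
THRESHOLD = 5  # counts at or below this are suppressed


def redact(count, threshold=THRESHOLD):
    """Return None (meaning "[REDACTED]") for counts at or below the threshold."""
    return None if count <= threshold else count


def recover_by_differencing(released, total):
    """If exactly one cell is redacted, recover it from the released total."""
    redacted = [name for name, n in released.items() if n is None]
    if len(redacted) != 1:
        return None  # nothing (or too much) is hidden for simple differencing
    return total - sum(n for n in released.values() if n is not None)


# Hypothetical true counts, invented for this example.
true_counts = {"group_a": 97, "group_b": 3}
total = sum(true_counts.values())  # released alongside the table

released = {name: redact(n) for name, n in true_counts.items()}
recovered = recover_by_differencing(released, total)

print(released)   # group_b is hidden in the table...
print(recovered)  # ...but total - group_a gives it back
```

Redacting group_b alone is not enough here: either the total or group_a would also need to be suppressed (secondary suppression), or all three values rounded.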
Rounding to the nearest multiple of r is unbiased. But for non-negative values (like counts), the binwidth is r everywhere except for the lowest bin, where the binwidth is ceiling(r/2). (eg rounding to r=6, the lowest bin is [1,2,3] -> 0, so we break the suppression rule.) We also cannot distinguish true zeros from rounded-down-to-zeros.
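A quick sketch of the bins produced by rounding to the nearest multiple of 6. `round_to_nearest` is a hypothetical helper, and it relies on Python's built-in `round`, which sends exact halves to the nearest even integer (so 3 goes to 0, matching the example above); other half-rounding rules would move the boundary values.

```python
from collections import defaultdict


def round_to_nearest(count, r=6):
    """Round a non-negative count to the nearest multiple of r.

    Uses Python's round(), which rounds exact halves to the even neighbour.
    """
    return r * round(count / r)


# Group the true counts 0..12 by the value that would be released.
bins = defaultdict(list)
for n in range(0, 13):
    bins[round_to_nearest(n)].append(n)

for released_value, members in sorted(bins.items()):
    print(released_value, members)
```

The lowest bin holds only 0 to 3, so a released 0 could be a true zero or a small non-zero count, and it narrows the count more tightly than "5 or fewer".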
Alternatives:
Round counts to a higher number, like 10. Then we lose some precision, and still don't have non-zero-preservation (small non-zero counts still round down to zero).
Use rounding as usual, but combine the lowest 2 bins. This introduces a tiny bit of bias.
Round up, using a ceiling function. So [1,2,3,4,5,6] -> 6 and [7,8,9,10,11,12] -> 12 etc. Zeros are zeros. We can say "a rounded count of x must really be one of [x-5, x-4, ..., x]". But now it's biased: mean(X_rounded) is higher than mean(X), by about r/2.
Fix the bias in the previous option by deducting r/2. So now [1,2,3,4,5,6] -> 3 and [7,8,9,10,11,12] -> 9 etc. Now, zeros are zeros, and X_rounded is approximately unbiased.
The bias is only really an issue in certain scenarios. Like for estimating mortality rates, say. If you're just counting things, and you know that a 6-rounded value of x actually means the underlying value is one of [x-5, x-4, ..., x] then it's not a big deal. But for consistency within an entire project, it's helpful to choose one method and stick to it, rather than have to explain different methods for different quantities.
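The round-up and midpoint options above can be sketched as follows. The function names are hypothetical, r is assumed to be even, and the bias figures assume true counts spread evenly over 1..12, which real data will not match exactly.

```python
def round_up(count, r=6):
    """Round a non-negative integer count up to the next multiple of r (0 stays 0)."""
    return r * ((count + r - 1) // r)


def round_midpoint(count, r=6):
    """Round up to a multiple of r, then deduct r/2; zeros stay zero.

    Assumes r is even so that r // 2 is exactly r/2.
    """
    up = round_up(count, r)
    return up - r // 2 if up > 0 else 0


def mean(values):
    return sum(values) / len(values)


# Illustrative assumption: true counts uniform over 1..12.
true_counts = list(range(1, 13))

bias_up = mean([round_up(n) for n in true_counts]) - mean(true_counts)
bias_mid = mean([round_midpoint(n) for n in true_counts]) - mean(true_counts)

print([round_up(n) for n in (0, 1, 6, 7, 12)])
print([round_midpoint(n) for n in (0, 1, 6, 7, 12)])
print(bias_up, bias_mid)
```

Under this uniform assumption, rounding up overstates the mean by 2.5, which is (r-1)/2 and close to the r/2 quoted above, while the midpoint version is left with a half-count offset of -0.5.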