Into the (Data) Breach: open data, privacy and re-identification

About fifty people came along to hear Dr Chris Culnane present on open data, privacy and re-identification on Nov 30 at an event I organised on behalf of the Open Knowledge Foundation.

Culnane is a cryptographer with the Department of Computing & Information Systems at Melbourne University and led the effort to re-identify the Medicare/Pharmaceutical Benefits Scheme data that had been published on the Commonwealth government data repository. This dataset was published as de-identified, but poor practices meant that researchers could easily reverse engineer the methods used.

The dataset contained longitudinal data over a period of three decades on 10% of Medicare patients or approximately 3 million people.

This significant breach of privacy of our most sensitive personal information was enabled by legislation that allows administrative data collected from us in the course of receiving health care to be re-purposed for research without our informed consent. Rather than rein in the potential for further breaches of privacy, the recently released Productivity Commission Data Inquiry Report recommends that this ability to repurpose administrative data without our consent be expanded from health data to cover all research goals determined to be in the public interest (Draft Recommendation 5.2).

The Productivity Commission Report attempts to weigh the collection, storage and use of data for the public good against the risks and counter-arguments.

Open to debate is the question of whether this risk can be measured and managed at all. Re-identification risk increases with the number of data points available on an individual. When datasets are linked together to provide more information on each individual contained within them, the risk of those individuals being identified increases.
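To make the point about data points concrete, here is a minimal sketch in Python. The records are entirely made up for illustration (they are not drawn from the MBS/PBS dataset or any real source), and the field choice and the function names `unique_fraction` and `fractions_by_field_count` are my own assumptions. It simply counts what share of records become one of a kind as more attributes are known about each person.

```python
from collections import Counter

# Hypothetical, invented records for illustration only:
# (birth_year, postcode, sex, month_of_procedure)
RECORDS = [
    (1980, "3000", "F", "Jan"),
    (1980, "3000", "F", "Mar"),
    (1980, "3000", "M", "Jan"),
    (1980, "3051", "F", "Jan"),
    (1975, "3000", "F", "Jan"),
    (1975, "3000", "F", "Jan"),
    (1975, "3051", "M", "Feb"),
    (1975, "3051", "M", "Jun"),
]


def unique_fraction(records, n_fields):
    """Share of records whose first n_fields values match no other record."""
    keys = [record[:n_fields] for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


def fractions_by_field_count(records):
    """Unique share when 1, 2, ... up to all fields are known."""
    width = len(records[0])
    return [unique_fraction(records, n) for n in range(1, width + 1)]


if __name__ == "__main__":
    for n, share in enumerate(fractions_by_field_count(RECORDS), start=1):
        print(f"{n} field(s) known: {share:.1%} of records are unique")
```

In this toy data nobody is unique on birth year alone, but three quarters of the records are unique once all four fields are combined. Linking datasets has the same effect as adding fields: each extra attribute shrinks the crowd an individual can hide in.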

This is a significant matter in the context of the government’s plans to open up the flow of data between agencies and jurisdictions for the specific purpose of creating linked dataset projects, the most notorious of those at present being the plans to use 2016 Census data to link together multiple administrative datasets. The Productivity Commission report makes these plans explicit, proposing a new Data Sharing and Release Act to enable cross-portfolio and cross-jurisdictional linkage by overriding all other legislation (where there are conflicts).

The intention of this new Act is to provide the legislative environment for linking administrative data on a scale never seen before in Australia. While there is some merit in this idea, there is also valid cause for concern, evidenced by the government’s demonstrated inability to successfully de-identify data, manage data collection or retain social licence for its actions.

The government’s response, which would criminalise not only the re-identification of supposedly anonymised open data but also the mere knowledge that datasets may be open to re-identification, has been condemned as a knee-jerk reaction with little thought for the consequences.

The Privacy Amendment (Re-identification Offence) Bill is currently the subject of a public Inquiry by the Senate Legal and Constitutional Affairs Legislation Committee, which is accepting submissions until December 16 for reporting by February 7, 2017.

The Privacy Act to date has applied only to government agencies and organisations with an annual turnover exceeding $3 million. The re-identification amendment, however, extends the reach of the Act where open data is involved, applying criminal sanctions to individuals, and applies this liability retrospectively.

This punitive response to attempts by the community to point out the government’s own incompetence in handling our personal information has done little to restore trust. As stated by the Melbourne University researchers in their submission to the Inquiry into the re-identification Bill:

A key theme of our questions is to distinguish between re-identification per se, and the use of de-identified government data to do harm. The two are not the same. We believe that re-identification should not be a crime, though some uses of government data should be.

The team is concerned that the threat of criminal penalties will inhibit research and the fast identification of improper de-identification, which will actually increase the risk that sensitive information about us will fall into the wrong hands. It is argued that people with less honourable intentions will find and act on data breaches before the research community can respond.

As the team states, had the new rules been in place (or presumably had they known they would be put in place and backdated) they would never have brought our attention to the false claims made by the government that the data was de-identified.

If the new rules had been in place in September, we would not have discovered the problem in the MBS/PBS dataset encryption, the dataset would probably still be up, and the government could be unaware it was insecure.

It can be deduced from this that the knee-jerk response to criminalise re-identification has done nothing to better balance the risks and rewards of making use of government held datasets.

During his presentation on November 30, Culnane argued against the release of CURF-level data (data about individuals) as open data, the argument being that the perturbation required to anonymise the data nullifies its use for research purposes yet provides inadequate security against re-identification. Many open data advocates believe that the more data is made public the better, yet there are good reasons why this wholesale approach to sharing data with unknown users for unspecified purposes should be reconsidered.

The Productivity Commission appears to agree with this caution, offering up a framework for providing access to sensitive datasets based on trusted users, with the open version of the data being restricted to aggregate forms of that dataset.


While the notion of trusted users needs to be debated and defined, this approach would be preferable to providing administrative data to the public without our informed consent and with no certainty as to its anonymity.
