Statistical Disclosure Control

Contact:

Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060

Last update: 10 Oct 2011

The CASC project

The general aim of this project is to advance the development of practical tools in the field of statistical confidentiality. The project builds on the results achieved by the SDC-project, which was funded under the 4th Framework. The newly formed consortium will take over the products resulting from the SDC-project and raise them to a higher level. The development of these tools plays a central role in the construction of this consortium. A rather large group of experts from 5 European countries has been brought together. Because of the size of this group, we have opted for a detailed description of the work to be undertaken. This is a first binding force in this project: all partners know what is expected of each other. The project is divided into two main streams. The first stream is devoted to the disclosure control of microdata and the second to that of tabular data.

Microdata

Microdata is the topic of WP1.1, WP1.2 and WP2. WP2 concentrates on the implementation of the research carried out in WP1.1 and WP1.2. It is foreseen that several new techniques for disclosure protection will be implemented. These new techniques are needed because currently used methods such as global recoding and local suppression are inadequate for the protection of business microdata. The new techniques under investigation are micro-aggregation, noise addition, PRAM (post-randomisation) and masking techniques. A researcher at the main coordinator's office is preparing a thesis on PRAM; the results will be implemented in µ-ARGUS. Noise addition and masking techniques will be studied, and a special study will be undertaken into an alternative method for business data that preserves the individual profile of each unit. Micro-aggregation will be studied as an alternative. In addition to these new techniques for disclosure protection, disclosure risk models will be investigated. These models help to assess the safety of a protected microdata file. A study on record level measures will result in a research report on noise addition. These latter studies will yield research that might be implemented in ARGUS during the CASC-project, but may be implemented only after the foreseen scope of this project.
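To make two of the techniques named above more concrete, the following is a minimal Python sketch of fixed-size micro-aggregation on a single numeric variable and of PRAM on a single categorical variable. It is an illustration only and not the µ-ARGUS implementation: the function names, the group size k and the transition probabilities are all assumptions chosen for the example.

```python
import random
from typing import Dict, List, Sequence


def microaggregate(values: Sequence[float], k: int = 3) -> List[float]:
    """Replace each value by the mean of its group of at least k sorted neighbours.

    Hypothetical helper: univariate, fixed-size grouping; k is an assumed parameter.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    # Cut the sorted order into groups of k; the last group absorbs the remainder.
    starts = list(range(0, len(order) - len(order) % k, k))
    for g, start in enumerate(starts):
        end = start + k if g < len(starts) - 1 else len(order)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


def pram(categories: Sequence[str],
         transition: Dict[str, Dict[str, float]],
         rng: random.Random) -> List[str]:
    """Randomly recode each category using the transition row of its current value.

    Hypothetical helper: transition[a][b] is the assumed probability that
    category a is published as category b; each row should sum to 1.
    """
    out = []
    for value in categories:
        row = transition[value]
        targets = list(row)
        weights = [row[t] for t in targets]
        out.append(rng.choices(targets, weights=weights, k=1)[0])
    return out


if __name__ == "__main__":
    turnover = [10, 200, 12, 11, 210, 190]
    print(microaggregate(turnover, k=3))

    # Assumed transition matrix: each category is kept with probability 0.8.
    matrix = {
        "retail": {"retail": 0.8, "wholesale": 0.2},
        "wholesale": {"retail": 0.2, "wholesale": 0.8},
    }
    print(pram(["retail", "wholesale", "retail"], matrix, random.Random(0)))
```

Micro-aggregation as sketched here preserves the total of the variable while hiding individual values inside groups of at least k units. PRAM deliberately misclassifies some records, and because the transition matrix is known, an analyst can in principle correct aggregate estimates for it.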

A simulation of the intruder will be a topic of WP5, where attempts will be made to undo the disclosure protection. Another important study concerns the effects on the analytical power of the protected microdata file, i.e. how well suited these protected microdata files are for statistical analysis projects.
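One common way to simulate an intruder is distance-based record linkage: each protected record is matched to the nearest record in the original file, and the share of correct matches is taken as a rough risk indicator. The sketch below illustrates that idea in Python. It is not the method specified for WP5; the function names are hypothetical, and it assumes that protected[i] was derived from original[i] and that the variables are already on comparable scales.

```python
from typing import Sequence, Tuple

Record = Tuple[float, ...]


def nearest_index(target: Record, candidates: Sequence[Record]) -> int:
    """Index of the candidate with the smallest squared Euclidean distance to target."""
    def dist2(a: Record, b: Record) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(candidates)), key=lambda j: dist2(target, candidates[j]))


def linkage_rate(original: Sequence[Record], protected: Sequence[Record]) -> float:
    """Share of protected records whose nearest original record is the true one.

    Assumes protected[i] was derived from original[i] (illustrative setup only).
    """
    if not original or len(original) != len(protected):
        raise ValueError("files must be non-empty and of equal length")
    hits = sum(
        1 for i, record in enumerate(protected)
        if nearest_index(record, original) == i
    )
    return hits / len(protected)


if __name__ == "__main__":
    original = [(1.0, 10.0), (5.0, 50.0), (9.0, 90.0)]
    lightly_perturbed = [(1.2, 11.0), (4.7, 48.0), (9.1, 93.0)]
    fully_averaged = [(5.0, 50.0)] * 3
    print(linkage_rate(original, lightly_perturbed))
    print(linkage_rate(original, fully_averaged))
```

A high linkage rate suggests that the protection can largely be undone, while a low rate usually comes at the price of reduced analytical power, which is the trade-off described above.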

The different approaches to this topic are justified by the need for safe business microdata files, for which few solutions are available. Implementing these methods in ARGUS will make them easy to apply, which will lead to growing insight into their quality and applicability. In the long run we might reach a common opinion on recommendations for the generation of safe business microdata files. Eventually this offers the possibility of European harmonisation.

Tabular data

WP3, WP4.1 and WP4.2 play a similar central role in this topic. τ-ARGUS for tabular data, resulting from the SDC-project, covers the disclosure protection of simple unstructured tables up to dimension 3. A central role in the disclosure protection of tables is played by the dominance rule. This rule states which cells in a table are unsafe and therefore cannot be published. Because of the marginals in a table, it is often easy to recalculate these suppressed cells, so additional cells must be suppressed to prevent recalculation of the primary unsafe cells. It is not enough to prevent exact recalculation; a safety range must also be guaranteed to protect the primary unsafe cells. The optimal selection of these secondary cells, so as to avoid unnecessarily high losses in the information content of the protected tables, is a very complex numerical optimisation problem.
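The two ideas in this paragraph can be illustrated with a short Python sketch. It uses the usual (n, k) formulation of the dominance rule, under which a cell is unsafe if its n largest contributors account for more than k percent of the cell total, and it shows how a single suppressed cell in a row is recovered from the published row total. The function names and the parameter values n = 1 and k = 75 are assumptions made for the example, not settings prescribed by the project.

```python
from typing import Optional, Sequence


def is_unsafe_dominance(contributions: Sequence[float], n: int = 1, k: float = 75.0) -> bool:
    """(n, k) dominance rule: unsafe if the n largest contributions exceed k% of the total.

    n and k are assumed example parameters.
    """
    total = sum(contributions)
    if total <= 0:
        return False
    top = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top / total > k


def recover_single_suppressed(published: Sequence[Optional[float]], row_total: float) -> float:
    """Recalculate the one suppressed cell (None) in a row from the published row total."""
    if sum(1 for value in published if value is None) != 1:
        raise ValueError("exactly one suppressed cell is required")
    return row_total - sum(value for value in published if value is not None)


if __name__ == "__main__":
    # One enterprise contributes 90 of 100: the cell is dominated and unsafe.
    print(is_unsafe_dominance([90, 5, 5], n=1, k=75))
    # Suppressing only that cell does not help while the marginal is published.
    print(recover_single_suppressed([20, None, 30], row_total=100))
```

The second function shows why primary suppression alone is insufficient: as long as a row or column contains exactly one suppressed cell, its value follows directly from the marginal.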

Although the τ-ARGUS resulting from the SDC-project offers a solution for unstructured tables, it cannot be applied to many tables in the daily life of a statistical office, because they have a hierarchical structure. These hierarchical structures imply many more (sub-)marginals, which can be used to recalculate the primary suppressed cells. Linked tables, which have some marginals in common, must also be treated simultaneously. This makes the optimisation problem of finding the optimum suppression pattern much harder still. Even for renowned researchers in the field of numerical optimisation this is a very hard problem. Nevertheless we aim at a solution. The main approach can be found in WP4.1, which deals with the research required to specify the new models before implementation and testing. A second, supporting approach is based on network flow algorithms. Besides these complex optimisation approaches we will develop and implement heuristic methods, which aim at a much quicker, good but only near-optimal solution. For several tables this might prove to be adequate. Some of these methods are already available in a basic form (e.g. GHQUAR), but we will extend τ-ARGUS to facilitate access to these heuristic methods. Another approach builds on the non-hierarchical solutions already available, by breaking down the big hierarchical table into several subproblems. One of the outcomes of these WPs is the composition of a set of test tables. These tables will serve as a test bench for the optimisation procedures and be of vital value to the researchers in numerical optimisation in finding the right solutions.
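To give a flavour of what a quick heuristic for secondary suppression can look like, here is a deliberately simple greedy sketch in Python for a flat two-dimensional table. It is not GHQUAR, not a network flow method and not the τ-ARGUS algorithm. It only ensures that no row or column is left with exactly one suppressed cell, so it blocks recalculation from a single marginal but gives no guarantee on safety ranges and ignores hierarchies and linked tables. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Cell = Tuple[int, int]


def secondary_suppress(table: Sequence[Sequence[float]], primary: Set[Cell]) -> Set[Cell]:
    """Greedily add suppressions until no row or column has exactly one suppressed cell.

    Illustrative heuristic only: the result is a valid pattern in that sense,
    but it is generally not the pattern with the least information loss.
    """
    n_rows, n_cols = len(table), len(table[0])
    if n_rows < 2 or n_cols < 2:
        raise ValueError("table must have at least two rows and two columns")
    suppressed = set(primary)

    def row_cells(r: int) -> List[Cell]:
        return [(r, c) for c in range(n_cols)]

    def col_cells(c: int) -> List[Cell]:
        return [(r, c) for r in range(n_rows)]

    def fix(line: List[Cell], crossing: Callable[[Cell], List[Cell]]) -> bool:
        """If the line has exactly one suppressed cell, suppress one more cell in it."""
        if sum(1 for cell in line if cell in suppressed) != 1:
            return False
        candidates = [cell for cell in line if cell not in suppressed]

        def key(cell: Cell):
            # Prefer cells whose crossing line already holds a suppression,
            # then the smallest value, to limit the information lost.
            crossing_has_hidden = any(x in suppressed for x in crossing(cell))
            return (not crossing_has_hidden, table[cell[0]][cell[1]], cell)

        suppressed.add(min(candidates, key=key))
        return True

    changed = True
    while changed:
        changed = False
        for r in range(n_rows):
            changed |= fix(row_cells(r), lambda cell: col_cells(cell[1]))
        for c in range(n_cols):
            changed |= fix(col_cells(c), lambda cell: row_cells(cell[0]))
    return suppressed


if __name__ == "__main__":
    interior = [
        [20, 50, 10],
        [8, 19, 22],
        [17, 32, 12],
    ]
    pattern = secondary_suppress(interior, {(0, 0)})
    print(sorted(pattern))
```

In this small example the primary cell at (0, 0) ends up as one corner of a rectangle of four suppressed cells. An optimisation approach would instead search for the pattern with the least information loss that also satisfies the required safety ranges, which is what makes the general problem so hard.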

Dissemination

That the results of this project will be used in real-life situations is of course its major objective. The project team has been composed in such a way that the primary users, i.e. the NSIs, are active members. Seven statistical offices (5 national and two regional) participate in the project, either actively in the various stages of the development or as testers of the results. This reflects the needs and the interest of the NSIs for this kind of tool. In addition, the project team will present the results of this project at various conferences that potential users can be expected to attend. A very good example of such a conference is the work session on Statistical Disclosure Control organised jointly by Eurostat and the UN/ECE (Geneva), the first of which was held in Thessaloniki, Greece. This is planned to be the first in a series, as the second work session is to be held in Skopje in 2001. Participants at this conference come not only from the Western European countries but from the whole of Europe. This is a good opportunity to disseminate the results in that part of Europe as well. The series of ETK-conferences is also a suitable platform.

The group will set up a WEB-site, where the results and milestones of this project will be made available to the end-users. We expect this WEB-site to play a central role as the information node for the topic of Statistical Disclosure Control. This central WEB-site will also serve as a binding factor in the CASC-project.

Besides this, the major players in this consortium will participate in the AMRADS project in CPA8, i.e. the project to facilitate the results and knowledge acquired so far in the European projects. Statistical Disclosure Control is one of the topics.