Microdata: new disclosure risk assessment methodology (WP 1.2)
Leading partner: URV
Participating partners: Soton, IStat
This workpackage is devoted to research on assessing the disclosure risk
for microdata at the individual record level. The work focuses mainly on unperturbed
microdata (e.g. microdata resulting from sampling a population of records),
which are the kind of microdata released by many major statistical offices today
(e.g. U.S. Bureau of the Census, ONS). Unlike WP 1.1, this workpackage does not
deal with the development of new SDC masking methods. The two workpackages are thus
complementary, and both aim to improve µ-ARGUS. The objectives of this
workpackage, broken down by task, are as follows:
Task T1 (responsible Soton)
Objectives
Disclosure risk can be measured at either the file level or the record level.
Record-level measures are useful in conjunction with disclosure limitation
methods that are applied at the record level, for example local suppression.
The objectives of this task are:
- To extend the methods proposed by
Skinner and Holmes (see references in the Description of Work below)
to allow for misclassification of the key variables
- To investigate the application of record-linkage ideas
(see references in the Description of Work below) to record-level measures of
disclosure risk
- To investigate record-level measures of risk within the framework for µ-ARGUS
Description of the work
Skinner and Holmes (1998) consider records r with key variable values x(r) in
the microdata and corresponding external units r* with values x(r*).
Writing r = r* if record r belongs to external unit r*, a measure of risk is
Pr(r=r*|x(r), x(r*))   (a)
if all population units s* are included in the microdata file with equal probability.
In the case of no misclassification this probability reduces to 1/Fx(r), where Fx(r)
is the number of units in the population with key variables x(r).
When measurement error is present the terms in (a) may be expressed in terms of
misclassification probabilities. The aim would be to develop such measures,
extending the approach of Skinner and Holmes (1998) and drawing on the theory of
misclassification (Kuha and Skinner, 1997) and record linkage (Copas and Hilton, 1990;
Winkler, 1998).
These measures will depend on specified assumptions about the nature
and degree of misclassification both in the microdata and in the external data.
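As a purely illustrative sketch (not the measure of Skinner and Holmes, which this task sets out to extend), the Python function below computes Pr(r=r*|x(r), x(r*)) under assumptions that do not come from the text above: the population frequencies F of the true key values are known, misclassification is independent across units and between the microdata and the external data, and each source has a known misclassification matrix. The names match_probability, m_micro and m_ext are hypothetical. Under these assumptions the probability equals the sum over true values t of F[t] * m_micro[t][x] * m_ext[t][x*], divided by the product of the expected numbers of population units recorded as x in the microdata and as x* in the external data. With no misclassification it reduces to 1/Fx(r).

```python
def match_probability(pop_freq, m_micro, m_ext, x, x_star):
    """Illustrative Pr(r = r* | x(r), x(r*)) under independent misclassification.

    pop_freq[t]    : assumed-known population frequency of true key value t
    m_micro[t][x]  : assumed Pr(microdata records x | true value t)
    m_ext[t][x]    : assumed Pr(external data records x | true value t)
    x, x_star      : recorded key values of record r and external unit r*
    """
    values = range(len(pop_freq))
    # Expected number of units whose recorded values are (x, x_star) in both sources.
    both = sum(pop_freq[t] * m_micro[t][x] * m_ext[t][x_star] for t in values)
    # Expected number of units recorded as x in the microdata.
    as_x = sum(pop_freq[t] * m_micro[t][x] for t in values)
    # Expected number of units recorded as x_star in the external data.
    as_x_star = sum(pop_freq[t] * m_ext[t][x_star] for t in values)
    if as_x == 0 or as_x_star == 0:
        raise ValueError("recorded value has zero probability under the model")
    return both / (as_x * as_x_star)


# Hypothetical two-category key with population frequencies 4 and 10.
POP_FREQ = [4, 10]
NO_ERROR = [[1.0, 0.0], [0.0, 1.0]]
WITH_ERROR = [[0.9, 0.1], [0.2, 0.8]]

risk_exact = match_probability(POP_FREQ, NO_ERROR, NO_ERROR, 0, 0)
risk_noisy = match_probability(POP_FREQ, WITH_ERROR, WITH_ERROR, 0, 0)
```

In this hypothetical example the risk without misclassification is 1/4, and introducing misclassification in both sources lowers it, which is the kind of protection effect the measures are meant to quantify.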
In the absence of measurement error, Skinner and Holmes (1998) consider a simple
measure of risk for records which are unique in the sample with respect to some
categorical key variables. The measure is given by exp(-(1-π)f/π), where π is the sampling
fraction and f is a fitted frequency for the combination of key variable values
of the given record. The measure may be interpreted as the estimated probability
that the variable combination is unique in the population. This is the simplest
measure which will be considered in the framework of µ-ARGUS. The computation of the
fitted frequencies would require some iterated proportional fitting. The measure could
be extended to records which are not unique in the sample.
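The computation described above can be sketched as follows. This is a minimal illustration, assuming a two-way table of two categorical key variables and an independence model fitted to the sample margins; the model actually used for the fitted frequencies is not specified here, and the function names ipf_two_way and sample_unique_risk are hypothetical. The first function performs iterative proportional fitting, the second evaluates exp(-(1-π)f/π) for a sample-unique record.

```python
import math


def ipf_two_way(row_totals, col_totals, max_iter=100, tol=1e-9):
    """Fit a two-way table to the given margins by iterative proportional fitting.

    Starts from a table of ones, so the result is the independence fit.
    The row and column totals are assumed to have the same grand total.
    """
    n_rows, n_cols = len(row_totals), len(col_totals)
    fitted = [[1.0] * n_cols for _ in range(n_rows)]
    for _ in range(max_iter):
        # Scale each row to match its target total.
        for i in range(n_rows):
            row_sum = sum(fitted[i])
            if row_sum > 0:
                factor = row_totals[i] / row_sum
                fitted[i] = [v * factor for v in fitted[i]]
        # Scale each column to match its target total.
        for j in range(n_cols):
            col_sum = sum(fitted[i][j] for i in range(n_rows))
            if col_sum > 0:
                factor = col_totals[j] / col_sum
                for i in range(n_rows):
                    fitted[i][j] *= factor
        # Stop once the row totals are still matched after the column step.
        deviation = max(abs(sum(fitted[i]) - row_totals[i]) for i in range(n_rows))
        if deviation < tol:
            break
    return fitted


def sample_unique_risk(fitted_freq, sampling_fraction):
    """Estimated probability that a sample-unique combination is population unique."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling fraction must lie in (0, 1]")
    return math.exp(-(1.0 - sampling_fraction) * fitted_freq / sampling_fraction)


# Hypothetical sample table [[1, 3], [4, 12]]: cell (0, 0) is a sample unique.
fitted = ipf_two_way(row_totals=[4, 16], col_totals=[5, 15])
risk = sample_unique_risk(fitted[0][0], sampling_fraction=0.1)
```

The measure decreases as the fitted frequency grows and as the sampling fraction shrinks; with a sampling fraction of 1 (a full census) a sample unique is population unique by construction and the measure equals 1.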
References
Copas, J. B. and Hilton, F. J. (1990) Record linkage: statistical models for
matching computer records (with discussion). J. Roy. Statist. Soc. A, 287-320.
Kuha, J. T. and Skinner, C. J. (1997) Categorical data analysis and misclassification.
In L. Lyberg et al. (eds.) Survey Measurement and Process Quality, Wiley,
New York, 633-670.
Skinner, C. J. and Holmes, D. J. (1998) Estimating the re-identification risk per record
in microdata. J. Official Statist. 14, 361-372.
Winkler, W. E. (1998) Re-identification methods for evaluating the confidentiality of
analytically valid microdata. Research in Official Statistics 2, 87-104.
Milestones and expected result
- Development of theory for record-level measures under misclassification
- Programming of measures for methodological investigation
- Completion of numerical evaluation of methods
The expected result of the project is an improved set of methods for assessing
disclosure risk in microdata.
Task T2 (responsible Soton)
Objectives
- To apply the methods developed in Task T1 to the Labour Force Survey (a survey of EU-wide interest).
- To assess the protection afforded by sampling and measurement error.
- To study how disclosure risk depends on the level of detail of the potential key variables, especially geography and occupation.
Description of the work
The Labour Force Survey will be considered, as a survey of EU-wide interest.
The following will be done:
1. Identify potential sets of key variables.
2. Obtain best estimates of misclassification rates for these key variables
from various methodological studies.
3. Determine alternative levels of detail in the key variables.
4. Apply the record-level measures of risk developed in Task T1 to the survey data
at the different levels of detail.
5. Assess implications for the use of disclosure limitation methods in the light
of the uses of the survey and its different forms of release.
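To illustrate how the level of detail of the key variables affects risk, the sketch below counts sample-unique records under two codings of geography and occupation. It is a toy example with invented records, not Labour Force Survey data; the field names, code formats and the function count_sample_uniques are hypothetical, and counting sample uniques is only a crude stand-in for the record-level measures of Task T1.

```python
from collections import Counter


def count_sample_uniques(records, recode):
    """Count records whose recoded key combination occurs exactly once in the sample."""
    keys = [recode(record) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)


def detailed_key(record):
    # Full detail: sub-region code and four-digit occupation code.
    return (record["region"], record["occupation"])


def coarse_key(record):
    # Reduced detail: first character of region, first two digits of occupation.
    return (record["region"][0], record["occupation"][:2])


# Invented records for illustration only.
RECORDS = [
    {"region": "A1", "occupation": "2310"},
    {"region": "A1", "occupation": "2320"},
    {"region": "A2", "occupation": "2310"},
    {"region": "B1", "occupation": "5120"},
    {"region": "B1", "occupation": "5120"},
]

uniques_detailed = count_sample_uniques(RECORDS, detailed_key)
uniques_coarse = count_sample_uniques(RECORDS, coarse_key)
```

Coarsening the coding merges key combinations, so the number of sample uniques can only stay the same or fall. Here the three uniques present at full detail disappear at the coarser level.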
Milestones and expected result
- Determination of misclassification rates
- Application of record-level measures of risk to data
- Review of disclosure limitation implications
The results of the project are expected to help assess the value of record-level measures of disclosure risk and provide, through one case study, a model for the evaluation of disclosure risk in other surveys.
Task T3 (responsible IStat)
Objectives
To build into µ-ARGUS the individual disclosure risk approach for complex
(hierarchical) microdata, as defined in the Esprit project n° 20462 (SDC),
taking advantage of the developments of Task T1.
Description of the work
In order to improve the capabilities of µ-ARGUS and give a wider choice of
methodology to the user, the individual unit risk
(called record-level risk in T1 and T2) approach will be implemented in µ-ARGUS.
The programs already available in SAS as output of the SDC project will be used as
a basis to define a C procedure to estimate the individual risk.
Moreover, efficient protection algorithms will have to be developed that take into
account dependencies in the data.
Tasks to be carried out include
1. Study of the steps to be followed to include the methodology in the software:
definition of metadata, in particular key variables and their characteristics
with respect to the dependencies, input of the data, specification of
dependence structure, estimation of the individual risk according to the type
of dependence structure, identification of all factors that influence risk,
definition of the iterative procedure to obtain a safe file.
2. Preparation of a program flow chart.
3. Migration from SAS to C.
4. Integration with existing µ-ARGUS software.
5. Development of efficient protection algorithms for dependent data.
6. Testing.
7. Validation.
Milestones and expected result
The implementation of individual risk of disclosure in µ-ARGUS will widen user
choice. The resulting evaluation of disclosure risk will enable the user to measure
the safety levels reached in the microdata file.