Statistical Disclosure Control
Contact: Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060

Last update: 10 Oct 2011
FAQ: Frequently asked questions

On this page you will find some frequently asked questions. If you cannot find the answer to your question here, please send your question to the European SDC-expert team.

Microdata
m1. Synthetic data
m2. What is microaggregation?
m3. I only want to apply microaggregation to several continuous scaled variables. In this case, do I have to provide a metadata (rda) file as well?
m4. Could I also use µ-ARGUS if I have population data (no sampling weights)?
m5. When working with large datasets with many variables, creating a metadata file manually is very time consuming. Is there a way to get metadata files created automatically by common statistical programs like SAS?
m6. Is there a limit to the number of variables or cases µ-ARGUS can handle?
m7. Is it possible to run methods in µ-ARGUS on subsets of the data?

Tabular data
t1. Thresholds in sensitivity rules
t2. Licences for τ-ARGUS
t3. The frequency rule, the p% rule and the (n,k)-rule implemented in τ-ARGUS
t4. What helps to reduce processing time of secondary cell suppression in τ-ARGUS?
t5. Which sensitivity rule should I use?
t6. Which of the secondary cell suppression algorithms of τ-ARGUS should I use: Modular, Hypercube, Network, or Optimal?
t7. What is the "singleton problem"?

Other
o1. In which directory can the test data be found after installation of ARGUS?
o2. What are the recommended hardware and software components?

Synthetic data

QUESTION: Synthetic data only preserve the statistical properties explicitly captured by the model used by the data protector to synthesize the data. Why not directly publish the statistics one wants to preserve, or simply the parameters of the imputation model used to synthesize the data, rather than release a synthetic microdata set?

ANSWER: There are several reasons for still releasing a synthetic microdata set:
1) Synthetic data are normally generated using more information on the original data than is specified in the model whose preservation is guaranteed by the data protector releasing the synthetic data.
2) As a consequence of the above, synthetic data may offer utility beyond the models they exactly preserve.
3) It is impossible to anticipate all possible statistics an analyst might be interested in, so access to the microdata set should be granted.
4) Not all users of a public use file will have a sound background in statistics. Some users might only be interested in some descriptive statistics and are happy if they know the right commands in their statistical package to get what they want. They would not be able to generate these results if only the parameters were provided.
5) The imputation models in most applications can be very complex, because different models are fitted for every variable and often for different subsets of the dataset. This might lead to hundreds of parameters for just one variable. It is therefore much more convenient, even for the skilled user of the data, to have the synthesized dataset available.
6) The most important reason for not releasing the parameters is that the parameters themselves could be disclosive on some occasions. For that reason, only some general statements about the generation of the public use file should be released. For example, these general statements could indicate which variables were included in the imputation model, but not the exact parameters. The user can then judge whether her analysis would be covered by the imputation model, but she will not be able to use the parameters to disclose any confidential information.
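To make the idea of model-based synthesis concrete, here is a minimal sketch of one common approach: fit a simple parametric model (here, a multivariate normal) to the original continuous variables and sample synthetic records from it. This is an illustration only, not the procedure used for any particular public use file; the stand-in data, the seed and the choice of model are assumptions.

    import numpy as np

    rng = np.random.default_rng(12345)

    # Stand-in for the original confidential microdata: 1000 records,
    # three continuous variables (made-up values).
    original = rng.multivariate_normal(
        mean=[50.0, 30.0, 12.0],
        cov=[[25.0, 5.0, 2.0],
             [5.0, 16.0, 1.0],
             [2.0, 1.0, 4.0]],
        size=1000,
    )

    # Fit the synthesis model: here simply the sample mean vector and
    # covariance matrix of the original data.
    mu = original.mean(axis=0)
    sigma = np.cov(original, rowvar=False)

    # Draw a synthetic dataset of the same size from the fitted model.
    # Means and covariances are preserved in expectation; any statistic
    # not captured by the model (e.g. skewness) is not.
    synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=len(original))

    print("original means :", original.mean(axis=0).round(2))
    print("synthetic means:", synthetic.mean(axis=0).round(2))

Note how this illustrates point 1) above: the fitted mean and covariance are computed from the full original data, so the synthetic file reflects more of the original than just the statistics one would publish directly.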
Thresholds in sensitivity rules

QUESTION: What threshold provides good protection in the case of sensitivity rules? Or, in other words: what are the factors to consider when fixing the threshold for a sensitivity rule such as the dominance rule or the p% rule, and what is the (positive or negative) influence of each factor on the threshold?

ANSWER: In general we advise to use the p% rule. The larger the p, the stricter the protection. Typical values for p are between 5 and 15. There are many different arguments for choosing the value of p. There can be legal restrictions. Also the sensitivity of the table can be an argument for choosing a smaller or larger value of p. The SDC handbook is a valuable source for further reading on the sensitivity rules.

Microaggregation

QUESTION: What is microaggregation?

ANSWER: Microaggregation is a family of masking methods for statistical disclosure control of numerical microdata (although variants for categorical data exist). The rationale behind microaggregation is that the confidentiality rules in use allow publication of microdata sets if records correspond to groups of k or more individuals, where no individual dominates (i.e. contributes too much to) the group and k is a threshold value. Strict application of such confidentiality rules leads to replacing individual values with values computed on small aggregates (microaggregates) prior to publication. This is the basic principle of microaggregation.

To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and is used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published.

The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, since microaggregation replaces values in a group by the group centroid. The sum of squares criterion is commonly used to measure homogeneity in clustering. The within-groups sum of squares is defined as

SSE = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)' (x_{ij} - \bar{x}_i),

where g is the number of groups, n_i is the size of group i, x_{ij} is the j-th record in group i and \bar{x}_i is the centroid (average record) of group i. The lower SSE, the higher the within-group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE.
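The following is a minimal sketch of the principle just described, for the simplest case: univariate microaggregation with fixed group size k. Records are sorted, consecutive groups of k are formed (the last group absorbs any remainder), each value is replaced by its group mean, and the SSE of the resulting partition is reported. This toy code is ours for illustration; it is not the algorithm implemented in µ-ARGUS, and practical multivariate microaggregation uses heuristics such as MDAV.

    import numpy as np

    def microaggregate_univariate(values, k):
        """Replace each value by the mean of its group of (at least) k
        consecutive values in sorted order; return the masked data and
        the within-groups sum of squares SSE of the partition."""
        values = np.asarray(values, dtype=float)
        n = len(values)
        assert 1 <= k <= n, "group size k must be between 1 and n"
        order = np.argsort(values)
        g = n // k                  # number of groups; the last one takes the remainder
        masked = np.empty_like(values)
        sse = 0.0
        for i in range(g):
            lo = i * k
            hi = (i + 1) * k if i < g - 1 else n
            group = values[order[lo:hi]]
            centroid = group.mean()
            masked[order[lo:hi]] = centroid
            sse += ((group - centroid) ** 2).sum()
        return masked, sse

    # Toy example: turnover values of 10 enterprises, masked with k = 3.
    turnover = [12.0, 95.0, 14.0, 13.0, 90.0, 11.0, 50.0, 52.0, 49.0, 97.0]
    masked, sse = microaggregate_univariate(turnover, k=3)
    print(masked)   # each value replaced by its group mean
    print(sse)      # information loss of this k-partition

Because the groups are formed from similar (sorted, adjacent) values, the SSE stays small; any other partition into groups of at least 3 would give an SSE at least as large as the optimal one.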
For a microdata set consisting of p attributes, these can be microaggregated together or partitioned into several groups of attributes. Also the way to form groups may vary. Several taxonomies are possible to classify the microaggregation algorithms in the literature: i) fixed group size vs variable group size; ii) exact optimal (only for the univariate case) vs heuristic microaggregation; iii) continuous vs categorical microaggregation.

Microaggregation has recently been proposed as an option to generate hybrid data, combining original data and synthetic data. The idea is to form small aggregates of k records and then, rather than replacing the records in an aggregate by an average, replace them by synthetic records preserving the means and covariances of the original records in the aggregate.

For an illustrative example of the application of microaggregation to real data, see also the case study section of the ESSNet on SDC web page at: http://neon.vb.cbs.nl/casc/ESSNet/Case studies B2.pdf

References: The ESSNet Handbook on SDC gives further reading on this subject.
Josep Domingo-Ferrer, 'Microaggregation', entry for the Encyclopedia of Database Systems, New York: Springer-Verlag, 2009, pp. 1736-1737. ISBN 978-0-387-35544-3.
Josep Domingo-Ferrer, 'Microaggregation-based numerical hybrid data', in Joint UNECE/EUROSTAT Work Session on Statistical Disclosure Control, Bilbao, Basque Country, Dec. 2-4, 2009.

Licences for τ-ARGUS

QUESTION: Why do we need licences for using τ-ARGUS?

ANSWER: Both µ-ARGUS and τ-ARGUS are the result of a fruitful European cooperation during several projects. The software can be freely downloaded and used; there are no restrictions here. However, the complex mathematical models behind cell suppression and controlled rounding require solving large optimisation problems. For that we rely on high-quality solvers like Xpress and CPLEX. Since version 4.1.0 we have also included the open solver SoPlex, which is free to use for academics and European NSIs. For the use with τ-ARGUS we have negotiated friendly prices with Xpress; contact support@fico.com about this. Note that for large(r) instances the open solver might not be powerful enough.

Location of the test datasets

QUESTION: In which directory can the test data be found after installation of ARGUS?

ANSWER: After installation of both µ-ARGUS and τ-ARGUS some test data will be installed as well. These datasets can be found in a subdirectory below the software. The software will usually be installed in a subdirectory (mu_argus or tauargus) of "C:\program files", but of course during the installation you are free to choose a different location.

The frequency rule, the p% rule and the (n,k)-rule implemented in τ-ARGUS

QUESTION: What is the interpretation of the thresholds of the minimum frequency rule, the p% rule and the (n,k)-rule in τ-ARGUS?

ANSWER: If you specify a freq=3 rule, the interpretation is that a cell with frequency 3 is still considered unsafe; a cell with frequency 4 will then be safe. For the p% rule and the (n,k)-rule the interpretation is that a cell is unsafe if the value of the sensitivity rule is above the threshold. If the value is equal to the threshold, the protection levels would become zero; no protection would then be needed, and therefore the cell is considered safe. For more information on protection levels see the SDC handbook. A minimal sketch of these checks is given below.
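To make the threshold interpretation concrete, here is a minimal sketch of the three rules in Python. It is an illustration of the definitions above, not τ-ARGUS code; the function names are ours, and the (n,k) and p% formulas follow the standard textbook definitions, with contributions sorted in descending order. The strict inequalities reflect the interpretation just given: a cell whose sensitivity value exactly equals the threshold is safe.

    def unsafe_frequency(contributions, freq=3):
        """Minimum frequency rule: a cell with freq contributors or
        fewer is unsafe; freq + 1 contributors is safe."""
        return len(contributions) <= freq

    def unsafe_dominance(contributions, n=1, k=75):
        """(n,k)-dominance rule: unsafe if the n largest contributions
        exceed k% of the cell total (equal to the threshold is safe)."""
        xs = sorted(contributions, reverse=True)
        return 100 * sum(xs[:n]) > k * sum(xs)

    def unsafe_p_percent(contributions, p=10):
        """p% rule: unsafe if the second-largest contributor can
        estimate the largest one to within p%, i.e. if the sum of all
        contributions except the two largest is less than p% of the
        largest contribution."""
        xs = sorted(contributions, reverse=True)
        remainder = sum(xs[2:])
        return 100 * remainder < p * xs[0]

    cell = [900, 80, 15, 5]          # one dominating contributor
    print(unsafe_frequency(cell))    # False: 4 contributors > freq=3
    print(unsafe_dominance(cell))    # True: 900 exceeds 75% of 1000
    print(unsafe_p_percent(cell))    # True: 15 + 5 = 20 < 10% of 900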
Subsets of the data in µ-ARGUS

QUESTION: Is it possible to run methods in µ-ARGUS on subsets of the data?

ANSWER: No. This is considered preparation of the data and is not a task for µ-ARGUS. It would complicate µ-ARGUS too much, and much better tools are available for it.

Metadata if only microaggregation is required

QUESTION: I only want to apply microaggregation to several continuous scaled variables. In this case, do I have to provide a metadata (rda) file as well?

ANSWER: Yes. µ-ARGUS always requires some metadata: it needs to know where the variables you are interested in are located in the datafile. But when using fixed format microdata, you only have to specify in the RDA-file the variables you will use for your job. The remaining, non-specified data will be copied to the output file as-is.

Population files

QUESTION: Could I also use µ-ARGUS if I have population data (no sampling weights)?

ANSWER: Yes, most methods in µ-ARGUS are still valid when using population files. Only the risk model has been developed especially for sample files.

Large datasets with many variables

QUESTION: When working with large datasets with many variables, creating a metadata file manually is very time consuming. Is there a way to get metadata files created automatically by common statistical programs like SAS?

ANSWER:
1. When working with fixed format ASCII files, only the variables actively used in µ-ARGUS need to be described in the RDA-file. The remaining, non-described parts will be copied as-is to the output file by µ-ARGUS.
2. When working with SPSS a similar procedure applies: only the needed variables are exported from SPSS.
3. When working with SAS, there are two options. Either the user exports the necessary data to a comma-separated file with the variable names on the first row; in that case µ-ARGUS can read this first line and create a first version of the RDA-file, which can then be extended. Alternatively, the SAS procedures developed during the ESSNet can be used. There are no plans to extend µ-ARGUS to read SAS files directly; we think that these export procedures do a better job. A small sketch of the CSV route is given below.
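As an illustration of the comma-separated route, the following sketch writes a dataset to a CSV file with the variable names on the first row, which is the form µ-ARGUS can read to bootstrap an RDA-file. The data, the file name and the variable names are made up for the example, and in practice this export would be done from SAS or SPSS; the sketch is not a substitute for the ESSNet SAS procedures.

    import csv

    # Hypothetical extract containing only the variables actually
    # needed in µ-ARGUS.
    rows = [
        {"REGION": "01", "SEX": 1, "AGE": 34, "INCOME": 28000},
        {"REGION": "02", "SEX": 2, "AGE": 51, "INCOME": 41000},
        {"REGION": "01", "SEX": 2, "AGE": 27, "INCOME": 19500},
    ]
    fieldnames = ["REGION", "SEX", "AGE", "INCOME"]

    # Variable names go on the first row: µ-ARGUS reads this header
    # line to create a first version of the RDA metadata, which you
    # then extend by hand (e.g. with hierarchies and missing values).
    with open("microdata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)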
µ-ARGUS limits

QUESTION: Is there a limit to the number of variables or cases µ-ARGUS can handle?

ANSWER:
1. The maximum number of variables in the RDA-file is 450; more than enough in normal situations.
2. There is no real limit to the number of cases. Of course, larger data files require more processing time, but many steps like global recoding are done after the initial computations. These steps only require aggregated information and will be done very quickly.

Hardware and software requirements

QUESTION: What are the hardware and software requirements for ARGUS?

ANSWER: Both µ-ARGUS and τ-ARGUS require a Windows PC (Windows 2000 or later). 2 GB of RAM is usually enough.

Reducing processing time in τ-ARGUS

QUESTION: What helps to reduce processing time of secondary cell suppression in τ-ARGUS?

ANSWER: If you use the Modular method, introducing a hierarchical structure into the data usually helps a lot to reduce computational complexity, and hence computing time. Smaller numbers of categories per equation will make the program run faster. For the Hypercube method this is less of an issue; there it is generally good to avoid structures with very few categories within a relation. That method has been developed for large tables, where the number of categories per relation is usually less than 20.

Which sensitivity rule

QUESTION: Which sensitivity rule should I use?

ANSWER: A minimum frequency rule is often used. Cells with only one contribution are obviously risky, and so are cells with two contributions, since the two contributors (often each other's competitors) could reveal each other's contribution. But cells with a few large (dominating) contributors can also be risky. Traditionally the dominance (n,k)-rule is used: the sum of the largest n contributors should not be more than k% of the cell total. Recently the p%-rule is used more often. This rule states that no contributor to a cell should be able to estimate another contributor's value to within p%. It focuses more on the real problem and also behaves better in the case of waivers. Therefore the p%-rule is recommended; a sketch of both rules is given after the frequency rule question above. For further reading we refer to the SDC handbook.

Choice of secondary cell suppression algorithm

QUESTION: Which of the secondary cell suppression algorithms of τ-ARGUS should I use: Modular, Hypercube, Network, or Optimal?

ANSWER: In a real (i.e. production) context, Network and Optimal should not be used, because they do not prevent "singleton problems". In terms of information loss, the experience is that for typical tables of business statistics (large tables with detailed hierarchical structures) Modular gives much better results than Hypercube (30%-50% fewer suppressions). This justifies both the (much!) longer computation times and the extra expense of a commercial licence (LP solver Xpress or CPLEX) that are necessary for using Modular. Note that an open solver (SoPlex) is included for use by academics and European NSIs, but that solver may not be powerful enough for large(r) instances.

Singleton problem

QUESTION: What is the "singleton problem"?

ANSWER: The problem that occurs when there is a row (or column) of the table in which only two cells are suppressed (with an unsuppressed row total) and both cells provide information that relates to a single respondent. Each of these single respondents ("singletons") could then use his special knowledge of his "own" cell value to disclose the value of the other suppressed cell, e.g. the value relating to the other single respondent, by differencing the published total, his own value and the values of the other published cells.
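A tiny numerical illustration of this differencing attack, with made-up values: a row with four interior cells and a published total, in which two cells (each relating to a single respondent) are suppressed.

    # Published row: total and the two unsuppressed interior cells.
    row_total = 1000
    published_cells = [300, 150]

    # The singleton knows the value of his own suppressed cell.
    own_value = 420

    # He recovers the other suppressed cell by simple differencing:
    other_suppressed = row_total - sum(published_cells) - own_value
    print(other_suppressed)   # 130: the other respondent's value

This is why algorithms that merely suppress two cells per row, without checking who contributes to them, do not protect singletons; the Modular and Hypercube methods in τ-ARGUS contain measures against this problem.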