Project Name:

Creation of Synthetic Data and Development and Use of Verification Metrics (Survey of Earned Doctorates)

Contractor: The Urban Institute

Lessons Learned

In preparation for creating synthetic SED data, the team worked together to prepare an outline of the gold standard file, which will be used to inform the creation of the synthetic data file. The lesson learned is the importance of team consensus on the gold standard file, including drawing on the multi-disciplinary nature of the team to account for different perspectives and expertise.

The main insights that emerged from the focus groups are:

  • When creating a synthetic data file, it is important to consult stakeholders about the key variables needed for a wide variety of analytic research.
  • Tiered access to data can be beneficial for educational uses such as teaching, and it offers the opportunity to develop and test analysis code.
  • Users find detailed documentation about the methodology for creating a synthetic data file to be helpful. In addition, validation metrics could be beneficial for users conducting complex statistical analyses.
  • When planning for disclosure risk assessment of synthetic data, it is important to identify data sources that may have been released in other formats and could increase disclosure risk.

From the “SEDSyn Data User Focus Group Report,” we learned that:

  • Users see potential for synthetic data use in education, training, debugging, developing initial research plans, and as an intermediate step before requesting secure data access.
  • User education materials and standards need to be put in place to ensure proper buy-in/adoption for other uses of synthetic data.
  • Verification/validation servers could serve as another tier of secure data access, not a replacement for it.

When creating synthetic data files, it is necessary to consider both utility and disclosure risk. For this project, that has meant investigating the levels of granularity that are possible for key variables. Because the Survey of Earned Doctorates (SED) is an annual census of research doctorate recipients from accredited U.S. institutions, using a single year of data may pose additional disclosure risks. The team is investigating alternative approaches to reduce disclosure risks while maintaining analytic utility. As this investigation continues, there will be additional research and continued internal discussion about the planned uses for the synthetic data files and the balance between utility and privacy. This project has been successful in identifying key questions that need to be addressed as synthetic data files are produced, which can inform future synthetic data projects.
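
To make the granularity trade-off concrete, the minimal sketch below (illustrative only, not the project's actual code) counts how many records become unique on a set of key variables at two candidate levels of detail. The toy data and all column names are hypothetical stand-ins for SED fields.

    import pandas as pd

    # Toy stand-in for a single year of SED-like microdata; every column
    # name and value here is a hypothetical placeholder, not an actual
    # SED field.
    df = pd.DataFrame({
        "field_detailed": ["astrophysics", "condensed matter", "econometrics",
                           "labor economics", "sociology", "topology"],
        "field_broad":    ["physical sciences", "physical sciences", "social sciences",
                           "social sciences", "social sciences", "mathematics"],
        "institution":    ["Univ A", "Univ A", "Univ B", "Univ B", "Univ B", "Univ A"],
        "phd_year":       [2020] * 6,
    })

    def count_sample_uniques(data: pd.DataFrame, key_vars: list) -> int:
        """Count records whose combination of key variables is unique.

        Because the SED is a census, a sample-unique combination is also
        (approximately) population-unique, which is why uniqueness is a
        useful first-pass disclosure screen for a single year of data.
        """
        cell_sizes = data.groupby(key_vars).size()
        return int((cell_sizes == 1).sum())

    # Compare uniqueness at a detailed and a coarsened granularity.
    for keys in (["field_detailed", "institution", "phd_year"],
                 ["field_broad", "institution", "phd_year"]):
        n = count_sample_uniques(df, keys)
        print(f"{keys}: {n} unique record(s) out of {len(df)}")

In practice, a check like this would be run on the confidential file across many candidate recodings, alongside formal disclosure risk metrics, before settling on the granularity of the released synthetic variables.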

  • Generating synthetic data is a common method for providing access to data while preserving the privacy of participants, and it is one of many privacy-enhancing technologies (PETs) for providing safe access to confidential data through a National Secure Data Service. For any PET, a suite of utility metrics should be implemented to ensure a balance between disclosure risk and utility. One type of utility metric is analysis-specific utility, which measures the similarity of results between the confidential and synthetic datasets for one or more specific analyses. In other words, these metrics assess whether a researcher reaches the same conclusions regardless of whether the analyses are applied to the confidential or the synthetic dataset (a minimal sketch of one such metric appears after this list). For this project, we reproduced analyses from peer-reviewed papers using the confidential data; doing so requires detailed information about each original analysis. This information is essential for assessing whether synthetic data can replicate results similar to those from the restricted-use file, and it can and should include code, model specifications, data processing decisions, and other details needed to ensure that analysis-specific utility testing reflects a realistic process. However, these papers do not always provide this information, which makes testing analysis-specific utility measures more difficult.
  • The National Secure Data Service will involve data curators, researchers, data practitioners, and public policymakers with diverse technical backgrounds and subject matter expertise at all levels of government. Given these varied backgrounds, a common taxonomy will be crucial to ensure effective communication across the different groups. For instance, this project team includes statisticians, data scientists, survey methodologists, and social scientists with different technical backgrounds in PETs. We found it helpful to establish operational definitions of terms such as “key variables” and “sensitive variables” when planning for disclosure risk and utility analyses. Although these definitions are subjective and may be imperfect, providing them allows the disclosure risk metrics to be based on clear assumptions. The operational definitions we use in the context of SED synthetic data generation are the following:

    • Key variables: Information about an individual that could be reasonably gathered externally and accurately from professional websites (e.g., LinkedIn), organization papers, or resumes/CVs.
    • Sensitive variables: Information about an individual that is not readily available externally and that the individual would be more likely to want to keep private.
  • As mentioned in the previous lessons learned, we must balance disclosure risk and utility. For this project, we learned the importance of conducting a thorough landscape scan of publicly available data sources and information that a malicious actor could potentially use to disclose sensitive information from a synthetic version of the Survey of Earned Doctorates data. For example, we need to assess the probability that a unique record, given certain characteristics, could be linked to or identified by an external source.
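
To make analysis-specific utility concrete, the minimal sketch below (illustrative only) refits the same regression on two files and computes the 95% confidence-interval overlap for each coefficient, one commonly used analysis-specific utility measure. The simulated data, the formula, and all variable names are assumptions for demonstration; they are not the project's actual models or the SED data.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(seed=12345)

    def simulate(n=500, slope_shift=0.0):
        # Toy generator standing in for the confidential and synthetic files.
        x = rng.normal(size=n)
        y = 1.0 + (2.0 + slope_shift) * x + rng.normal(size=n)
        return pd.DataFrame({"y": y, "x": x})

    confidential = simulate()
    synthetic = simulate(slope_shift=0.1)  # slightly distorted stand-in

    def ci_overlap(conf_data, synth_data, formula="y ~ x"):
        """Average relative overlap of 95% CIs from the two fitted models.

        For each coefficient, the intersection of the two confidence
        intervals is measured relative to each interval's own length and
        the two ratios are averaged; 1 means identical intervals, 0 means
        no overlap.
        """
        ci_conf = smf.ols(formula, data=conf_data).fit().conf_int()
        ci_synth = smf.ols(formula, data=synth_data).fit().conf_int()
        scores = {}
        for term in ci_conf.index:
            lo = max(ci_conf.loc[term, 0], ci_synth.loc[term, 0])
            hi = min(ci_conf.loc[term, 1], ci_synth.loc[term, 1])
            intersection = max(0.0, hi - lo)
            len_conf = ci_conf.loc[term, 1] - ci_conf.loc[term, 0]
            len_synth = ci_synth.loc[term, 1] - ci_synth.loc[term, 0]
            scores[term] = 0.5 * (intersection / len_conf
                                  + intersection / len_synth)
        return scores

    print(ci_overlap(confidential, synthetic))

An overlap near 1 for every coefficient suggests an analyst would draw similar inferences from either file for that model; in the project setting, the same comparison would be run for each analysis reproduced from the peer-reviewed papers.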

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.