Project Name: Synthetic Data Generation with Large, Real-World Data

Contractor: Westat

Lessons Learned

  • This project aims to generate synthetic data and use computing space across multiple agencies. Because it involves multiple stakeholders and governance requirements, the project has highlighted the need for ongoing collaboration and effective communication. The team has addressed this by holding biweekly full-team meetings and biweekly management meetings and by creating a detailed timeline with achievable milestones. The involvement of multiple entities, and the legal agreements this entails, has required significant time for preparing and reviewing documentation. The team has recognized the need to document this process comprehensively for future reference. Clearly identifying obstacles and offering solutions or recommendations will be crucial for similar future projects.
  • Synthetic data generation, which relies on a truth source, requires careful selection of variables, including assessments of missingness and levels of granularity. In addition, the data quality of variables in the truth source may affect which variables are selected to inform the synthetic data generation. The team has addressed this by involving stakeholders and expert users of the truth data in the variable selection process.
  • Establishing a process in the early stages is crucial for building a solid foundation for synthetic data generation. That process should include understanding the core data structure and choosing a generation approach based on the latest advancements, particularly those that accommodate the data’s unique characteristics. The adaptive process used in this project draws on discussions with subject matter experts and insights from the literature, and it was necessary to address two main challenges: the complex data structure (e.g., longitudinal patient history) and high dimensionality.
  • Leveraging existing open-source methods that were vetted by subject matter experts and released under proper software licenses adds to the strong foundation for the open-source synthetic data generation tool. In this project, specifically, the team consulted with subject matter experts and successfully secured appropriate licenses for all open-source tools considered while exploring alternative methods. Utilizing pre-existing code that generates data of similar structure offers transparency and builds on previous lessons learned.
  • Early assessment of scalability and interoperability requirements, together with alignment with national initiatives in advanced computing, will enhance the efficiency of code deployment. This will, in turn, accelerate meeting public needs in a shared service environment. In this project, the team is evaluating various software options to ensure these requirements are satisfied.
  • Implementing project risk mitigation plans and exploring ways to streamline data access are encouraged in future synthetic data generation efforts. Within this project, cross-agency collaboration for data access in a secure supercomputer environment requires coordination with project leads and legal teams. While this has posed some risk to the project timeline, the team has developed risk mitigation strategies. These strategies include working with public-use data in the secure supercomputer environment so that the team can become familiar with the environment and troubleshoot programs and code, ensuring they are ready when all agreements are signed and the real-world data becomes available.
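
As an illustration of the variable selection lesson above (assessing missingness and level of granularity in the truth source), the following is a minimal Python sketch using pandas. It is not the project's actual method: the function name `screen_variables`, the thresholds, and the toy column names are all hypothetical, and in practice such a summary would only inform the discussion with stakeholders and expert users rather than replace it.

```python
import pandas as pd


def screen_variables(df, max_missing=0.3, max_levels=50):
    """Summarize missingness and granularity per column (hypothetical helper).

    max_missing and max_levels are illustrative thresholds, not project values.
    """
    summary = pd.DataFrame({
        # Fraction of missing values in each column
        "missing_frac": df.isna().mean(),
        # Number of distinct non-missing values (a rough granularity measure)
        "n_levels": df.nunique(dropna=True),
    })
    # Flag columns that pass both illustrative thresholds
    summary["keep"] = (
        (summary["missing_frac"] <= max_missing)
        & (summary["n_levels"] <= max_levels)
    )
    return summary


# Toy data with made-up column names, for illustration only
toy = pd.DataFrame({
    "age_group": ["18-34", "35-64", "65+", "35-64"],
    "diagnosis_code": ["A1", None, None, "B2"],
    "patient_id": [1, 2, 3, 4],
})

report = screen_variables(toy, max_missing=0.3, max_levels=3)
print(report)
```

In this toy case, the half-missing `diagnosis_code` column and the fully granular `patient_id` column are flagged for review, while `age_group` passes both checks.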

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.