Project Name:
Synthetic Data Generation with Large, Real-World Data
Contractor: Westat
Lessons Learned
- This project aims to generate synthetic data and utilize computing space across multiple agencies. Given the involvement of multiple stakeholders and governance requirements, this project has highlighted the need for ongoing collaboration and effective communication. The team has addressed this by holding biweekly full-team meetings and biweekly management meetings, and by creating a detailed timeline with achievable milestones. The involvement of multiple entities and the corresponding legal agreements have required significant time to ensure proper documentation preparation and review. The team has recognized the need for comprehensive documentation of this process for future reference. Clearly identifying obstacles and offering solutions or recommendations will be crucial for similar future projects.
- Synthetic data generation, which relies on a truth source, requires careful selection of variables, including assessments of missingness and levels of granularity (an illustrative profiling sketch follows this list). In addition, the data quality of the variables in the truth source may affect which variables are selected to inform the synthetic data generation. The team has addressed this by involving stakeholders and expert users of the truth data to help inform the variable selection process.
- Establishing a process in the early stages that includes understanding the core data structure and determining the synthetic data generation approach based on the latest advancements, particularly those that accommodate the data’s unique characteristics, is crucial for building a solid foundation for synthetic data generation. The adaptive process within this project drew on discussions with subject matter experts and insights from the literature and was necessary to address the two main challenges: the complex data structure (e.g., longitudinal patient history) and high dimensionality (a sketch of one way to organize longitudinal records appears after this list).
- Leveraging existing open-source methods that were vetted by subject matter experts and released under proper software licenses strengthens the foundation for the open-source synthetic data generation tool. In this project, the team consulted with subject matter experts and secured appropriate licenses for all open-source tools considered while exploring alternative methods. Utilizing pre-existing code that generates data of similar structure offers transparency and builds on previous lessons learned.
- Early assessment of scalability and interoperability requirements, and alignment with national initiatives in advanced computing, will enhance the efficiency of code deployment. This will, in turn, accelerate meeting public needs in a shared service environment. Within this project, the team has evaluated various software options to ensure these requirements are satisfied.
- Implementing project risk mitigation plans and exploring ways to streamline data access are encouraged in future synthetic data generation efforts. Within this project, cross-agency collaboration for data access in a secure supercomputer environment requires coordination with project leads and legal teams. While this has posed some risk to the project timeline, the team has developed risk mitigation strategies. These include working with public-use data in the secure supercomputer environment to become familiar with the environment and to troubleshoot programs and code, ensuring the team is ready when all agreements are signed and the actual real-world data becomes available.
- The National Clinical Cohort Collaborative (N3C) COVID enclave environment works well for a limited dataset; however, synthesizing the high-dimensional data requires a secure supercomputing environment, such as that offered by Oak Ridge National Laboratory (ORNL), that can handle computationally intensive tasks. While the N3C COVID enclave supports secure, sequential processing on slices of data, it has limitations in RAM and concurrent processing (a sketch of slice-by-slice processing appears after this list). This reinforced the importance of early evaluation of computational needs, workflow design, and performance testing to ensure alignment with technical and operational requirements.
- Engaging with support resources for the ORNL secure enclave environment has been highly beneficial. Regular use of office hours, technical documentation, and support channels has provided insights into system capabilities and limitations as well as data handling. These resources greatly improved troubleshooting, informed workflow design, and enhanced data preparedness. This experience highlights the importance of consistent, accessible support mechanisms and active engagement with them, particularly in complex environments. For future National Secure Data Service (NSDS) shared services, concierge-style guidance will be essential, especially when navigating advanced AI tools in unfamiliar platforms.
- Navigating interagency and subcontracting agreements requires sustained coordination across legal, administrative, and technical teams. Targeted agreements for specific project components have been more feasible than attempting one comprehensive agreement. For example, initiating an agreement with ORNL covering only synthetic data enabled work to begin while broader negotiations for access to the true data continue. This phased approach has enabled steady progress and is informing best practices for agreement development in multi-agency collaborations.
- The complexity of the N3C data reinforced the importance of using advanced AI tools capable of modeling nuanced temporal and relational structures. Open-source Large Language Models (LLMs), built on transformer architectures, offer a flexible, efficient way to capture complex dependencies without requiring proprietary software or predefined relationship specifications. Their ability to capture complex dependency structures in relational tables, where traditional AI models such as generative adversarial networks struggle, and to incorporate custom tokenization makes LLMs especially well suited to this context (a sketch of one possible tokenization scheme appears after this list). Ongoing evaluation of emerging tools remains critical to ensuring scalability and methodological alignment with synthetic data generation best practices.
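
The sketch below is a minimal, hypothetical illustration (not project code) of profiling candidate variables in a truth-source table for missingness and granularity, as discussed in the variable selection lesson above. It assumes a Python environment with pandas available, and all column names are made up.

# Hypothetical sketch: profile candidate truth-source variables for
# missingness and granularity before selecting inputs for synthesis.
import pandas as pd

def profile_candidate_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column's missingness and level of granularity."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": df.isna().mean().round(3),   # share of missing values
        "n_unique": df.nunique(dropna=True),        # distinct non-missing values
    })
    # Crude granularity indicator: distinct values per non-missing record.
    summary["granularity"] = (summary["n_unique"] / df.notna().sum()).round(3)
    return summary.sort_values("pct_missing", ascending=False)

# Toy truth-source extract with made-up fields.
truth = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "diagnosis_code": ["E11.9", None, "I10", "E11.9"],
    "visit_date": ["2021-01-04", "2021-02-11", None, "2021-03-30"],
})
print(profile_candidate_variables(truth))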
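
The sketch below is a hypothetical illustration (not project code) of one way to organize event-level longitudinal records, such as patient history, into per-patient, time-ordered sequences, the kind of core data structure referred to in the lesson on establishing an early process. It assumes pandas, and the table and column names are made up.

# Hypothetical sketch: reshape event-level longitudinal records into
# time-ordered sequences, one per patient.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "event_date": ["2021-01-04", "2021-01-20", "2021-03-02", "2021-02-11", "2021-02-18"],
    "event_type": ["diagnosis", "lab", "medication", "diagnosis", "lab"],
    "event_code": ["U07.1", "94500-6", "RX123", "J18.9", "94500-6"],
})
events["event_date"] = pd.to_datetime(events["event_date"])

# One ordered list of (event_type, event_code) pairs per patient.
sequences = (
    events.sort_values(["patient_id", "event_date"])
          .groupby("patient_id")[["event_type", "event_code"]]
          .apply(lambda g: list(g.itertuples(index=False, name=None)))
)
print(sequences.to_dict())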
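
The sketch below is a hypothetical illustration (not project code) of processing a large table in sequential slices to stay within memory limits, in the spirit of the secure, sequential slice-by-slice processing described in the enclave lesson above. The file path, column name, and chunk size are made up.

# Hypothetical sketch: aggregate a large file slice by slice so that only
# one chunk and a small running summary are ever held in memory.
import pandas as pd

CHUNK_ROWS = 100_000  # tune to the RAM available in the environment
running_counts = {}

for chunk in pd.read_csv("condition_occurrence.csv", chunksize=CHUNK_ROWS):
    # Per-slice work: tally condition codes, keeping only the aggregate.
    for code, n in chunk["condition_code"].value_counts().items():
        running_counts[code] = running_counts.get(code, 0) + int(n)

top_ten = sorted(running_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_ten)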
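
The sketch below is a hypothetical illustration (not project code) of a simple custom tokenization scheme that flattens a patient's longitudinal record into a token sequence on which a transformer-based language model could be trained and later sampled to produce synthetic records. The field names and special tokens are made up, and a real scheme would be considerably richer.

# Hypothetical sketch: encode one patient's record as a flat token sequence
# for a transformer-based language model.
def tokenize_patient(patient: dict) -> list:
    tokens = ["<BOS>", f"AGE_{patient['age_band']}", f"SEX_{patient['sex']}"]
    # Encode each visit as <VISIT> followed by field tokens, preserving order.
    for visit in sorted(patient["visits"], key=lambda v: v["date"]):
        tokens.append("<VISIT>")
        tokens.append(f"DX_{visit['diagnosis_code']}")
        tokens.append(f"DAYS_{visit['days_since_prev']}")
    tokens.append("<EOS>")
    return tokens

example = {
    "age_band": "40-49",
    "sex": "F",
    "visits": [
        {"date": "2021-01-04", "diagnosis_code": "U07.1", "days_since_prev": 0},
        {"date": "2021-02-11", "diagnosis_code": "J18.9", "days_since_prev": 38},
    ],
}
print(tokenize_patient(example))
# A vocabulary built from these tokens would feed model training; generated
# token sequences would then be decoded back into synthetic records.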
Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.