Project Name:

Synthetic Data Generation with Large, Real-World Data

Contractor: Westat

Lessons Learned

  • This project aims to generate synthetic data and utilize computing space across multiple agencies. Given the involvement of multiple stakeholders and governance requirements, the project has highlighted the need for ongoing collaboration and effective communication. The team has addressed this by holding biweekly full-team meetings and biweekly management meetings, and by creating a detailed timeline with achievable milestones. The involvement of multiple entities and the corresponding legal agreements has required significant time for proper documentation preparation and review. The team has recognized the need to document this process comprehensively for future reference; clearly identifying obstacles and offering solutions or recommendations will be crucial for similar future projects.
  • Synthetic data generation, which relies on a truth source, requires careful selection of variables, including assessments of missingness and levels of granularity. In addition, the data quality of the variables in the truth source may affect which variables are selected to inform the synthetic data generation. The team has addressed this by involving stakeholders and expert users of the truth data to help inform the variable selection process.
  • Establishing a process early that includes understanding the core data structure and selecting a synthetic data generation approach based on the latest advancements, particularly those that accommodate the data’s unique characteristics, is crucial for building a solid foundation for synthetic data generation. The adaptive process in this project draws on discussions with subject matter experts and insights from the literature, and was necessary to address the two main challenges at hand: the complex data structure (e.g., longitudinal patient history) and high dimensionality.
  • Leveraging existing open-source methods that were vetted by subject matter experts and released under proper software licenses strengthens the foundation of the open-source synthetic data generation tool. In this project, specifically, the team consulted with subject matter experts and secured appropriate licenses for all open-source tools considered while exploring alternative methods. Utilizing pre-existing code that generates data of similar structure offers transparency and builds on previous lessons learned.
  • Early assessment of scalability and interoperability requirements, and alignment with national initiatives in advanced computing, will enhance the efficiency of code deployment. This will, in turn, accelerate meeting public needs in a shared service environment. In this project, the team evaluated various software options to ensure these requirements are satisfied.
  • Implementing project risk mitigation plans and exploring ways to streamline data access are encouraged in future synthetic data generation efforts. Within this project, cross-agency collaboration for data access in a secure supercomputer environment requires coordination with project leads and legal teams. While this has posed some risk to the project timeline, the team has developed risk mitigation strategies. These include working with public-use data in the secure supercomputer environment to become familiar with the environment and to troubleshoot computer programs and code, ensuring readiness for when all agreements are signed and the actual real-world data becomes available.
  • The National Clinical Cohort Collaborative (N3C) COVID enclave environment works well for a limited dataset; however, to synthesize the high-dimensional data, a secure supercomputing environment (like that offered by ORNL) that can handle computationally intensive tasks is needed. While the N3C COVID enclave supports secure, sequential processing on slices of data, it has limitations in RAM and concurrent processing. This reinforced the importance of early evaluation of computational needs, workflow design, and performance testing to ensure alignment with technical and operational requirements.
  • Engaging with support resources for the ORNL secure enclave environment has been highly beneficial. Regular use of office hours, technical documentation, and support channels has provided insights into both system capabilities/limitations and data handling. These resources greatly improved troubleshooting, informed workflow design, and enhanced data preparedness. This experience highlights the importance of consistent, accessible support mechanisms and active engagement with them, particularly in complex environments. For future NSDS shared services, concierge-style guidance will be essential, especially when navigating advanced AI tools in unfamiliar platforms.
  • Navigating interagency and subcontracting agreements requires sustained coordination across legal, administrative, and technical teams. Targeted agreements for specific project components have been more feasible than attempting one comprehensive agreement. For example, initiating an agreement using synthetic data with the Oak Ridge National Laboratory enabled work to begin while broader negotiations for the true data access continue. This phased approach enabled steady progress and is informing best practices for agreement development in multi-agency collaborations.
  • The complexity of the N3C data reinforced the importance of using advanced AI tools capable of modeling nuanced temporal and relational structures. Open-source Large Language Models (LLMs), built on transformer architectures, offer a flexible, efficient way to capture complex dependencies without requiring proprietary software or predefined relationship specifications. Their ability to capture complex dependency structures in relational tables, where traditional AI models like generative adversarial networks struggle, and to incorporate custom tokenization makes LLMs especially well-suited for this context. Ongoing evaluation of emerging tools remains critical to ensuring scalability and methodological alignment with synthetic data generation best practices.
  • The complexity of input data (specifically, N3C data), arising from high dimensionality, variable structures, and true missingness, requires careful planning and flexible strategies. Understanding these characteristics early supports more realistic modeling, reproducibility, and alignment with real-world data. For example, working with continuous variables proved more complex than anticipated, highlighting the need for specialized approaches and additional computational consideration. Tailoring preprocessing and modeling techniques to accommodate continuous and high-dimensional features enhances the accuracy and robustness of synthetic data outputs.
  • As datasets may be made available cumulatively or in batches, the analytical approach and supporting tools must remain flexible. Adapting methods to the manner in which data are accumulated or separated, such as modifying preprocessing workflows or refining evaluation metrics for partial datasets, enables the team to maximize the value of each batch while maintaining efficiency throughout the project lifecycle. Each batch also offers an opportunity to learn from both the data and the process itself; for instance, identifying variations across data providers can inform subsequent analyses and support the development of more precise, context-aware models once the full dataset becomes available.
  • As the team encounters diverse data formats and secure supercomputing environments, developing tools and workflows that are environment- and data-agnostic will be increasingly beneficial. The team is working to properly isolate the components that are system- or data-agnostic, allowing solutions to extend more smoothly beyond the environments and datasets used for this project.
  • Temporary suspension of access to the secure N3C enclave required adjustments to project activities and timelines. During this period, the team leveraged code maintained outside the enclave and utilized an alternative computing environment (the secure supercomputer environment available for the project) along with a publicly available, comparable dataset to continue development and testing. This approach enabled progress despite access limitations, allowing workflows and models to be refined and validated in preparation for reentry into the secure environment. The experience underscored the value of adaptable planning to sustain momentum during periods of restricted access, and of maintaining code and tools outside any single environment, since relying solely on one environment where access is not always guaranteed creates project risk.
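
The transformer-based approach noted above depends on a tokenization scheme tailored to tabular records. The following is a minimal, hypothetical Python sketch of such a custom tokenization step, showing only how rows could be serialized to and from token-id sequences; the column names and values are invented for illustration, and a real pipeline (as in this project) would pair such an encoding with a trained generative model.

```python
# Hypothetical sketch: encode tabular rows as token-id sequences so a
# sequence model (e.g., an LLM) can learn dependencies between fields.
class TableTokenizer:
    """Maps (column, value) pairs to integer token ids and back."""

    def __init__(self):
        self.token_to_id = {"<bos>": 0, "<eos>": 1}
        self.id_to_token = {0: "<bos>", 1: "<eos>"}

    def encode_row(self, row):
        """Serialize a dict row into a token-id sequence."""
        ids = [self.token_to_id["<bos>"]]
        for column, value in row.items():
            tok = f"{column}={value}"
            if tok not in self.token_to_id:
                idx = len(self.token_to_id)
                self.token_to_id[tok] = idx
                self.id_to_token[idx] = tok
            ids.append(self.token_to_id[tok])
        ids.append(self.token_to_id["<eos>"])
        return ids

    def decode_row(self, ids):
        """Recover a dict row from a token-id sequence."""
        row = {}
        for i in ids:
            tok = self.id_to_token[i]
            if tok in ("<bos>", "<eos>"):
                continue
            column, value = tok.split("=", 1)
            row[column] = value
        return row

# Illustrative record with invented fields; real N3C fields differ.
tokenizer = TableTokenizer()
ids = tokenizer.encode_row({"sex": "F", "age_band": "40-49", "dx": "U07.1"})
assert tokenizer.decode_row(ids) == {"sex": "F", "age_band": "40-49", "dx": "U07.1"}
```

Pairing each value with its column name in the token keeps the vocabulary interpretable and lets one encoding handle tables whose relational structure is not specified in advance.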
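
The lesson about continuous variables requiring specialized preprocessing can be illustrated with one common tactic: discretizing a continuous variable into quantile bins before token-based modeling. This is only a sketch under stated assumptions (the variable, bin count, and values are illustrative, and the project may use a different transformation), using only the Python standard library.

```python
# Hedged sketch: quantile binning as one way to tame continuous
# variables before discrete, token-based synthesis.
import statistics

def quantile_bins(values, n_bins=4):
    """Return cut points splitting `values` into n_bins quantile bins."""
    return statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points

def to_bin(value, cuts):
    """Map a continuous value to its bin index given quantile cut points."""
    for i, cut in enumerate(cuts):
        if value <= cut:
            return i
    return len(cuts)

# Illustrative ages; quantile cuts yield roughly equal-sized bins.
ages = [23, 31, 35, 42, 47, 55, 61, 68]
cuts = quantile_bins(ages, n_bins=4)
binned = [to_bin(a, cuts) for a in ages]
```

Quantile cuts keep bin populations balanced even for skewed distributions, which matters when bin labels become tokens with learned frequencies.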
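
The batch-wise data delivery lesson can likewise be sketched in code: evaluation statistics maintained incrementally per data provider, so each partial delivery is assessed as it arrives rather than waiting for the full dataset. The provider names and values below are invented for illustration, not drawn from the project's data.

```python
# Hedged sketch: running per-provider statistics updated batch by batch.
from collections import defaultdict

class RunningStats:
    """Accumulates count and mean incrementally (online mean update)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, values):
        for v in values:
            self.count += 1
            self.mean += (v - self.mean) / self.count

stats_by_provider = defaultdict(RunningStats)

# Batches may arrive cumulatively; each is folded in as it lands,
# and per-provider tallies expose variation across data providers.
batch_1 = {"site_a": [1.0, 2.0], "site_b": [10.0]}
batch_2 = {"site_a": [3.0], "site_b": [20.0, 30.0]}
for batch in (batch_1, batch_2):
    for provider, values in batch.items():
        stats_by_provider[provider].update(values)
```

Keeping statistics per provider, as the incremental updates above do, is one way to surface the cross-provider variation the lesson describes before the full dataset is available.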

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.