Project Name:

Measuring Large Language Model Understanding of Federal Statistical Data

Contractor: NORC at the University of Chicago

Lessons Learned

  • Balance breadth of prompt coverage. Balancing prompts across scenarios (Data Discovery, Access, and Use) and personas (Casual, Intermediate, and Experienced) prevents over-indexing on advanced analysis prompts and ensures that the evaluation reflects a range of realistic user questions (e.g., about data update frequency).
  • Design prompts to support one-to-many data asset relationships. Real users’ questions often span domains (e.g., education and income). Allowing prompts that can depend on multiple data assets reflects realistic workflows and can expose cross-dataset metadata dependencies.
  • Distinguish domain expertise from technical expertise. Generic personas (“Experienced, Intermediate, and Casual”) provide a helpful scaffold, but splitting experienced users into domain experts and data scientists better reflects real-world diversity in user needs and helps align responses to the appropriate context.
  • Early engagement streamlined the PRA clearance process. Early planning and proactive engagement with our government partners enabled a quick and efficient Paperwork Reduction Act (PRA) clearance process. By sharing our user engagement plan with government clients early, they were able to coordinate with the appropriate officials regarding the planned outreach. A brief consultation with the Commerce PRA Clearance Officer confirmed that PRA clearance would be required for the effort. The team then promptly submitted the necessary documentation, which initiated an expedited five-day review. At the conclusion of that review, the team received OMB approval to move forward with outreach activities. This experience reinforced the value of early coordination, clear communication, and advance preparation when navigating PRA requirements.
  • Understand the target IT infrastructure. The contractor needs a clear understanding of the IT infrastructure the application will ultimately run in. Because this is R&D, we are still exploring what to build and what to deliver, which has raised many questions about the long-term maintainability of the application delivered at the end. How the software is developed has cost implications, and development should account for the fact that the underlying AI models are constantly changing.
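As an illustration of the balance point above, the following is a minimal sketch of spreading a fixed prompt budget evenly across the scenario × persona grid so that no single cell (e.g., advanced analysis prompts) dominates the evaluation set. The labels, budget, and helper function are assumptions for illustration only, not the project's actual tooling.

```python
from itertools import product

# Scenario and persona labels taken from the lessons above; the
# quota helper itself is a hypothetical illustration.
SCENARIOS = ["Data Discovery", "Access", "Use"]
PERSONAS = ["Casual", "Intermediate", "Experienced"]

def balanced_quota(total_prompts: int) -> dict:
    """Spread a prompt budget as evenly as possible across every
    scenario x persona cell (any remainder goes to the first cells)."""
    cells = list(product(SCENARIOS, PERSONAS))
    base, extra = divmod(total_prompts, len(cells))
    return {cell: base + (1 if i < extra else 0)
            for i, cell in enumerate(cells)}

quota = balanced_quota(100)  # 9 cells: each gets 11 or 12 prompts
```

A stratified allocation like this makes over-indexing visible at design time: any cell whose count drifts far from the others signals that the prompt set no longer reflects the intended range of user questions.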

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to [email protected].