Project Name:
Artificial Intelligence for Enhancing Data Quality, Standardization, and Integration for Federal Statistics
Contractor: NORC at the University of Chicago
Lessons Learned
- To enhance data quality, standardization, and integration in the federal statistical system, prioritize opportunities for (a) survey data, (b) public sector administrative records, (c) third-party or private sector data, and (d) geospatial data.
- Availability of data documentation is critical for data use and integration across data types.
- When survey, administrative, or third-party data are used for purposes other than those for which they were originally collected, such as to link records, determinations about data quality hinge on representativeness and fitness for use.
- A review of existing AI tools for non-spatial (tabular) and spatial data revealed that machine learning and rule-based methods are more commonly applied than large language models (LLMs) or related techniques.
- Considerations for the use of AI include balancing data privacy with utility and disclosing the AI sources and methods used, to ensure explainability for decision-making.
- Key techniques for mitigating AI risks include keeping humans in the loop to direct and review AI performance in acquiring, describing, and transforming data for use.
- AI may be leveraged to automate data cleaning and validation tasks across all data types, such as identifying and correcting errors, inconsistencies, and outliers (an outlier-screening sketch follows this list).
- AI may simplify the integration of geospatial data into statistical applications by automating and standardizing the extraction of structured features, such as road networks and their attributes, from satellite imagery (see the mask-vectorization sketch below).
- Large language models can be leveraged to enhance existing data documentation and metadata, thereby improving data discoverability and usability (see the metadata-drafting sketch below).
- Including humans in the loop when designing AI workflows can ensure higher accuracy and mitigate model hallucinations. AI tool design should facilitate collaboration between systems and humans (a confidence-gate sketch follows this list).
- To address algorithmic biases, strategies for agencies to consider and explore include fairness audits, bias correction techniques, and training novel AI systems on curated, representative data (a fairness-audit sketch follows this list).
- Transparency and explainability when using AI can be upheld by disclosing which records and variables have been used, as well as the methods applied in data processing and integration, allowing users to understand potential limitations that may influence model performance (a provenance-record sketch follows this list).
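The sketches below illustrate several of the lessons above. All file names, column names, model choices, and thresholds are assumptions for illustration only, not project specifications or deliverables.

The first sketch corresponds to the data-cleaning lesson. It assumes a hypothetical tabular extract (survey_extract.csv) and combines a simple rule-based consistency check with scikit-learn's IsolationForest, flagging candidate outliers for human review rather than correcting them automatically.

```python
# Minimal sketch: flag candidate outliers in a hypothetical tabular extract
# for human review. File name, column names, and the contamination rate
# are illustrative assumptions, not project specifications.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("survey_extract.csv")          # hypothetical input file
numeric_cols = ["income", "hours_worked"]       # assumed numeric fields

# Simple rule-based consistency check (e.g., negative reported values).
df["rule_flag"] = (df["hours_worked"] < 0) | (df["income"] < 0)

# Model-based screening: IsolationForest labels roughly 1% of records
# as anomalous (-1); these are flagged, not silently corrected.
model = IsolationForest(contamination=0.01, random_state=0)
df["ml_flag"] = model.fit_predict(df[numeric_cols].fillna(0)) == -1

review_queue = df[df["rule_flag"] | df["ml_flag"]]
review_queue.to_csv("records_for_review.csv", index=False)
```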
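For the geospatial lesson, the sketch below assumes a segmentation model has already produced a road-probability raster aligned to a GeoTIFF tile (roads_prob.tif, a hypothetical file). It shows only the downstream standardization step: thresholding the raster and vectorizing it into georeferenced road features with rasterio and shapely.

```python
# Minimal sketch: turn a model-predicted road-probability raster into
# vector features. The input file and the 0.5 threshold are assumptions;
# the segmentation model that produced the raster is out of scope here.
import rasterio
from rasterio import features
from shapely.geometry import shape

with rasterio.open("roads_prob.tif") as src:       # hypothetical model output
    prob = src.read(1)                             # single-band probabilities
    transform = src.transform
    crs = src.crs

mask = (prob >= 0.5).astype("uint8")               # assumed decision threshold

# Vectorize contiguous road pixels into georeferenced polygons.
road_polygons = [
    shape(geom)
    for geom, value in features.shapes(mask, transform=transform)
    if value == 1
]
print(f"Extracted {len(road_polygons)} road features in CRS {crs}")
```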
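The metadata lesson can be sketched as a prompt asking an LLM to draft a plain-language variable description from a terse codebook entry. The client library, model name, codebook entry, and prompt wording are illustrative assumptions; the draft is treated as a suggestion for a human documentarian to review, not as authoritative metadata.

```python
# Minimal sketch: ask an LLM to draft a plain-language description of a
# tersely documented variable. Model name and prompt wording are assumed;
# the output is a draft to be reviewed by a human before publication.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

codebook_entry = "VAR: EMPSTAT  VALUES: 1=EMP 2=UNEMP 3=NILF  SRC: CPS-style item"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice
    messages=[
        {"role": "system",
         "content": "You write concise, plain-language metadata for statistical variables."},
        {"role": "user",
         "content": f"Draft a one-paragraph description and value labels for:\n{codebook_entry}"},
    ],
)

draft_description = response.choices[0].message.content
print(draft_description)  # routed to a documentation reviewer, not auto-published
```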
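The human-in-the-loop lesson is sketched as a simple confidence gate: AI suggestions above an assumed threshold pass through, and everything else is queued for manual review. The record structure, confidence scores, and threshold are hypothetical.

```python
# Minimal sketch: route low-confidence AI outputs to human reviewers.
# The record structure, confidence scores, and 0.9 threshold are assumptions.
from dataclasses import dataclass

@dataclass
class AiSuggestion:
    record_id: str
    suggested_value: str
    confidence: float

REVIEW_THRESHOLD = 0.9  # assumed cutoff; tune against audited samples

def triage(suggestions: list[AiSuggestion]) -> tuple[list[AiSuggestion], list[AiSuggestion]]:
    """Split suggestions into auto-accepted and human-review queues."""
    accepted = [s for s in suggestions if s.confidence >= REVIEW_THRESHOLD]
    needs_review = [s for s in suggestions if s.confidence < REVIEW_THRESHOLD]
    return accepted, needs_review

accepted, needs_review = triage([
    AiSuggestion("r001", "standardized_agency_name", 0.97),
    AiSuggestion("r002", "ambiguous_match", 0.62),
])
print(len(accepted), "auto-accepted;", len(needs_review), "sent to reviewers")
```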
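The fairness-audit lesson can be illustrated by one common audit check: comparing positive-classification rates across demographic groups. The toy data, column names, and the 0.1 disparity threshold below are assumptions, not recommended policy values.

```python
# Minimal sketch of one fairness-audit check: compare positive-classification
# rates across demographic groups. Data, column names, and the 0.1 disparity
# threshold are illustrative assumptions.
import pandas as pd

audit = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B"],
    "prediction": [1, 1, 0, 1, 0, 0],   # model's positive/negative calls
})

rates = audit.groupby("group")["prediction"].mean()
disparity = rates.max() - rates.min()   # demographic parity difference

print(rates.to_string())
if disparity > 0.1:
    print(f"Parity gap {disparity:.2f} exceeds audit threshold; investigate.")
```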
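Finally, the transparency lesson is sketched as a provenance record published alongside an integrated dataset, disclosing which sources, variables, and methods were applied. All field names and values are illustrative placeholders.

```python
# Minimal sketch: a provenance record published alongside integrated data so
# users can see which sources, variables, and methods were applied.
# All field values are illustrative placeholders.
import json
from datetime import date

provenance = {
    "output_dataset": "integrated_business_frame_v1",
    "source_records": ["survey_extract_2024", "admin_registry_2023"],
    "variables_used": ["ein", "naics", "employment", "payroll"],
    "methods": [
        {"step": "record_linkage", "approach": "probabilistic (Fellegi-Sunter)"},
        {"step": "outlier_screening", "approach": "IsolationForest + rule checks"},
    ],
    "human_review": True,
    "generated_on": date.today().isoformat(),
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```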
Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to ncsesweb@nsf.gov.