Project Name: Informing Next-generation Federal Statistics

Contractor: Georgetown University

Lessons Learned

  • Novel Strategies Encompass a Broad and Complex Range of Methods at Varying Degrees of Maturity: Novel methods range in scope from those for which a body of research already exists (e.g., working with commercial or satellite data, implementing small area estimation methods) to those that are nascent (e.g., using large language models to assess concepts across a system of records). As a result, any roadmap to federal implementation of alternative frames and estimation strategies will need to address this complexity: some strategies will be implementation-ready, while others will require additional research.
  • Need to Be Judicious in Framing Extent of Problem: One of the main strengths of this project is that it explores a wide variety of new and creative approaches, several of which are planned to be tested in the case studies. However, as we began outlining the case studies and, especially, gathering sources for the literature review, it became clear that even when the focus is narrowed to frame development and related estimation methods, the amount of material in statistics and data science that could be studied remains vast. This work will help inform the federal statistical system by outlining how agencies can be judicious in selecting representative articles that address the questions at hand. Agencies defining a key population to measure, or an estimation strategy, will need selection criteria that prioritize meta-analyses, exemplars of key concerns, and open-access literature.
  • Early Indication of Possible Resource Troves: Open-source resources already exist that can support novel approaches to frame development and estimation. This project will provide a roadmap for using available tools in specific case studies, which can then be applied in other situations. Some examples include the following:
    • The Urbanworm package on GitHub (https://billbillbilly.github.io/urbanworm/) and the Hugging Face repository (https://huggingface.co/) are excellent tools for automating the identification and coding of street and satellite images.
    • The National Neighborhood Data Archive (NaNDA) (https://nanda.isr.umich.edu/) is a robust source of contextual data sets that can be linked to address-based sampling frames.
    • The Transportation Secure Data Center (https://www.nrel.gov/transportation/secure-transportation-data) is a repository, with tiered access, of “high-resolution transportation data from hundreds of travel and transit surveys and studies covering more than 70% of U.S. states,” which facilitates research and testing.
  • Consideration Needs to Be Given to Data Quality: Preliminary case study efforts have identified some early lessons for the feasibility of federal implementation.
    • Careful data discovery is required when initiating work with novel data systems, as data structure drift may occur and may or may not be well documented (e.g., finding information about student family history in a data field titled “religion” within an education data system). This will entail collaboration with data producers and the use (or development) of robust data documentation.
    • Web-scraped data need to be assessed for completeness and patterns of missingness before their fitness for use is judged. For example, patterns of missingness in court data may affect whether constructs such as recidivism or case progression and similarity can be reliably observed in the available records.
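
As an illustration of the completeness check described in the last point, the following is a minimal sketch in Python. It is not the project's method: the records, the field names (case_id, filing_date, disposition, defendant_id), and the helper functions are all hypothetical, invented here to show one way per-field missingness rates and co-occurring missingness patterns might be tabulated for scraped court records.

```python
from collections import Counter


def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missingness_report(records, fields):
    """Return per-field missingness rates and counts of missingness patterns.

    A pattern is the tuple of fields that are missing together in one record;
    the empty tuple means the record is complete on the listed fields.
    """
    n = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for r in records if is_missing(r.get(field)))
        rates[field] = missing / n if n else 0.0
    patterns = Counter(
        tuple(f for f in fields if is_missing(r.get(f))) for r in records
    )
    return rates, patterns


# Hypothetical scraped records; a key absent from a record counts as missing.
records = [
    {"case_id": "A1", "filing_date": "2020-01-05", "disposition": "dismissed", "defendant_id": "D7"},
    {"case_id": "A2", "filing_date": "2020-02-11", "disposition": "", "defendant_id": "D7"},
    {"case_id": "A3", "filing_date": None, "disposition": "", "defendant_id": None},
    {"case_id": "A4", "filing_date": "2021-03-30", "disposition": "convicted"},
]
fields = ["filing_date", "disposition", "defendant_id"]

rates, patterns = missingness_report(records, fields)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
for pattern, count in patterns.most_common():
    print(pattern or "(complete)", count)
```

In a sketch like this, a high missingness rate on a person identifier would suggest that a construct such as recidivism, which depends on linking cases to the same individual, may not be reliably observable, while the pattern counts indicate whether fields tend to be missing together.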

Disclaimer: America’s DataHub Consortium (ADC), a public-private partnership, implements research opportunities that support the strategic objectives of the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation (NSF). These results document research funded through ADC and are being shared to inform interested parties of ongoing activities and to encourage further discussion. Any opinions, findings, conclusions, or recommendations expressed above do not necessarily reflect the views of NCSES or NSF. Please send questions to [email protected].