Bibliography - AI Workflow Design for Official Statistics

alemohammad_2023 — Alemohammad, Sina et al. (2023). Self-Consuming Generative Models Go MAD. arXiv preprint arXiv:2307.01850

azanza_2025 — Azanza, Maider, Perez Lamancha, Beatriz, Pizarro, Emilio. (2025). Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry. Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE 2025)

bastani_2024 — Bastani, Hamsa et al. (2024). Generative AI Without Guardrails Can Harm Learning: Evidence from High School Mathematics. Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.2422633122

becker_2025 — Becker, Joel et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv preprint arXiv:2507.09089. URL: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

bradley_terry_1952 — Bradley, Ralph Allan, Terry, Milton E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345

brodeur_2025_ssrp — Brodeur, Abel et al. (2025). Assessing Reproducibility in Economics Using Standardized Crowd-sourced Analysis. NBER Working Paper No. 33753. URL: https://www.nber.org/system/files/working_papers/w33753/w33753.pdf

brodeur_2026 — Brodeur, A. et al. (2026). Reproducibility and robustness of economics and political science research. Nature, 652, 151-158. DOI: 10.1038/s41586-026-10251-x

bucknerPetty_2019 — Buckner-Petty, Skye, Dale, Ann Marie, Evanoff, Bradley A. (2019). Efficiency of autocoding programs for converting job descriptors into Standard Occupational Classification (SOC) codes. American Journal of Industrial Medicine, 62(1), 59-68. DOI: 10.1002/ajim.22928

carlini_2021_extracting — Carlini, Nicholas et al. (2021). Extracting Training Data from Large Language Models. 30th USENIX Security Symposium (USENIX Security 21), 2633-2650

census_fedcasic_2024 — U.S. Census Bureau. (2024). Machine Learning for In-Instrument Product Code Search: SINCT for NAPCS Coding. FedCASIC 2024 Conference Presentation

chen_2026_sweci — Chen, Jialong et al. (2026). SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration. arXiv preprint arXiv:2603.03823. URL: https://arxiv.org/abs/2603.03823

chroma_2024_chunking — Chroma. (2024). Evaluating Chunking Strategies for Retrieval. URL: https://www.trychroma.com/research/evaluating-chunking

dell_acqua_2023 — Dell’Acqua, Fabrizio et al. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper No. 24-013

desai_2016 — Desai, Tanvi, Ritchie, Felix, Welpton, Richard. (2016). Five Safes: Designing Data Access for Research. University of the West of England. URL: https://www2.uwe.ac.uk/faculties/BBS/Documents/1601.pdf

digiuseppe_2026 — DiGiuseppe, Matthew R., Flynn, Michael E. (2026). Scaling Open-Ended Survey Responses Using LLM-Paired Comparisons. Public Opinion Quarterly, nfag013

doncio_raise2_2024 — Department of the Navy, Chief Information Officer. (2024). RAISE 2.0: Risk Management Framework Assessment and Implementation Steps and Examples

dora_2024 — DORA Team, Google Cloud. (2024). Accelerate State of DevOps Report 2024. Google Cloud. URL: https://dora.dev/research/2024/dora-report/

fagerberg_2025 — Fagerberg, Pontus, Sallander, Oskar, Vikhe Patil, Koustubh et al. (2025). Dual-Model LLM Ensemble via Web Chat Interfaces Reaches Near-Perfect Sensitivity for Systematic-Review Screening. medRxiv preprint

fan_2024_metacognitive — Fan, Yizhou et al. (2024). Beware of Metacognitive Laziness: Effects of Generative Artificial Intelligence on Learning Motivation, Processes, and Performance. URL: https://arxiv.org/abs/2412.09315

fcsm_2020_data_quality — Federal Committee on Statistical Methodology. (2020). A Framework for Data Quality. Federal Committee on Statistical Methodology

fcsm_2025_aiready — Hoppe, Travis et al. (2025). AI-Ready Federal Statistical Data: An Extension of Communicating Data Quality. Federal Committee on Statistical Methodology

fedramp_20x_2025 — General Services Administration. (2025). GSA Announces FedRAMP 20x. URL: https://www.gsa.gov/about-us/newsroom/news-releases/gsa-announces-fedramp-20x-03242025

fellegi_1976 — Fellegi, Ivan P., Holt, D. Tim. (1976). A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association, 71(353), 17-35

fortier_2011 — Fortier, Isabel, Doiron, Dany, Little, Julian et al. (2011). Is Rigorous Retrospective Harmonization Possible? Application of the DataSHaPER Approach Across 53 Large Studies. International Journal of Epidemiology, 40(5), 1314-1328

fortier_2017 — Fortier, Isabel, Raina, Parminder, van den Heuvel, Edwin R. et al. (2017). Maelstrom Research Guidelines for Rigorous Retrospective Data Harmonization. International Journal of Epidemiology, 46(1), 103-115

gama_2014 — Gama, João, Žliobaitė, Indrė et al. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys, 46(4)

gotel_1994 — Gotel, Orlena C. Z., Finkelstein, Anthony C. W. (1994). An Analysis of the Requirements Traceability Problem. Proceedings of the 1st International Conference on Requirements Engineering, 94-101

groves_2009 — Groves, Robert M. et al. (2009). Survey Methodology. Wiley

gu_2024 — Gu, Jiawei, Jiang, Xuhui, Shi, Zhichao, Guo, Jialiang et al. (2024). A Survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594

hogan_2021 — Hogan, Aidan et al. (2021). Knowledge Graphs. ACM Computing Surveys, 54(4), 1-37

huang_2024 — Huang, Jie et al. (2024). Large Language Models Cannot Self-Correct Reasoning Yet. Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)

kalyuga_2003 — Kalyuga, Slava et al. (2003). The Expertise Reversal Effect. Educational Psychologist, 38(1), 23-31. DOI: 10.1207/S15326985EP3801_4

kamen_2025 — Kamen, Ali, Kamen, Yonatan. (2025). Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization. arXiv preprint

knoxsystems_fedramp_2026 — Knox Systems. (2026). FedRAMP Authorization Timeline: A Comprehensive Guide

landis_1977 — Landis, J. Richard, Koch, Gary G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174

lee_2025 — Lee, Harrison et al. (2025). RefineBench: Evaluating Iterative Self-Refinement in Large Language Models

lee_2026_metaharness — Lee, Yoonho et al. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052. URL: https://arxiv.org/abs/2603.28052

li_2023_order_sensitivity — Pezeshkpour, Pouya, Hruschka, Estevam. (2023). Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions. arXiv preprint arXiv:2308.11483. URL: https://arxiv.org/abs/2308.11483

li_d_2024 — Li, Dawei et al. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge. arXiv preprint arXiv:2411.16594

li_h_2024 — Li, Haitao et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv preprint arXiv:2412.05579

lin_2025 — Lin, Yiming et al. (2025). TWIX: Automatically Reconstructing Structured Data from Templatized Documents. Proceedings of the ACM on Management of Data

liu_2024_chatgpt_behavior — Chen, Lingjiao, Zaharia, Matei, Zou, James. (2024). How Is ChatGPT’s Behavior Changing over Time? Harvard Data Science Review. URL: https://hdsr.mitpress.mit.edu/pub/y95zitmz

liu_2025_sejury — Liu, Xiaoyu et al. (2025). SE-Jury: An LLM-as-Ensemble-Judge Metric for Narrowing the Gap with Human Evaluation in Software Engineering. arXiv preprint

madaan_2023 — Madaan, Aman et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems 36 (NeurIPS 2023)

mazeika_2025 — Mazeika, Mantas, Gatti, Alice, Menghini, Cristina, Sehwag, Udari Madhushani, Singhal, Shivam, Orlovskiy, Yury, Basart, Steven et al. (2025). Remote Labor Index: Measuring AI Automation of Remote Work. arXiv preprint arXiv:2510.26787. URL: https://arxiv.org/abs/2510.26787

microsoft_azure_gov_features — Microsoft. (2026). Feature Availability for Azure Government. URL: https://learn.microsoft.com/en-us/azure/security/fundamentals/feature-availability

miske_2026 — Miske, O. et al. (2026). Investigating the reproducibility of the social and behavioural sciences. Nature, 652, 126-134. DOI: 10.1038/s41586-026-10203-5

mitre_aida_timelines — MITRE Corporation. (2026). Timelines - AiDA: Acquisition in the Digital Age. URL: https://aida.mitre.org/demystifying-dod/timelines/

morris_2025_verasight — Morris, David. (2025). The Risks of Using LLM Imputation of Survey Data to Produce “Synthetic Samples”. Verasight. URL: https://www.verasight.io/reports/synthetic-sampling-2

nccoe_agent_identity_2026 — Booth, Harold et al. (2026). Accelerating the Adoption of Software and AI Agent Identity and Authorization. National Cybersecurity Center of Excellence, National Institute of Standards and Technology

nist_aasi_2026 — National Institute of Standards and Technology, Center for AI Standards and Innovation. (2026). AI Agent Standards Initiative: Ensuring a Trusted, Interoperable, and Secure Agentic Frontier. Program page, created 2026-02-17

nist_ai_rmf_2023 — National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST

nist_caisi_rfi_2026 — National Institute of Standards and Technology, Center for AI Standards and Innovation. (2026). Request for Information Regarding Security Considerations for Artificial Intelligence Agents. National Institute of Standards and Technology, U.S. Department of Commerce

nist_csf_2024 — National Institute of Standards and Technology. (2024). The NIST Cybersecurity Framework (CSF) 2.0. National Institute of Standards and Technology. DOI: 10.6028/NIST.CSWP.29

nist_cyber_ai_profile_2025 — Megas, Katerina et al. (2025). Cybersecurity Framework Profile for Artificial Intelligence (Cyber AI Profile). National Institute of Standards and Technology, U.S. Department of Commerce

nist_genai_2024 — National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST

oecd_digital_education_2026 — Organisation for Economic Co-operation and Development. (2026). OECD Digital Education Outlook 2026. OECD Publishing. URL: https://www.oecd.org/en/publications/oecd-digital-education-outlook-2026_062a7394-en.html

omb_ai_procurement_2025 — Office of Management and Budget. (2025). Responsible Procurement of Artificial Intelligence in Government. Executive Office of the President

ouyang_2025_nondeterminism — Ouyang, Siyuan et al. (2025). Non-Determinism of “Deterministic” LLM System Settings in Hosted LLMs. Eval4NLP 2025. URL: https://aclanthology.org/2025.eval4nlp-1.12.pdf

owasp_agentic_top10_2026 — OWASP GenAI Security Project, Agent Security Initiative. (2025). OWASP Top 10 for Agentic Applications 2026. Published December 9, 2025; peer-reviewed by 100+ experts

pan_2024 — Pan, Linyi et al. (2024). Spontaneous Reward Hacking in Iterative Self-Refinement

pdf_parser_benchmark_2026 — Horn, Pius, Keuper, Janis. (2026). Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation. arXiv preprint

pike_1989 — Pike, Rob. (1989). Notes on Programming in C. Bell Labs internal document. URL: https://www.lysator.liu.se/c/pikestyle.html

reveal_2025 — Reveal Global Consulting. (2025). ACS Autocoder Modernization: LLM Embeddings and Fine-Tuning for Occupation and Industry Coding. Technical report

santos_2025_harmonia — Santos, Aecio et al. (2025). Interactive Data Harmonization with LLM Agents: Opportunities and Challenges. arXiv preprint arXiv:2502.07132. URL: https://arxiv.org/abs/2502.07132

shapira_2026 — Shapira, Natalie et al. (2026). Agents of Chaos. URL: https://arxiv.org/abs/2602.20021

shapiro_2026 — Shapiro, Dan. (2026). The Five Levels: From Spicy Autocomplete to the Dark Factory. danshapiro.com

shumailov_2024 — Shumailov, Ilia et al. (2024). AI Models Collapse When Trained on Recursively Generated Data. Nature, 631(8022), 755-759

simmhan_2005 — Simmhan, Yogesh L., Plale, Beth, Gannon, Dennis. (2005). A Survey of Data Provenance in e-Science. ACM SIGMOD Record, 34(3), 31-36

song_2025_correlated_errors — Song, Zhi et al. (2025). Correlated Errors in Large Language Models. ICML 2025. URL: https://icml.cc/virtual/2025/poster/44225

tada_vldb_2024 — Parciak, Marcel et al. (2024). Schema Matching with Large Language Models: An Experimental Study. VLDB 2024 Workshop: Tabular Data Analysis Workshop (TaDA)

tam_2024 — Tam, Thanh Yen Caelen, Sivarajkumar, Sonish, Kapoor, Sumit et al. (2024). A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. NPJ Digital Medicine, 7, 258

templ_2026 — Templ, Matthias. (2026). AI-Assisted Statistical Disclosure Control with sdcMicro. R package vignette, R Foundation

tian_2025 — Tian, Weiyushi (Sarah) et al. (2025). Comparing Large Language Models and Traditional Methods for Imputing Missing Survey Responses in a 2024 U.S. Presidential Election Survey. AAPOR 2025 Annual Conference

tripathi_2025 — Tripathi, Tuhina et al. (2025). Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation. arXiv preprint arXiv:2504.14716

van_buuren_2011 — van Buuren, Stef, Groothuis-Oudshoorn, Karin. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67

wang_2022 — Wang, Xuezhi et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171

webb_2025_opencensus_mcp — Webb, Brock. (2025). Open Census MCP Server: Lessons Learned. Project documentation. URL: https://github.com/brockwebb/open-census-mcp-server

webb_2026_ai4stats — Webb, Brock. (2026). AI for Official Statistics. Zenodo. DOI: 10.5281/zenodo.19206379

webb_2026_concept_mapper — Webb, Brock. (2026). AI-Assisted Federal Survey Harmonization: Cross-Survey Integration Analysis of 47 Census Bureau Demographic Surveys. Draft working paper

webb_2026_crosswalk — Webb, Brock. (2026). When AI Enters Federal Statistics: A Regulatory Crosswalk of NIST AI RMF and FCSM Statistical Quality Standards. Zenodo. DOI: 10.5281/zenodo.18772590

webb_2026_pragmatics — Webb, Brock. (2026). Pragmatics: Delivering Expert Judgment to AI Systems. Zenodo. DOI: 10.5281/zenodo.18913092

widmer_kubat_1996 — Widmer, Gerhard, Kubat, Miroslav. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23, 69-101

wolf_2016 — Wolf, Christof et al. (2016). Harmonizing Survey Questions Between Cultures and Over Time. The SAGE Handbook of Survey Methodology, 502-524

xu_2024 — Xu, Wenda et al. (2024). Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. arXiv preprint

yang_2025 — Yang, Zhangyue, Zhang, Yifan, Wang, Yuxin et al. (2025). A Probabilistic Inference Scaling Theory for LLM Self-Correction. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)