alemohammad_2023 — Alemohammad, Sina et al. (2023). Self-Consuming Generative Models Go MAD. arXiv preprint arXiv:2307.01850
azanza_2025 — Azanza, Maider, Perez Lamancha, Beatriz, Pizarro, Emilio. (2025). Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry. Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE 2025)
bastani_2024 — Bastani, Hamsa et al. (2024). Generative AI Without Guardrails Can Harm Learning: Evidence from High School Mathematics. Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.2422633122
becker_2025 — Becker, Joel et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv preprint arXiv:2507.09089. URL: https://
bradley_terry_1952 — Bradley, Ralph Allan, Terry, Milton E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345
brodeur_2025_ssrp — Brodeur, Abel et al. (2025). Assessing Reproducibility in Economics Using Standardized Crowd-sourced Analysis. NBER Working Paper No. 33753. URL: https://
brodeur_2026 — Brodeur, A. et al. (2026). Reproducibility and robustness of economics and political science research. Nature, 652, 151-158. DOI: 10.1038/s41586-026-10251-x
bucknerPetty_2019 — Buckner-Petty, Skye, Dale, Ann Marie, Evanoff, Bradley A. (2019). Efficiency of autocoding programs for converting job descriptors into Standard Occupational Classification (SOC) codes. American Journal of Industrial Medicine, 62(1), 59-68. DOI: 10.1002/ajim.22928
carlini_2021_extracting — Carlini, Nicholas et al. (2021). Extracting Training Data from Large Language Models. 30th USENIX Security Symposium (USENIX Security 21), 2633-2650
census_fedcasic_2024 — U.S. Census Bureau. (2024). Machine Learning for In-Instrument Product Code Search: SINCT for NAPCS Coding. FedCASIC 2024 Conference Presentation
chen_2026_sweci — Chen, Jialong et al. (2026). SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration. arXiv preprint arXiv:2603.03823. URL: https://
chroma_2024_chunking — Chroma. (2024). Evaluating Chunking Strategies for Retrieval. URL: https://
dell_acqua_2023 — Dell’Acqua, Fabrizio et al. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper No. 24-013
desai_2016 — Desai, Tanvi, Ritchie, Felix, Welpton, Richard. (2016). Five Safes: Designing Data Access for Research. University of the West of England. URL: https://
digiuseppe_2026 — DiGiuseppe, Matthew R., Flynn, Michael E. (2026). Scaling Open-Ended Survey Responses Using LLM-Paired Comparisons. Public Opinion Quarterly, nfag013
doncio_raise2_2024 — Department of the Navy, Chief Information Officer. (2024). RAISE 2.0: Risk Management Framework Assessment and Implementation Steps and Examples
dora_2024 — DORA Team, Google Cloud. (2024). Accelerate State of DevOps Report 2024. Google Cloud. URL: https://
fagerberg_2025 — Fagerberg, Pontus, Sallander, Oskar, Vikhe Patil, Koustubh et al. (2025). Dual-Model LLM Ensemble via Web Chat Interfaces Reaches Near-Perfect Sensitivity for Systematic-Review Screening. medRxiv preprint
fan_2024_metacognitive — Fan, Yizhou et al. (2024). Beware of Metacognitive Laziness: Effects of Generative Artificial Intelligence on Learning Motivation, Processes, and Performance. URL: https://
fcsm_2020_data_quality — Federal Committee on Statistical Methodology. (2020). A Framework for Data Quality. Federal Committee on Statistical Methodology
fcsm_2025_aiready — Hoppe, Travis et al. (2025). AI-Ready Federal Statistical Data: An Extension of Communicating Data Quality. Federal Committee on Statistical Methodology
fedramp_20x_2025 — General Services Administration. (2025). GSA Announces FedRAMP 20x. URL: https://
fellegi_1976 — Fellegi, Ivan P., Holt, D. Tim. (1976). A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association, 71(353), 17-35
fortier_2011 — Fortier, Isabel, Doiron, Dany, Little, Julian et al. (2011). Is Rigorous Retrospective Harmonization Possible? Application of the DataSHaPER Approach Across 53 Large Studies. International Journal of Epidemiology, 40(5), 1314-1328
fortier_2017 — Fortier, Isabel, Raina, Parminder, van den Heuvel, Edwin R. et al. (2017). Maelstrom Research Guidelines for Rigorous Retrospective Data Harmonization. International Journal of Epidemiology, 46(1), 103-115
gama_2014 — Gama, João, Žliobaitė, Indrė et al. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys, 46(4)
gotel_1994 — Gotel, Orlena C. Z., Finkelstein, Anthony C. W. (1994). An Analysis of the Requirements Traceability Problem. Proceedings of the 1st International Conference on Requirements Engineering, 94-101
groves_2009 — Groves, Robert M. et al. (2009). Survey Methodology. Wiley
gu_2024 — Gu, Jiawei, Jiang, Xuhui, Shi, Zhichao, Guo, Jialiang et al. (2024). A Survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594
hogan_2021 — Hogan, Aidan et al. (2021). Knowledge Graphs. ACM Computing Surveys, 54(4), 1-37
huang_2024 — Huang, Jie et al. (2024). Large Language Models Cannot Self-Correct Reasoning Yet. Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)
kalyuga_2003 — Kalyuga, Slava et al. (2003). The Expertise Reversal Effect. Educational Psychologist, 38(1), 23-31. DOI: 10.1207/S15326985EP3801_4
kamen_2025 — Kamen, Ali, Kamen, Yonatan. (2025). Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization. arXiv preprint
knoxsystems_fedramp_2026 — Knox Systems. (2026). FedRAMP Authorization Timeline: A Comprehensive Guide
landis_1977 — Landis, J. Richard, Koch, Gary G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174
lee_2025 — Lee, Harrison et al. (2025). RefineBench: Evaluating Iterative Self-Refinement in Large Language Models
lee_2026_metaharness — Lee, Yoonho et al. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052. URL: https://
li_2023_order_sensitivity — Li, Peiwen, Chen, Tao et al. (2023). Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions. arXiv preprint arXiv:2308.11483. URL: https://
li_d_2024 — Li, Dawei et al. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge. arXiv preprint arXiv:2411.16594
li_h_2024 — Li, Haitao et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv preprint arXiv:2412.05579
lin_2025 — Lin, Yiming et al. (2025). TWIX: Automatically Reconstructing Structured Data from Templatized Documents. Proceedings of the ACM on Management of Data
liu_2024_chatgpt_behavior — Liu, Lingjiao, Ren, Zhitao et al. (2024). How Is ChatGPT’s Behavior Changing over Time? Harvard Data Science Review. URL: https://
liu_2025_sejury — Liu, Xiaoyu et al. (2025). SE-Jury: An LLM-as-Ensemble-Judge Metric for Narrowing the Gap with Human Evaluation in Software Engineering. arXiv preprint
madaan_2023 — Madaan, Aman et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
mazeika_2025 — Mazeika, Mantas, Gatti, Alice, Menghini, Cristina, Sehwag, Udari Madhushani, Singhal, Shivam, Orlovskiy, Yury, Basart, Steven et al. (2025). Remote Labor Index: Measuring AI Automation of Remote Work. arXiv preprint arXiv:2510.26787. URL: https://
microsoft_azure_gov_features — Microsoft. (2026). Feature Availability for Azure Government. URL: https://
miske_2026 — Miske, O. et al. (2026). Investigating the reproducibility of the social and behavioural sciences. Nature, 652, 126-134. DOI: 10.1038/s41586-026-10203-5
mitre_aida_timelines — MITRE Corporation. (2026). Timelines - AiDA: Acquisition in the Digital Age. URL: https://
morris_2025_verasight — Morris, David. (2025). The Risks of Using LLM Imputation of Survey Data to Produce “Synthetic Samples”. Verasight. URL: https://
nccoe_agent_identity_2026 — Booth, Harold et al. (2026). Accelerating the Adoption of Software and AI Agent Identity and Authorization. National Cybersecurity Center of Excellence, National Institute of Standards and Technology
nist_aasi_2026 — National Institute of Standards and Technology, Center for AI Standards and Innovation. (2026). AI Agent Standards Initiative: Ensuring a Trusted, Interoperable, and Secure Agentic Frontier. Program page, created 2026-02-17
nist_ai_rmf_2023 — National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST
nist_caisi_rfi_2026 — National Institute of Standards and Technology, Center for AI Standards and Innovation. (2026). Request for Information Regarding Security Considerations for Artificial Intelligence Agents. National Institute of Standards and Technology, U.S. Department of Commerce
nist_csf_2024 — National Institute of Standards and Technology. (2024). The NIST Cybersecurity Framework (CSF) 2.0. National Institute of Standards and Technology. DOI: 10.6028/NIST.CSWP.29
nist_cyber_ai_profile_2025 — Megas, Katerina et al. (2025). Cybersecurity Framework Profile for Artificial Intelligence (Cyber AI Profile). National Institute of Standards and Technology, U.S. Department of Commerce
nist_genai_2024 — National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST
oecd_digital_education_2026 — Organisation for Economic Co-operation and Development. (2026). OECD Digital Education Outlook 2026. OECD Publishing. URL: https://
omb_ai_procurement_2025 — Office of Management and Budget. (2025). Responsible Procurement of Artificial Intelligence in Government. Executive Office of the President
ouyang_2025_nondeterminism — Ouyang, Siyuan et al. (2025). Non-Determinism of “Deterministic” LLM System Settings in Hosted LLMs. Eval4NLP 2025. URL: https://
owasp_agentic_top10_2026 — OWASP GenAI Security Project, Agent Security Initiative. (2025). OWASP Top 10 for Agentic Applications 2026. Published December 9, 2025; peer-reviewed by 100+ experts
pan_2024 — Pan, Linyi et al. (2024). Spontaneous Reward Hacking in Iterative Self-Refinement
pdf_parser_benchmark_2026 — Horn, Pius, Keuper, Janis. (2026). Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation. arXiv preprint
pike_1989 — Pike, Rob. (1989). Notes on Programming in C. Bell Labs internal document. URL: https://
reveal_2025 — Reveal Global Consulting. (2025). ACS Autocoder Modernization: LLM Embeddings and Fine-Tuning for Occupation and Industry Coding. Technical report
santos_2025_harmonia — Santos, Aecio et al. (2025). Interactive Data Harmonization with LLM Agents: Opportunities and Challenges. arXiv preprint arXiv:2502.07132. URL: https://
shapira_2026 — Shapira, Natalie et al. (2026). Agents of Chaos. URL: https://
shapiro_2026 — Shapiro, Dan. (2026). The Five Levels: From Spicy Autocomplete to the Dark Factory. danshapiro.com
shumailov_2024 — Shumailov, Ilia et al. (2024). AI Models Collapse When Trained on Recursively Generated Data. Nature, 631(8022), 755-759
simmhan_2005 — Simmhan, Yogesh L., Plale, Beth, Gannon, Dennis. (2005). A Survey of Data Provenance in e-Science. ACM SIGMOD Record, 34(3), 31-36
song_2025_correlated_errors — Song, Zhi et al. (2025). Correlated Errors in Large Language Models. ICML 2025. URL: https://
tada_vldb_2024 — Parciak, Marcel et al. (2024). Schema Matching with Large Language Models: An Experimental Study. VLDB 2024 Workshop: Tabular Data Analysis Workshop (TaDA)
tam_2024 — Tam, Thanh Yen Caelen, Sivarajkumar, Sonish, Kapoor, Sumit et al. (2024). A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. npj Digital Medicine, 7, 258
templ_2026 — Templ, Matthias. (2026). AI-Assisted Statistical Disclosure Control with sdcMicro. R package vignette, R Foundation
tian_2025 — Tian, Weiyushi (Sarah) et al. (2025). Comparing Large Language Models and Traditional Methods for Imputing Missing Survey Responses in a 2024 U.S. Presidential Election Survey. AAPOR 2025 Annual Conference
tripathi_2025 — Tripathi, Tuhina et al. (2025). Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation. arXiv preprint arXiv:2504.14716
van_buuren_2011 — van Buuren, Stef, Groothuis-Oudshoorn, Karin. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67
wang_2022 — Wang, Xuezhi et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171
webb_2025_opencensus_mcp — Webb, Brock. (2025). Open Census MCP Server: Lessons Learned. Project documentation. URL: https://
webb_2026_ai4stats — Webb, Brock. (2026). AI for Official Statistics. Zenodo. DOI: 10.5281/zenodo.19206379
webb_2026_concept_mapper — Webb, Brock. (2026). AI-Assisted Federal Survey Harmonization: Cross-Survey Integration Analysis of 47 Census Bureau Demographic Surveys. Draft working paper
webb_2026_crosswalk — Webb, Brock. (2026). When AI Enters Federal Statistics: A Regulatory Crosswalk of NIST AI RMF and FCSM Statistical Quality Standards. Zenodo. DOI: 10.5281/zenodo.18772590
webb_2026_pragmatics — Webb, Brock. (2026). Pragmatics: Delivering Expert Judgment to AI Systems. Zenodo. DOI: 10.5281/zenodo.18913092
widmer_kubat_1996 — Widmer, Gerhard, Kubat, Miroslav. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23, 69-101
wolf_2016 — Wolf, Christof et al. (2016). Harmonizing Survey Questions Between Cultures and Over Time. The SAGE Handbook of Survey Methodology, 502-524
xu_2024 — Xu, Wenda et al. (2024). Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. arXiv preprint
yang_2025 — Yang, Zhangyue, Zhang, Yifan, Wang, Yuxin et al. (2025). A Probabilistic Inference Scaling Theory for LLM Self-Correction. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)