Chapter 12 - Large Language Models for Survey Operations

1. Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import cohen_kappa_score, accuracy_score

No LLM API calls are made in this chapter. All code examples use simulated data that reproduces the statistical patterns observed in published studies. See examples/chapter-12/ for fully runnable scripts.

2. The coding problem in federal statistics

Every major federal survey collects open-ended text responses that must be assigned standardized codes before analysis. The scale is large. The Current Population Survey and the American Community Survey together process roughly six million industry and occupation descriptions per year. The National Death Index processes nearly three million cause-of-death text strings. The Occupational Employment and Wage Statistics program processes over one million occupation descriptions.

These numbers matter because they determine what is economically feasible. Human coders working at 400 to 800 descriptions per day represent a substantial operational cost. Rule-based autocoding systems -- the current standard -- achieve 60 to 80 percent automation rates, leaving the remainder to human review. Improving that automation rate, or reducing the human review burden on the remaining fraction, has direct cost implications for every agency operating at this scale.

See examples/chapter-12/01_coding_problem.py for a full table of federal coding programs and concrete examples of the ambiguity that makes the problem difficult.

2.1 A concrete example: industry coding

Industry coding is harder than it looks. The assignment rule is that NAICS codes the industry of the employer, not the occupation of the respondent. This matters because many descriptions describe what the respondent does, not what the employer makes or sells. “I do IT for a bank” gets coded to Finance and Insurance (NAICS 52), not Information (NAICS 51), because the employer is a bank. A respondent who says “I’m a nurse at a clinic” presents ambiguity: a clinic could be an outpatient office (NAICS 621) or a hospital outpatient department (NAICS 622), and the word “clinic” alone does not resolve it.

Adjacent sector confusion is the dominant error pattern in both human and LLM coding. Professional Services (54) and Information (51) share many workers whose descriptions mention technology. Health Care (62) and Other Services (81) share service workers whose descriptions mention patients or clients. Understanding which confusions are genuinely ambiguous versus which are clear errors is essential for evaluating a coding system.

2.2 Federal automated coding: a brief history

Automated coding in federal statistical programs predates machine learning by decades. The Census Bureau developed deterministic coding systems for industry and occupation in the 1980s using keyword dictionaries and hierarchical matching rules. The National Center for Health Statistics built ACME (Automated Classification of Medical Entities) for cause-of-death coding starting in the 1960s. These rule-based systems reduced human review burden substantially but required constant manual maintenance as language and industries evolved.

The current generation of systems adds machine learning on top of rules. NIOSH introduced the NIOSH Industry and Occupation Computerized Coding System (NIOCCS) in 2014, adopting ML-based coding in 2021; it has since processed more than 100 million records (CDC/NIOSH, 2022). The Census Bureau’s automated coding uses a combination of exact matching, probabilistic matching, and classifier models trained on decades of human-coded data. The Bureau of Labor Statistics OEWS program uses a similar hybrid approach for occupation coding.

LLMs are the latest approach, not the first. The advantage they offer is generalization: an LLM can handle novel descriptions and informal language without requiring explicit rule maintenance. The risk is that they introduce different failure modes than rule-based systems, and those failure modes are less transparent. This chapter is about measuring those failure modes systematically. For NIOCCS specifically, see Chapter 11.

3. How LLM-based coding works

LLM-based coding uses the model’s language understanding to classify text without training a task-specific model. The key tool is the prompt: a structured text input that tells the model what to do, provides the classification scheme, and optionally provides examples.

3.1 Prompt design principles

A coding prompt has three required components: a role instruction that tells the model it is coding for a federal statistical agency, the classification scheme it should use, and the output format specification. A fourth optional component is few-shot examples -- pairs of (description, correct code) that illustrate the classification rule.

The zero-shot prompt (no examples) works reasonably well for clear cases. The few-shot prompt (two to five examples) significantly improves performance on confusable sectors, particularly when the examples are chosen to illustrate the precise distinction the model needs to make. For the 54 versus 51 confusion, examples showing that NAICS codes the employer’s industry rather than the employee’s task are more useful than examples that simply show more retail or health care cases.

See examples/chapter-12/02_prompt_design.py for the build_coding_prompt() function, zero-shot and few-shot prompt examples, and a logging schema.

A short illustrative example of the prompt structure:

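def build_coding_prompt(description, few_shot_examples=None):
    """Build a NAICS sector-coding prompt, optionally with few-shot examples."""
    # Role instruction, task, output format, and classification scheme.
    prompt = (
        "You are an expert industry coder for a federal statistical agency.\n"
        "Assign the most appropriate NAICS 2-digit sector code.\n"
        "Respond with ONLY: XX - Sector Name\n\n"
        "NAICS sectors: ...\n"
    )
    # Optional few-shot examples: (description, correct code) pairs.
    if few_shot_examples:
        for text, code in few_shot_examples:
            prompt += f'  Description: "{text}" -> {code}\n'
    # The record to be coded goes last, so the model completes the "Code:" line.
    prompt += f'\nDescription: "{description}"\nCode: '
    return prompt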
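
The prompt above asks for responses shaped like "XX - Sector Name", so the receiving side needs a parser that turns the raw response into a code or flags it for review. The following is a minimal sketch, not the chapter's own implementation: the function name `parse_sector_response`, the regular expression, and the decision to treat any unrecognized code as unparseable are all assumptions made for illustration.

```python
import re
from typing import Optional

# Two-digit NAICS sector codes (manufacturing, retail trade, and
# transportation each span more than one two-digit value).
VALID_SECTORS = {
    "11", "21", "22", "23", "31", "32", "33", "42", "44", "45", "48", "49",
    "51", "52", "53", "54", "55", "56", "61", "62", "71", "72", "81", "92",
}

# Matches responses shaped like "52 - Finance and Insurance".
_CODE_PATTERN = re.compile(r"^\s*(\d{2})\s*-\s*\S")


def parse_sector_response(raw: str) -> Optional[str]:
    """Return the 2-digit code, or None if the response needs human review."""
    match = _CODE_PATTERN.match(raw)
    if match is None:
        return None  # "UNCLEAR", free text, or malformed output
    code = match.group(1)
    # Assumption: a code outside the scheme is treated like a refusal.
    return code if code in VALID_SECTORS else None
```

Returning `None` rather than raising keeps refusals and malformed output on the same path, so both can be routed to human review.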

3.2 Prompt management for production

Prompts are code. They should be versioned, tested, and logged with the same discipline you would apply to any software component that produces a published statistic.

Prompt versioning means every prompt template has an identifier (v1.0, v1.1, ...) with a changelog explaining what changed and why. When you change the prompt, you document the previous accuracy and the expected change. When a production run is audited, the prompt version is part of the audit record.

Prompt regression testing means that every time you change the prompt, you run the full evaluation dataset against the new prompt before deploying it. This is directly analogous to software regression testing. A prompt change that improves accuracy on the target confusion pair may degrade accuracy on a different sector you were not watching. The evaluation dataset is your test suite. See examples/chapter-12/02_prompt_design.py for an example version registry.
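
A regression check of this kind can be sketched in a few lines. The example below is illustrative only: the function names, the toy records, and the 0.02 tolerance are assumptions, and the records are (true code, predicted code) pairs rather than the chapter's evaluation dataset.

```python
from collections import defaultdict


def per_sector_accuracy(records):
    """records: iterable of (true_code, predicted_code) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for true_code, predicted in records:
        total[true_code] += 1
        correct[true_code] += int(predicted == true_code)
    return {sector: correct[sector] / total[sector] for sector in total}


def find_regressions(baseline, candidate, tolerance=0.02):
    """Sectors whose accuracy fell by more than `tolerance` under the new prompt."""
    base_acc = per_sector_accuracy(baseline)
    cand_acc = per_sector_accuracy(candidate)
    return {
        sector: (base_acc[sector], cand_acc.get(sector, 0.0))
        for sector in base_acc
        if base_acc[sector] - cand_acc.get(sector, 0.0) > tolerance
    }


# Toy data: the new prompt fixes a 54 error but breaks an 81 record.
baseline = [("54", "51"), ("54", "54"), ("81", "81"), ("81", "81")]
candidate = [("54", "54"), ("54", "54"), ("81", "62"), ("81", "81")]
regressions = find_regressions(baseline, candidate)
```

In this toy case overall accuracy is unchanged between the two prompts, yet sector 81 is flagged, which is the kind of side effect a single aggregate number hides.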

Prompt-model interaction means that a prompt optimized for one model may perform differently on another. If your agency standardizes on a FedRAMP-authorized version of one model but evaluates on a different model’s API, the evaluation results may not transfer. Always run final evaluations on the model you will deploy.

Template versus instance: the template is the prompt structure with placeholders (the versioned artifact). The instance is the template with a specific description filled in (the per-record artifact). Log both. The template version identifies the methodology; the instance is what the model actually received. Industry guidance recommends managing prompts as versioned configurations with change logs, running regression test suites across prompt versions, and monitoring per-version quality metrics to detect silent degradations (Anthropic, 2024; OpenAI, 2025; Google, 2024).
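
One way to keep the template and the instance distinct is sketched below. The `PromptVersion` class, the `REGISTRY` dictionary, the shortened template text, and the field names are hypothetical and do not reproduce the registry in the chapter's example scripts.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    version: str        # identifier such as "v1.0"
    template: Template  # prompt structure with a $description placeholder
    changelog: str      # what changed and why


# Hypothetical registry with abbreviated template text.
REGISTRY = {
    "v1.0": PromptVersion(
        "v1.0",
        Template('Description: "$description"\nCode: '),
        "Initial zero-shot prompt.",
    ),
    "v1.1": PromptVersion(
        "v1.1",
        Template('Code the employer, not the task.\nDescription: "$description"\nCode: '),
        "Added employer-versus-task reminder for the 54/51 confusion.",
    ),
}


def render(version: str, description: str) -> dict:
    """Return both artifacts to log: the template version and the filled instance."""
    entry = REGISTRY[version]
    return {
        "template_version": entry.version,
        "prompt_instance": entry.template.substitute(description=description),
    }


record = render("v1.1", "I do IT for a bank")
```

The returned dictionary carries the two things worth logging per record: which methodology was used and exactly what text the model received.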

4. Building the evaluation dataset

Before you can evaluate an LLM coding system, you need a ground-truth evaluation dataset: a set of descriptions where the correct code is known, ideally because they were coded by trained human coders using your agency’s standard procedures. For this chapter, we use a simulated 200-record dataset that reproduces the accuracy patterns and confusion structure reported in published studies.

The dataset covers ten NAICS 2-digit sectors with 20 descriptions each. Simulated LLM responses include realistic per-sector accuracy variation, adjacent-sector confusion patterns, and a two percent refusal rate (responses of “UNCLEAR”). The seed is fixed at 2025 for reproducibility.

See examples/chapter-12/03_evaluation_dataset.py for the full dataset and simulation code.
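
For orientation, a simulation of this shape might look like the sketch below. It is not the code in 03_evaluation_dataset.py: the ten sector codes, the per-sector accuracy values, and the function name `simulate_responses` are placeholder assumptions. Only the 20 records per sector, the two percent refusal rate, the seed of 2025, and the 54/51 and 62/81 confusion pairs come from the text.

```python
import random

# Assumption: the chapter does not list its ten sectors; these are placeholders.
SECTORS = ["23", "42", "51", "52", "54", "56", "61", "62", "72", "81"]
# Confusable pairs named in the chapter (54 vs 51, 62 vs 81).
CONFUSED_WITH = {"54": "51", "51": "54", "62": "81", "81": "62"}
# Hypothetical per-sector accuracies, set lower for sectors 54 and 81.
ACCURACY = {sector: 0.90 for sector in SECTORS}
ACCURACY.update({"54": 0.75, "81": 0.75})


def simulate_responses(seed=2025, per_sector=20, refusal_rate=0.02):
    """Return (true_code, simulated_llm_response) pairs."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    rows = []
    for sector in SECTORS:
        for _ in range(per_sector):
            if rng.random() < refusal_rate:
                response = "UNCLEAR"  # simulated refusal
            elif rng.random() < ACCURACY[sector]:
                response = sector  # correct code
            elif sector in CONFUSED_WITH:
                response = CONFUSED_WITH[sector]  # adjacent-sector error
            else:
                response = rng.choice([s for s in SECTORS if s != sector])
            rows.append((sector, response))
    return rows


dataset = simulate_responses()
```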

Key design notes on the simulation:

5. Evaluation: agreement metrics

Accuracy alone is an insufficient measure for multi-class coding systems with imbalanced classes. A system that always predicts the most common sector would achieve whatever that sector’s base rate is, with zero coding ability. Cohen’s kappa (Cohen, 1960) corrects for chance agreement and is the standard metric for inter-coder reliability comparisons in survey research.
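
The majority-class argument can be checked directly. The sketch below implements unweighted Cohen's kappa in plain Python on a made-up class distribution (60/25/15, an assumption for illustration); `cohen_kappa_score` from scikit-learn, imported in the Setup section, computes the same unweighted statistic.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e).

    Undefined when p_e == 1 (both raters use a single identical label).
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of the two marginal proportions, summed over labels.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical imbalanced truth and a "coder" that always answers 62.
y_true = ["62"] * 60 + ["54"] * 25 + ["81"] * 15
y_majority = ["62"] * 100

baseline_accuracy = accuracy(y_true, y_majority)
baseline_kappa = cohen_kappa(y_true, y_majority)
```

Here the constant predictor scores 60 percent accuracy, equal to the base rate of sector 62, while its kappa is zero because observed agreement equals chance agreement.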

5.1 Overall accuracy and Cohen’s kappa

See examples/chapter-12/04_agreement_metrics.py for the full computation. The key results from the simulated dataset:

The kappa interpretation table that every coding evaluation report should include:

| Range | Interpretation |
| --- | --- |
| 0.81 -- 1.00 | Almost perfect agreement |
| 0.61 -- 0.80 | Substantial agreement |
| 0.41 -- 0.60 | Moderate agreement |
| 0.21 -- 0.40 | Fair agreement |
| 0.00 -- 0.20 | Slight agreement (near chance) |

A kappa below 0.61, that is, "moderate" or lower on the Landis and Koch (1977) scale, would indicate the system is not suitable for operational use without a high human review rate.
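
A small helper can attach the interpretation band to a computed kappa in reports. This is a sketch with a hypothetical function name; the label for negative values is taken from the Landis and Koch scale and does not appear in the table above.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to its Landis and Koch (1977) band."""
    bands = [
        (0.81, "Almost perfect agreement"),
        (0.61, "Substantial agreement"),
        (0.41, "Moderate agreement"),
        (0.21, "Fair agreement"),
        (0.00, "Slight agreement (near chance)"),
    ]
    for lower_bound, label in bands:
        if kappa >= lower_bound:
            return label
    # Negative kappa: agreement worse than chance (not listed in the table).
    return "Poor agreement (below chance)"


def needs_high_review_rate(kappa: float) -> bool:
    """Apply the 0.61 cut-off described in the text."""
    return kappa < 0.61
```

Values that fall between two printed bands, such as 0.805, are assigned to the lower band because each band is tested by its lower bound.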

5.2 Per-sector accuracy

Sector-level accuracy reveals where the system needs improvement. In the simulated results, Other Services (81) and Professional Services (54) are the lowest-performing sectors, consistent with published findings. Both involve heterogeneous employer types that share surface language with adjacent sectors.

See examples/chapter-12/04_agreement_metrics.py for the per-sector bar chart and human-human comparison. Published studies report human-human kappa values of roughly 0.6 to 0.8 or higher for broad occupation and industry groupings, which fall in the substantial to almost-perfect bands of the Landis and Koch (1977) scale. This is the practical ceiling: no automated system should be expected to exceed it, because some cases are genuinely ambiguous even to trained coders.

5.3 Cost and throughput analysis

LLM coding at federal scale is primarily an economic decision. The accuracy question is whether the system meets quality thresholds; the cost question is whether it does so more efficiently than alternatives.

Human coder costs for industry and occupation coding are estimated at $0.50 to $2.00 per record (author's engineering estimate based on federal coder salary scales and typical throughput rates of 400 to 800 records per day), accounting for training, productivity, quality assurance overhead, and supervision. The lower end represents experienced coders on straightforward tasks; the upper end reflects complex cases requiring specialist knowledge.

As an early 2026 illustrative snapshot (verify current pricing before budgeting): end-to-end classification on a typical 1,000-token record costs on the order of fractions of a cent using compact or mini-tier models, and up to roughly one cent per record using flagship frontier models. Both represent a dramatic reduction from human coder costs of $0.50 to $2.00 per record. For a more detailed cost-performance framing, see the model selection section in Chapter 11.

The break-even calculation is simple: at a 30 percent human review rate (typical for a 95 percent accuracy threshold with a large frontier model), total cost is approximately API cost plus 0.30 times human review cost per record. At a human review cost of $1.00 per record, that is roughly $0.30 per record -- still substantially cheaper than full human coding at $0.50 to $2.00.
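
The arithmetic can be written out as a short sketch. The function names are hypothetical, and the one-cent API cost is the illustrative flagship-model figure from the pricing snapshot above, not a quoted price.

```python
def hybrid_cost_per_record(api_cost, review_rate, human_cost):
    """Total cost = API cost + (share routed to review) x (human cost per record)."""
    return api_cost + review_rate * human_cost


def break_even_review_rate(api_cost, human_cost):
    """Review rate at which the hybrid workflow costs the same as full human coding."""
    return (human_cost - api_cost) / human_cost


# Illustrative inputs: $0.01 API cost, 30 percent review rate, $1.00 human cost.
cost = hybrid_cost_per_record(api_cost=0.01, review_rate=0.30, human_cost=1.00)
threshold = break_even_review_rate(api_cost=0.01, human_cost=1.00)
```

With these inputs the hybrid cost comes to about $0.31 per record, and the hybrid workflow stays cheaper than full human coding until the review rate approaches 99 percent, which shows that the review rate rather than the API price dominates the budget.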

The more important planning parameter is throughput. Federal production coding runs are typically batch operations with latency tolerance measured in hours, not milliseconds. Use batch API endpoints rather than synchronous per-call endpoints; they are cheaper and scale better. Plan for 1 to 6 hour turnaround per batch. Capacity planning should estimate records per batch, batches per week, and human review capacity for the residual fraction.

See examples/chapter-12/08_hybrid_workflow.py for the cost-performance table and break-even analysis.

| Model tier | Cost / 1K records | Est. accuracy (2-digit) | Human review rate (95% target) | Effective cost / accepted record |
| --- | --- | --- | --- | --- |
| Large frontier | $2.50 -- $8.00 | 82 -- 88% | ~25 -- 35% | $0.012 -- $0.040 |
| Mid-size | $0.30 -- $0.80 | 75 -- 83% | ~30 -- 45% | $0.005 -- $0.015 |
| Small open-source (on-premise) | ~$0.05 compute | 65 -- 78% | ~40 -- 55% | $0.001 -- $0.005 |
| Human coder only | N/A | 91 -- 93% | 100% | $0.50 -- $2.00 |

Early 2026 illustrative snapshot. Verify current pricing with vendors.

6. Error analysis: understanding failure modes

Accuracy tells you how often the system is right. Error analysis tells you why it is wrong and what to do about it. The three error types in LLM industry coding are:

Adjacent sector errors: the LLM assigned a neighboring sector that represents genuine coding ambiguity. A consultant who “provides IT strategy to clients” could be Professional Services (54) if the primary activity is consulting, or Information (51) if the employer is a software firm. Both human coders and LLMs make errors on these cases, and the achievable accuracy on them has a ceiling set by the ambiguity of the descriptions themselves.

Unrelated sector errors: the LLM assigned a sector that shares no reasonable overlap with the true sector. These are the concerning errors because they indicate a failure to understand the description, not a judgment call on an ambiguous case.

Refusals: the LLM returned “UNCLEAR” or a non-code response. Refusals must be routed to human review. A high refusal rate may indicate the prompt is ambiguous about what to do with difficult cases, or that the model is poorly calibrated for this task.

See examples/chapter-12/06_error_analysis.py for the classification logic and stacked bar chart by sector. The sectors with the highest unrelated error rates are the candidates for targeted prompt revision.
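
The three error types can be separated with a small classifier. The sketch below is not the logic in 06_error_analysis.py: the adjacency set contains only the two confusion pairs named earlier in the chapter, and the function name, outcome labels, and toy records are assumptions.

```python
from collections import Counter

# Adjacent pairs named in the chapter; a real table would list more.
ADJACENT_PAIRS = {frozenset(("54", "51")), frozenset(("62", "81"))}


def classify_outcome(true_code, predicted_code):
    """predicted_code is None when the model refused or the output was unparseable."""
    if predicted_code is None:
        return "refusal"
    if predicted_code == true_code:
        return "correct"
    if frozenset((true_code, predicted_code)) in ADJACENT_PAIRS:
        return "adjacent"
    return "unrelated"


# Toy records: (true code, parsed model code).
records = [("54", "54"), ("54", "51"), ("81", "62"), ("62", "23"), ("81", None)]
tally = Counter(classify_outcome(true, pred) for true, pred in records)
```

Using unordered pairs means a 54 coded as 51 and a 51 coded as 54 are both counted as adjacent errors.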

7. Reproducibility challenges

LLMs are stochastic. The same prompt sent to the same model twice may return different codes, particularly for ambiguous descriptions. This is not acceptable for published statistics, and it is not acceptable for the reproducibility standards that federal agencies are subject to.

7.1 Temperature and majority voting

Setting temperature to zero makes output repeatable within a model version. This is necessary but not sufficient. Run five identical calls with temperature=0 and you should get the same code each time, although some hosted APIs do not guarantee bit-identical output even then. That repeatability disappears when the model version changes. See examples/chapter-12/07_reproducibility.py for a simulation of this effect.

Majority voting -- running the same prompt k times and taking the mode -- reduces variance at temperature > 0 but increases cost by a factor of k. For production coding, temperature=0 with version pinning is usually more practical.
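
Majority voting itself is a few lines of code. In the sketch below the function name and the tie rule are assumptions: ties return `None` so that the record can be sent to human review rather than resolved arbitrarily.

```python
from collections import Counter


def majority_vote(responses):
    """Return (winning_code, agreement_share) for k responses to the same prompt.

    Ties for first place return (None, share) so the record can go to review.
    """
    counts = Counter(responses).most_common()
    top_code, top_count = counts[0]
    share = top_count / len(responses)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, share
    return top_code, share


# k = 5 simulated responses for one ambiguous description.
winner, agreement = majority_vote(["54", "54", "51", "54", "51"])
```

The agreement share is a by-product worth keeping: a record that wins 3 of 5 votes is a weaker assignment than one that wins 5 of 5.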

7.2 Model version pinning

Model version pinning means locking your API calls to a specific dated model identifier, not a floating alias. “gpt-4o” is a floating alias that resolves to whatever version the provider has current. “gpt-4o-2024-11-20” is a specific model version. These are different artifacts with different behavior.

Vendor update schedules are not synchronized with your evaluation cycles. An evaluation run that establishes 94 percent accuracy on a specific model version is not valid for a different model version. Pin the version. Treat it as a dependency. When you upgrade, re-run the evaluation.

For on-premise open-source models, pinning means tracking the model file hash (SHA256) in your artifact management system and storing the model file rather than referencing a container tag.
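
Computing and checking the file hash needs only the standard library. The sketch below uses hypothetical function names; where the expected hash is stored is left to your artifact management system.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-gigabyte model files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_sha256):
    """True only if the file on disk matches the pinned hash."""
    return sha256_of_file(path) == expected_sha256.lower()
```

Running `verify_model_file` at the start of each production batch, and recording the hash in the batch log, ties every coded record to one exact model artifact.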

7.3 The silent update problem

Vendors update models without always announcing the change or preserving behavioral compatibility. A system running at 94 percent accuracy today may run at 87 percent next quarter if the underlying model was silently updated. Monitoring is not optional for any production LLM coding system.

Set up a scheduled evaluation run -- weekly or monthly -- against your held-out validation set. Alert if accuracy drops more than two percentage points from the established baseline. The evaluation infrastructure you built for initial deployment is also your ongoing monitoring infrastructure.
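
The alert rule can be expressed as a single comparison. This is a sketch with a hypothetical function name; accuracies are passed as proportions, and the rounding step is there only to keep floating-point noise from triggering an alert at exactly two points.

```python
def accuracy_drift_alert(baseline_accuracy, current_accuracy, max_drop_points=2.0):
    """True if accuracy fell by more than `max_drop_points` percentage points."""
    drop_points = round((baseline_accuracy - current_accuracy) * 100, 6)
    return drop_points > max_drop_points


# The scenario from the text: 94 percent at deployment, 87 percent next quarter.
alert = accuracy_drift_alert(baseline_accuracy=0.94, current_accuracy=0.87)
```

On a small validation set a two-point drop can be sampling noise, so the alert is best read as a trigger for investigation rather than proof of a model change.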

7.4 Prompt-response logging

Every production inference call should produce a structured log record. See examples/chapter-12/07_reproducibility.py for the recommended log schema. The minimum fields are: prompt template version, full prompt instance, raw model response (verbatim), parsed code, model identifier, temperature, timestamp, batch ID, and routing decision (auto-accepted or human review). These logs are your reproducibility record. Without them, you cannot demonstrate to an auditor what the system actually did.
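
As an illustration of the minimum fields, a log record might be structured as below. The class name, field names, and sample values (including the batch ID) are hypothetical and are not the schema in 07_reproducibility.py; the record is serialized as one JSON line per inference call.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InferenceLogRecord:
    prompt_template_version: str
    prompt_instance: str          # exactly what the model received
    raw_response: str             # verbatim, before any parsing
    parsed_code: Optional[str]    # None for refusals or unparseable output
    model_id: str                 # pinned, dated identifier, not an alias
    temperature: float
    timestamp: str                # ISO 8601, UTC
    batch_id: str
    routing_decision: str         # "auto_accepted" or "human_review"


def to_json_line(record: InferenceLogRecord) -> str:
    return json.dumps(asdict(record), ensure_ascii=False)


example = InferenceLogRecord(
    prompt_template_version="v1.1",
    prompt_instance='Description: "I do IT for a bank"\nCode: ',
    raw_response="52 - Finance and Insurance",
    parsed_code="52",
    model_id="gpt-4o-2024-11-20",
    temperature=0.0,
    timestamp=datetime.now(timezone.utc).isoformat(),
    batch_id="batch-0001",  # hypothetical identifier
    routing_decision="auto_accepted",
)
line = to_json_line(example)
```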

8. Privacy and security considerations

Survey responses sent to an LLM API leave the federal security boundary. Whether that is permissible depends on the legal authority under which the data were collected, the data’s sensitivity classification, and the authorization status of the receiving service.

For data collected under Title 13, CIPSEA, or the Privacy Act, sending records to an unapproved commercial API is a legal violation, not just a risk management concern. The appropriate deployment paths are a FedRAMP-authorized cloud service with a data use agreement (moderate risk, requires legal review), or an on-premise open-source model running on agency hardware (lowest risk, highest setup cost). See examples/chapter-12/ for a full privacy risk assessment of each approach.

The de-identification option -- removing PII before sending descriptions to an external API -- deserves a specific caveat: removing employer names and location information to reduce PII risk may also remove the context that makes the descriptions codeable. A description of “I manage the store” loses its NAICS identity when the employer name is removed. Test de-identified accuracy against full-text accuracy before assuming the approach is viable.

8.1 Multilingual considerations

The ACS, NHIS, and other major federal surveys collect responses in multiple languages. Spanish-language responses are particularly common in industry and occupation items. LLM accuracy varies significantly by language. Models trained predominantly on English text perform worse on Spanish, Mandarin, Vietnamese, and other languages that appear in federal survey data.

Non-English responses require separate evaluation. Recent text-classification research shows that LLM classifiers often perform materially worse in lower-resource languages, with cross-language accuracy gaps of 10 to 40 percentage points depending on the setting (Batatia et al., 2025). Do not assume English-language accuracy generalizes to Spanish or other languages represented in federal survey data. Depending on the language distribution of your survey population, you may need separate prompts, separate evaluation datasets, or separate deployment decisions for each language. Some languages may require a different model entirely.

9. Designing a hybrid human-LLM workflow

The hybrid workflow is almost always better than either human-only or LLM-only approaches. The core mechanism is confidence-based routing: auto-accept LLM assignments above a confidence threshold, route everything below the threshold to human review.

Most LLM APIs can return log-probabilities alongside the text response, which can be used to construct a confidence score. Higher log-probability responses are more likely to be correct. The threshold analysis in examples/chapter-12/08_hybrid_workflow.py shows that at approximately the 0.85 to 0.90 confidence threshold (on simulated data), accuracy on auto-accepted records exceeds 95 percent while keeping the human review queue at 25 to 35 percent of total volume.
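
One possible construction of the confidence score, offered as an assumption rather than the chapter's method, is the joint probability of the tokens that make up the returned code: sum their log-probabilities and exponentiate. How log-probabilities are requested and returned differs by provider and is not shown.

```python
import math


def confidence_from_logprobs(token_logprobs):
    """Joint probability of the response tokens: exp(sum of log-probabilities)."""
    return math.exp(sum(token_logprobs))


# Hypothetical log-probabilities for the two tokens of a code such as "5" and "2".
confidence = confidence_from_logprobs([math.log(0.95), math.log(0.92)])
```

Scores built this way are not guaranteed to be calibrated, so the threshold should be chosen empirically on the evaluation dataset rather than read as a literal probability of being correct.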

The recommended production workflow for federal statistics:

  1. LLM codes all records and assigns confidence scores

  2. Records at confidence >= threshold: auto-accept the LLM code

  3. Records at moderate confidence (below threshold): route to human review with LLM suggestion visible

  4. Records flagged “UNCLEAR”: full human coding (LLM output optionally shown as reference)

  5. Five to ten percent random sample of auto-accepted records: human audit to monitor accuracy drift
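
The five steps above can be sketched as a single routing function. The function name, the route labels, and the default values (a 0.85 threshold and a 5 percent audit rate, both within the ranges given in this section) are illustrative assumptions.

```python
import random


def route_record(parsed_code, confidence, threshold=0.85, audit_rate=0.05, rng=random):
    """Decide what happens to one LLM-coded record.

    parsed_code is None when the model answered "UNCLEAR" or could not be parsed.
    """
    if parsed_code is None:
        return "full_human_coding"
    if confidence < threshold:
        return "human_review_with_suggestion"
    # Random audit sample drawn from the auto-accepted records.
    if rng.random() < audit_rate:
        return "auto_accept_and_audit"
    return "auto_accept"


rng = random.Random(0)  # seeded so that audit sampling is itself reproducible
routes = [
    route_record("52", 0.97, rng=rng),
    route_record("54", 0.60, rng=rng),
    route_record(None, 0.0, rng=rng),
]
```

Passing a seeded generator makes the audit sample repeatable, which matters if the audit selection is ever reviewed.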

The threshold itself is a policy decision that depends on the consequence of error for your specific program. A two-digit NAICS code used for descriptive tabulations tolerates more error than a six-digit code used for regulatory classification.

10. When to use LLMs and when not to

The decision of whether to use LLM coding for a specific application depends on volume, accuracy requirements, data sensitivity, and available infrastructure. LLMs have clear advantages for high-volume tasks with moderate classification depth and informal input text. They have clear disadvantages for fine-grained classification (6-digit NAICS), legally sensitive applications where audit trails are required, and highly sensitive data where on-premise infrastructure is not available.

| Use case | Use LLM | Use traditional / human |
| --- | --- | --- |
| Volume | High (> 100K records) | Low (manual feasible) |
| Ambiguity level | Moderate (needs judgment) | Low (clear rules exist) |
| Required consistency | Moderate | High (legal / regulatory) |
| Classification depth | 2-digit sector | 6-digit industry |
| Data sensitivity | De-identified or FedRAMP | High PII: on-prem only |
| Multilingual | Consider separate evaluation | Rule-based may fail |
| First pass vs. final | First pass + human review | Final coded output |

For a structured evaluation tool covering all relevant dimensions, see the 10-dimension rubric in Chapter 14.

11. In-class activity

See examples/chapter-12/09_activity.py for starter code with TODO markers.

12. Key takeaways for survey methodology