Predict continuous outcomes and categorical labels from demographic features, and learn to evaluate both honestly.
Full runnable code for all examples is in
examples/chapter-01/.
Learning goals¶
Tell classification from regression by the type of target variable.
Understand synthetic survey data with realistic demographic features.
Make train and test splits, including stratified splits for imbalanced outcomes.
Interpret a linear regression model fitted to predict income and read its coefficients.
Read a logistic regression model fitted to predict nonresponse and evaluate it with multiple metrics.
Read a confusion matrix and ROC curve without confusion.
Know what to say when someone asks “how good is the model?”
1. What is supervised learning¶
Supervised learning is the branch of machine learning where you train a model on labeled examples and then ask it to predict the label for new, unlabeled records.
The rule of thumb is simple: if y is real-valued and measured on a continuous scale, reach for regression. If y is a discrete label -- responded / did not respond, low / medium / high income bracket, paper / web / phone mode of response -- reach for classification.
Federal statistics produces both kinds of targets constantly. Income, hours worked, and count-based outcomes (number of children, number of jobs) are continuous. Response propensity, data quality flags, and program eligibility categories are discrete. Knowing which family of methods applies before you open a dataset is the first judgment call a reviewer should make.
Evaluation mindset¶
Your job is not to build these models. Your job is to evaluate whether they were built correctly and whether the results can be trusted for the purpose claimed. When a vendor, contractor, or internal analyst hands you a model report, the following questions should guide your review:
What was the target variable, and was the choice of regression vs. classification appropriate for it?
What features were used as inputs, and could any of them encode protected characteristics by proxy?
How was performance measured, and is that metric appropriate for this outcome and population?
Was the model evaluated on data it had not seen during training?
Was performance checked separately for population subgroups that matter for equity or coverage?
What assumptions does the model make about the relationship between inputs and the target?
The rest of this chapter builds your technical vocabulary for each of these questions.
2. Data: synthetic ACS-like survey records¶
The examples in this chapter use a synthetic dataset that mimics an ACS-style person record file. Every record is generated by a known statistical process, not drawn from any real respondent. This makes the dataset portable -- no downloads or API keys required -- and lets you compare model output to the known data-generating process.
The dataset contains 1,200 person records with these columns:
| Column | Description |
|---|---|
| state | One of five states, with Illinois overrepresented (25%) |
| age | Drawn from a normal distribution centered at 42, clipped to 18--80 |
| education_years | Discrete: 9, 12, 14, 16, or 18 years |
| hours_per_week | Hours worked, centered at 38, clipped to 0--80 |
| urban | 1 = urban tract (72% of records) |
| contact_attempts | Poisson-distributed contact attempts, clipped to 1--7 |
| prior_response | 1 if the person responded in the prior survey cycle (68%) |
| income | Log-normal, driven by education, age, and hours worked |
| responded | Binary: 1 = responded to this survey |
Income is deliberately log-normal because wage distributions are right-skewed: most earners cluster at lower values, but a long tail of high earners pulls the mean upward. Nonresponse probability is a logistic function of contact attempts and prior response history, which mirrors what survey paradata actually show.
To generate the dataset:
cd examples/chapter-01
python 01_generate_survey_data.py
The script prints the response rate and summary statistics. Inspect those numbers before fitting any model: they tell you what the data looks like before you ask the model to explain it.
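If you want a feel for how such a generator works, here is a minimal sketch of an ACS-like data-generating process matching the column table above. The distribution parameters and coefficients are assumptions for illustration, not the actual values used by 01_generate_survey_data.py:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1200

# Columns follow the table above; all parameters here are illustrative guesses
state = rng.choice(["IL", "CA", "TX", "NY", "FL"], size=n,
                   p=[0.25, 0.1875, 0.1875, 0.1875, 0.1875])
age = np.clip(rng.normal(42, 12, n), 18, 80)
education_years = rng.choice([9, 12, 14, 16, 18], size=n)
hours_per_week = np.clip(rng.normal(38, 10, n), 0, 80)
urban = rng.binomial(1, 0.72, n)
contact_attempts = np.clip(rng.poisson(2.5, n), 1, 7)
prior_response = rng.binomial(1, 0.68, n)

# Log-normal income driven by education, age, and hours worked
income = np.exp(9.5 + 0.08 * education_years + 0.01 * age
                + 0.01 * hours_per_week + rng.normal(0, 0.4, n))

# Response propensity: logistic in contact attempts and prior response
logit = 0.5 - 0.4 * contact_attempts + 1.2 * prior_response
responded = rng.binomial(1, 1 / (1 + np.exp(-logit)))

df = pd.DataFrame({
    "state": state, "age": age, "education_years": education_years,
    "hours_per_week": hours_per_week, "urban": urban,
    "contact_attempts": contact_attempts, "prior_response": prior_response,
    "income": income, "responded": responded,
})
print(f"Response rate: {df['responded'].mean():.1%}")
print(df["income"].describe().round(0))
```

Because every column comes from a known process, you can later check whether a fitted model recovers the relationships you built in.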
3. The train/test split¶
Before fitting any model, divide the data into a training set (used to fit the model) and a test set (used to estimate how well it generalizes). The test set is data the model has never seen. Reporting performance on training data is not performance evaluation -- it is memorization measurement.
A typical split is 80% training, 20% test. The split must be done before any model fitting. If you use the test set to make any decisions -- adjusting features, tuning parameters, choosing between models -- it is no longer an honest estimate of out-of-sample performance.
from sklearn.model_selection import train_test_split
# 80% train, 20% test; random_state fixes the shuffle for reproducibility
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
X_reg, y_reg, test_size=0.20, random_state=42
)
# For classification, use stratify=y to preserve the response rate in both splits
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(
X_clf, y_clf, test_size=0.20, random_state=42, stratify=y_clf
)4. Regression: predicting income¶
Income is a continuous variable. The goal is to predict it from age, education, hours worked, and urban status. A linear regression model assumes that income is a weighted sum of those features plus an intercept. Each weight (coefficient) represents the estimated change in predicted income per one-unit increase in the corresponding feature, holding all other features constant.
4.1 Fit a linear regression model¶
Fitting the model in scikit-learn requires three lines:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
reg = LinearRegression()
reg.fit(X_reg_train, y_reg_train)
y_reg_pred = reg.predict(X_reg_test)
Running 02_regression_income.py fits this model and prints MAE, MSE, and R².
4.2 Examine the coefficients¶
The coefficient table is the most directly interpretable output of a linear regression. A positive coefficient means the feature is associated with higher predicted income; a negative one means the opposite.
For this synthetic dataset, education has the largest coefficient by design: each additional year of education adds roughly $1,000-2,000 to predicted annual income. Age and hours worked contribute smaller increments. Urban status contributes a positive premium (urban workers in this dataset tend to have higher incomes).
Caution: these are correlations, not causal estimates. Multiple confounders exist, and a real analysis would report coefficients with confidence intervals and discuss limitations explicitly.
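To see how a coefficient table is read in practice, here is a self-contained sketch on toy data with known weights (the feature names match the chapter's dataset, but the dollar values are assumptions chosen for the example):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.normal(42, 12, n),
    "education_years": rng.choice([9, 12, 14, 16, 18], n).astype(float),
    "hours_per_week": rng.normal(38, 10, n),
    "urban": rng.binomial(1, 0.72, n).astype(float),
})
# Toy target with known weights (assumed values, not the chapter's dataset):
# $1,500 per education year, $200 per year of age, $300 per weekly hour
y = (5_000 + 1_500 * X["education_years"] + 200 * X["age"]
     + 300 * X["hours_per_week"] + 4_000 * X["urban"]
     + rng.normal(0, 8_000, n))

reg = LinearRegression().fit(X, y)
coef_table = pd.Series(reg.coef_, index=X.columns)
print(coef_table.round(0))   # recovers roughly the weights used above
```

Because the true weights are known here, you can verify that each fitted coefficient lands near the value used to generate the data -- the kind of sanity check a synthetic dataset makes possible.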
4.3 Diagnostic plots¶
Two plots tell you whether the linear model fits the data reasonably:
Residual plot: Plot residuals (true minus predicted) on the vertical axis against predicted values on the horizontal axis. A well-fitting model shows points scattered randomly around zero. A fan shape (residuals grow larger at higher predicted values) signals that the model’s errors are larger for high-income individuals -- common with right-skewed distributions. A curve signals a non-linear relationship the model is missing.
Parity plot: Plot true values on the horizontal axis and predicted values on the vertical axis. A perfect model’s points would lie on the 45-degree diagonal. Systematic offset above or below the diagonal indicates bias. Wide scatter indicates high variance.
Both plots are produced by 02_regression_income.py and saved as PNG files.
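If you want the same diagnostics in numbers rather than pictures, the following sketch (illustrative stand-in data, not the chapter's predictions) computes what the two plots would show:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in truths and predictions for a right-skewed target (illustrative only)
y_true = np.exp(rng.normal(10.8, 0.5, 300))
y_pred = y_true * (1 + rng.normal(0, 0.2, 300))

residuals = y_true - y_pred
# Residual-plot check in numbers: if |residual| grows with the prediction,
# the scatter would fan out on the plot (heteroscedastic errors)
fan = np.corrcoef(y_pred, np.abs(residuals))[0, 1]
print(f"corr(|residual|, prediction) = {fan:.2f}")

# Parity-plot check in numbers: mean offset from the 45-degree diagonal
bias = float(np.mean(y_pred - y_true))
print(f"mean (predicted - true) = {bias:,.0f}")
```

A clearly positive correlation between |residual| and prediction is the numeric signature of the fan shape described above; a mean offset far from zero is the numeric signature of systematic bias on the parity plot.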
4.4 How split choice affects accuracy¶
The metrics you report depend partly on which records ended up in the test set. This is not a flaw -- it is a property of finite data. Running the split with 30 different random seeds and measuring the variability of MAE shows you how much your reported number could shift under a different split.
The sensitivity analysis in 02_regression_income.py sweeps across test sizes of 10%, 20%, and 30% and reports mean and standard deviation of MAE across 30 seeds. Smaller test sets produce noisier estimates because you are averaging over fewer predictions.
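The same sensitivity check can be sketched in a few lines. This version uses toy data rather than the survey file, but the structure -- sweep test sizes, re-split with 30 seeds, report mean and standard deviation of MAE -- mirrors what the script does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1200
X = rng.normal(size=(n, 4))                       # toy features
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 1.0, n)

for test_size in (0.10, 0.20, 0.30):
    maes = []
    for seed in range(30):                        # 30 different splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
        maes.append(mean_absolute_error(y_te, pred))
    print(f"test_size={test_size:.0%}: "
          f"MAE {np.mean(maes):.3f} +/- {np.std(maes):.3f}")
```

The spread across seeds is the honest error bar on any single reported MAE.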
4.5 Regularization: Ridge and Lasso¶
When features correlate with each other, a plain linear regression can place unreasonably large weights on individual features. Regularization adds a penalty to the loss function that shrinks coefficients toward zero.
In practice:
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=100).fit(X_train, y_train)
lasso = Lasso(alpha=50, max_iter=10000).fit(X_train, y_train)
The alpha parameter controls how much shrinkage is applied. Larger alpha means more shrinkage. 02_regression_income.py compares all three models side by side.
What to look for in a regression report¶
When an analyst or vendor presents regression results, work through this checklist before accepting the findings:
What features were used? Are any of them proxies for protected characteristics (race, sex, national origin)?
What is the R²? A high R² on training data means nothing; R² on the held-out test set is what matters.
Were residuals checked? A fan-shaped residual plot signals the model is systematically worse for some income ranges.
Was the split stratified or grouped appropriately for the population structure (households, clusters)?
Were coefficients reported with confidence intervals, or only point estimates?
Were subgroup results reported? A model with a good overall MAE may still be biased for rural, low-income, or minority subgroups.
Are the coefficient signs plausible? If education has a negative income coefficient, something is wrong.
5. Classification: predicting nonresponse¶
Whether a sampled unit responds is a binary outcome. Predicting it in advance allows field staff to prioritize follow-up contacts. High contact attempts and absent prior response history predict nonresponse; urban residents and habitual respondents are more likely to respond.
5.1 Fit a logistic regression model¶
Despite its name, logistic regression is a classification model. It predicts the probability of the positive class (responded = 1). A threshold -- default 0.5 -- converts that probability into a class label.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=500)
clf.fit(X_clf_train, y_clf_train)
y_clf_pred = clf.predict(X_clf_test) # class labels (0 or 1)
y_clf_proba = clf.predict_proba(X_clf_test)[:, 1] # probability of responding
5.2 Classification metrics¶
Five metrics characterize a binary classifier. Each tells a different story:
Accuracy: share of all predictions that are correct.
Precision: among records predicted positive, the share that truly are positive.
Recall: among truly positive records, the share the model catches.
F1: the harmonic mean of precision and recall.
AUC: the probability that the model ranks a random positive above a random negative.
For imbalanced outcomes -- which is typical of nonresponse -- accuracy is the least informative metric. A model that predicts “everyone responds” achieves high accuracy but zero recall for nonrespondents. F1 and AUC are better starting points.
Running 03_classification_nonresponse.py prints all five metrics for the default threshold and produces confusion matrix and ROC curve figures.
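A minimal sketch of all five metrics on a toy imbalanced outcome shows why accuracy alone misleads (the data here is invented, not the chapter's test set):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Toy imbalanced outcome: 80% responders
y_true = np.array([1] * 80 + [0] * 20)

# The lazy baseline: predict "everyone responds"
y_lazy = np.ones(100, dtype=int)
print("lazy accuracy:", accuracy_score(y_true, y_lazy))            # 0.8
print("lazy recall, nonresponse class:",
      recall_score(y_true, y_lazy, pos_label=0))                   # 0.0

# A model with probability scores, so AUC is defined
rng = np.random.default_rng(0)
y_proba = np.clip(0.35 * y_true + rng.uniform(0, 0.5, 100), 0, 1)
y_pred = (y_proba >= 0.5).astype(int)
for name, value in [("accuracy", accuracy_score(y_true, y_pred)),
                    ("precision", precision_score(y_true, y_pred)),
                    ("recall", recall_score(y_true, y_pred)),
                    ("f1", f1_score(y_true, y_pred)),
                    ("auc", roc_auc_score(y_true, y_proba))]:
    print(f"{name}: {value:.3f}")
```

The lazy baseline scores 80% accuracy while catching zero nonrespondents -- exactly the failure mode the paragraph above warns about.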
5.3 Confusion matrix and ROC curve¶
The confusion matrix shows where the model makes each type of error:
| | Predicted 0 | Predicted 1 |
|---|---|---|
| True 0 (nonresponse) | True Negative (TN) | False Positive (FP) |
| True 1 (response) | False Negative (FN) | True Positive (TP) |
In a nonresponse targeting system, False Positives -- predicted to respond but actually did not -- mean missed field contacts: these households needed follow-up that was never scheduled. False Negatives -- predicted not to respond but would have responded anyway -- waste field budget on unnecessary follow-up. The right trade-off depends on the relative cost of each error type.
The ROC curve plots the true positive rate (recall) against the false positive rate at every possible threshold. It summarizes the trade-off between catching true positives and raising false alarms in a single curve. A model no better than random produces the 45-degree diagonal. AUC is the area under that curve.
5.4 Threshold sensitivity¶
The default threshold of 0.5 is rarely optimal. Lowering the threshold to 0.35 flags more records as likely nonrespondents (higher recall) but also generates more false alarms (lower precision). The threshold choice is ultimately a budget and policy decision.
The pre-computed results below show how metrics shift across thresholds for the synthetic dataset:
threshold accuracy precision recall f1
0.30 0.612 0.768 0.726 0.746
0.35 0.643 0.782 0.710 0.744
0.40 0.674 0.802 0.688 0.741
0.45 0.693 0.820 0.669 0.737
0.50 0.710 0.838 0.651 0.732
0.55 0.719 0.858 0.627 0.725
0.60 0.725 0.882 0.601 0.715
0.65 0.720 0.904 0.565 0.696
0.70 0.708 0.928 0.519 0.666
As the threshold rises, precision improves and recall falls. The F1 score peaks somewhere in the middle. Your agency’s field operations staff should weigh in on what matters more: catching more nonrespondents at the cost of unnecessary contacts, or conserving field budget at the cost of missed follow-ups.
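Regenerating a table like this from predicted probabilities takes one loop. The sketch below uses invented scores, so its numbers will not match the pre-computed table, but the recall column will fall and the flagging pattern will shift the same way as the threshold rises:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
# Stand-in scores: responders get systematically higher probabilities
y_true = rng.binomial(1, 0.7, 240)
y_proba = np.clip(rng.normal(0.35 + 0.30 * y_true, 0.20, 240), 0, 1)

print("threshold accuracy precision recall f1")
for t in [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]:
    y_pred = (y_proba >= t).astype(int)   # convert probability to class label
    print(f"{t:.2f}      {accuracy_score(y_true, y_pred):.3f}    "
          f"{precision_score(y_true, y_pred, zero_division=0):.3f}     "
          f"{recall_score(y_true, y_pred):.3f}  "
          f"{f1_score(y_true, y_pred, zero_division=0):.3f}")
```

Raising the threshold can only remove predicted positives, so recall is guaranteed to be non-increasing down the table -- a useful consistency check when reviewing a vendor's threshold report.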
5.5 Interpreting logistic regression coefficients¶
Logistic regression coefficients are on a log-odds scale. The odds ratio -- the exponent of the coefficient -- is easier to interpret:
Odds ratio > 1 means the feature increases the probability of the positive class (responded).
Odds ratio < 1 means it decreases it.
For example, an odds ratio of roughly 2.0 for prior_response means people who responded in the prior cycle are about twice as likely (in odds terms) to respond again, all else equal. Survey researchers call this “habitual respondents,” and it is one of the most reliable predictors in any nonresponse model (Groves & Couper, 1998).
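The coefficient-to-odds-ratio conversion is one line of numpy. This sketch builds toy paradata where prior respondents have a true log-odds bump of 0.7 (an assumed value, chosen so the odds ratio lands near the 2.0 discussed above) and checks that the fitted model recovers it:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
prior_response = rng.binomial(1, 0.68, n)
contact_attempts = np.clip(rng.poisson(2.5, n), 1, 7)

# True log-odds: +0.7 for prior respondents => odds ratio ~ e^0.7 ~ 2.0
logit = 0.3 + 0.7 * prior_response - 0.3 * contact_attempts
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = pd.DataFrame({"prior_response": prior_response,
                  "contact_attempts": contact_attempts})
clf = LogisticRegression(max_iter=500).fit(X, y)

# Odds ratio = exp(coefficient); >1 raises response probability, <1 lowers it
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=X.columns)
print(odds_ratios.round(2))
```

The contact_attempts odds ratio comes out below 1, matching the section's point that more contact attempts signal lower response propensity.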
What to look for in a classification report¶
When reviewing a classification model for nonresponse or any binary survey outcome:
Is the metric appropriate for class imbalance? Accuracy alone is not sufficient.
What threshold was used, and how was it chosen? If it was not reported, ask.
How do precision and recall trade off at the chosen threshold? Which direction of error is more costly?
Was the split stratified to preserve class balance in both halves?
Were subgroup results reported? A model that works well on average may miss high-nonresponse subgroups (rural areas, non-English speakers, hard-to-reach demographics).
Does the confusion matrix pattern make sense? Off-diagonal errors should be interpretable in operational terms.
Were any features used that could create disparate impact on protected groups?
6. From regression to categories: income brackets¶
Sometimes a continuous prediction is binned into categories for reporting or program eligibility. Income brackets -- low, middle, high -- appear throughout federal statistics for exactly this reason.
Binning continuous income into three categories and treating the result as a multi-class classification problem illustrates both the mechanics of multi-class modeling and the trade-offs involved in discretization.
The synthetic dataset bins income as follows:
| Bracket | Range | Approximate share |
|---|---|---|
| Low | < $40,000 | ~25% |
| Middle | $40,000 -- $90,000 | ~50% |
| High | > $90,000 | ~25% |
A multi-class logistic regression fits one set of coefficients per class and predicts the bracket with the highest probability.
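A minimal sketch of the bin-then-classify workflow, using toy incomes (the generator here is an assumption for illustration) and the bracket boundaries from the table above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1200
education_years = rng.choice([9, 12, 14, 16, 18], n)
# Toy incomes driven by education (assumed generator, not the chapter's script)
income = np.exp(10.0 + 0.08 * education_years + rng.normal(0, 0.4, n))

# Bin continuous income into the policy brackets from the table above
bracket = pd.cut(income, bins=[-np.inf, 40_000, 90_000, np.inf],
                 labels=["low", "middle", "high"]).astype(str)

X = education_years.reshape(-1, 1).astype(float)
clf = LogisticRegression(max_iter=1000).fit(X, bracket)
print(clf.classes_)                    # one coefficient vector per class
print(clf.predict([[9.0], [18.0]]))   # predicted bracket = highest probability
```

Note that `pd.cut` does the discretization step explicitly, which makes the boundary values auditable -- a point the next subsection returns to.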
When discretization is appropriate -- and when it loses information¶
Discretizing a continuous variable is appropriate when:
Output categories are required by policy or reporting standards (poverty status, program eligibility thresholds)
Stakeholders need categorical labels for operational decisions
You are comparing subgroup membership across populations and bracket assignment is the unit of analysis
Discretization loses information when:
The continuous variable has meaningful variation within a bracket (a household at $39,000 and a household at $41,000 receive opposite labels despite near-identical incomes)
Downstream analysis needs the continuous value (household income totals, inequality metrics)
The bracket boundaries are arbitrary and could reasonably be drawn differently
The general rule: keep continuous variables continuous as long as possible and discretize only at the reporting or decision-making step. When brackets are policy-defined, document the boundary values explicitly and flag cases near the boundary.
6.1 Multi-class metrics: macro vs. weighted averaging¶
When a target has more than two classes, precision, recall, and F1 must be averaged across classes. Two averaging strategies exist:
Macro averaging computes the metric separately for each class and takes the unweighted mean, so small classes count as much as large ones and poor performance on a minority bracket shows up clearly.
Weighted averaging weights each class's metric by its share of records, so the dominant class drives the result.
The confusion matrix for three classes is a 3x3 grid. The diagonal cells are correct predictions; off-diagonal cells are errors. Looking at which adjacent bracket gets confused most often (low vs. middle, not low vs. high) reveals whether the model is making plausible near-boundary errors or systematic misclassification.
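The gap between the two averaging strategies is easy to demonstrate on toy labels shaped like the ~25/50/25 bracket shares (invented data, not the chapter's model):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Imbalanced 3-class toy: middle dominates, like the bracket shares above
y_true = np.array(["low"] * 60 + ["middle"] * 120 + ["high"] * 60)
y_pred = y_true.copy()
y_pred[:30] = "middle"        # confuse half of "low" with the adjacent "middle"

print("macro F1:   ", round(f1_score(y_true, y_pred, average="macro"), 3))
print("weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 3))
# 3x3 grid: diagonal = correct; the only off-diagonal mass is low -> middle,
# a plausible near-boundary error
print(confusion_matrix(y_true, y_pred, labels=["low", "middle", "high"]))
```

Macro F1 comes out lower than weighted F1 because the error is concentrated in the small "low" class -- exactly the kind of subgroup weakness that a weighted average can hide.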
7. Quick reference¶
8. Evaluate this output¶
These exercises ask you to interpret pre-computed results, not to run code. Work through the questions first; then run examples/chapter-01/05_exercises.py to verify your reasoning.
8.1 Interpret a regression report¶
A colleague fits a linear regression to predict income using only age and hours worked (two features, 90/10 split). They report:
MAE: $18,420
R²: 0.138
Questions:
Is R² = 0.138 good or bad? What would R² = 0 mean?
The four-feature model in Section 4 achieves R² around 0.38. What does the gap tell you?
Why might a 90/10 split produce noisier metrics than an 80/20 split?
If you were reviewing this model for production use in income imputation, what would you want to know before approving it?
Discussion
R² = 0.138 means the model explains about 14% of income variance -- well below the four-feature model. The missing features (education and urban status) carry substantial predictive power. A 90/10 split produces a smaller test set, so each individual prediction error has more influence on the reported metric. Before approving for production: check residuals for patterns, compare performance across education and urban subgroups, ask whether the two included features are proxies for anything protected.
8.2 Interpret a threshold decision¶
A field operations supervisor is reviewing the nonresponse model. The model team has run the threshold sensitivity analysis and reported this table:
threshold precision recall f1
0.35 0.782 0.710 0.744
0.50 0.838 0.651 0.732
0.65 0.904 0.565 0.696
The supervisor has a field budget that allows for 400 follow-up contacts in the 240-record test set.
Questions:
At threshold 0.35, is the model flagging too many, too few, or about the right number of records for 400 contacts?
The supervisor says “I want to catch as many nonrespondents as possible.” Which threshold would you recommend?
The budget analyst says “minimize unnecessary contacts.” Which threshold would you recommend?
What would you tell both of them about the trade-off?
Discussion
At threshold 0.35 the model flags more records (higher recall), likely exceeding 400 contacts; at 0.65 it flags fewer. The supervisor maximizing recall wants 0.35. The budget analyst minimizing unnecessary contacts wants 0.65. The right answer is a conversation about the cost ratio: how many unnecessary contacts is one missed nonrespondent worth? That is an operational judgment, not a statistical one. The model team’s job is to surface the trade-off clearly, not to make the policy decision.
8.3 Spot the error in a confusion matrix narrative¶
A report states: “Our nonresponse model achieved 91% accuracy on the test set. We are confident it correctly identifies nonrespondents.”
The confusion matrix for the test set (240 records) is:
Predicted: Did not respond Predicted: Responded
True: Did not respond 18 22
True: Responded 1 199
Questions:
Calculate the accuracy from the confusion matrix. Does it match 91%?
Among true nonrespondents (40 records), how many did the model correctly flag?
What is the recall for the nonresponse class?
Why is the accuracy number misleading here?
What would you write in your review memo?
Discussion
Accuracy = (18 + 199) / 240 = 90.4%, close to the reported 91%. But recall for nonrespondents = 18 / 40 = 45%. The model misses more than half of true nonrespondents. Accuracy is high because 200 of 240 test records are responders; a model that predicted “everyone responds” would achieve 83% accuracy with zero nonrespondent recall. The review memo should flag that recall for the target class (nonrespondents) is the operationally relevant metric, and that 45% recall is not sufficient for a follow-up targeting system.
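A quick sketch to verify the arithmetic in this discussion directly from the four cells of the matrix:

```python
# Cells from the confusion matrix above (positive class = responded)
tn, fp = 18, 22     # true nonrespondents: correctly flagged vs. missed
fn, tp = 1, 199     # true respondents
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
nonresponse_recall = tn / (tn + fp)           # recall for the nonresponse class
always_respond_baseline = (fn + tp) / total   # accuracy of "everyone responds"

print(f"accuracy: {accuracy:.1%}")                          # 90.4%
print(f"nonresponse recall: {nonresponse_recall:.0%}")      # 45%
print(f"baseline accuracy: {always_respond_baseline:.1%}")  # 83.3%
```

Three lines of arithmetic are enough to show that the headline accuracy sits only seven points above a model that does nothing at all.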
8.4 Optional: run the code¶
If you want to verify the numbers above or extend any analysis, all scripts are in examples/chapter-01/. Run them in order (01 generates the data; 02, 03, 04 analyze it; 05 contains exercise solutions).