Predict continuous outcomes and categorical labels from demographic features, and learn to evaluate both honestly.
Full runnable code for all examples is in
examples/chapter-01/.
Learning goals¶
Tell classification from regression by the type of target variable.
Understand synthetic survey data with realistic demographic features.
Make train and test splits, including stratified splits for imbalanced outcomes.
Interpret a linear regression model fitted to predict income and read its coefficients.
Read a logistic regression model fitted to predict nonresponse and evaluate it with multiple metrics.
Read a confusion matrix and ROC curve without confusion.
Know what to say when someone asks “how good is the model?”
1. What is supervised learning¶
Supervised learning is the branch of machine learning where you train a model on labeled examples and then ask it to predict the label for new, unlabeled records.
The rule of thumb is simple: if y is real-valued and measured on a continuous scale, reach for regression. If y is a discrete label -- responded / did not respond, low / medium / high income bracket, paper / web / phone mode of response -- reach for classification.
Federal statistics produces both kinds of targets constantly. Income, hours worked, and count-based outcomes (number of children, number of jobs) are continuous. Response propensity, data quality flags, and program eligibility categories are discrete. Knowing which family of methods applies before you open a dataset is the first judgment call a reviewer should make.
Evaluation mindset¶
Your job is not to build these models. Your job is to evaluate whether they were built correctly and whether the results can be trusted for the purpose claimed. When a vendor, contractor, or internal analyst hands you a model report, the following questions should guide your review:
What was the target variable, and was the choice of regression vs. classification appropriate for it?
What features were used as inputs, and could any of them encode protected characteristics by proxy?
How was performance measured, and is that metric appropriate for this outcome and population?
Was the model evaluated on data it had not seen during training?
Was performance checked separately for population subgroups that matter for equity or coverage?
What assumptions does the model make about the relationship between inputs and the target?
The rest of this chapter builds your technical vocabulary for each of these questions.
2. Data: synthetic ACS-like survey records¶
The examples in this chapter use a synthetic dataset that mimics an ACS-style person record file. Every record is generated by a known statistical process, not drawn from any real respondent. This makes the dataset portable -- no downloads or API keys required -- and lets you compare model output to the known data-generating process.
The dataset contains 1,200 person records with these columns:
| Column | Description |
|---|---|
| state | One of five states, with Illinois overrepresented (25%) |
| age | Drawn from a normal distribution centered at 42, clipped to 18--80 |
| education_years | Discrete: 9, 12, 14, 16, or 18 years |
| hours_per_week | Hours worked, centered at 38, clipped to 0--80 |
| urban | 1 = urban tract (72% of records) |
| contact_attempts | Poisson-distributed contact attempts, clipped to 1--7 |
| prior_response | 1 if the person responded in the prior survey cycle (68%) |
| income | Log-normal, driven by education, age, and hours worked |
| responded | Binary: 1 = responded to this survey |
Income is deliberately log-normal because wage distributions are right-skewed: most earners cluster at lower values, but a long tail of high earners pulls the mean upward. Nonresponse probability is a logistic function of contact attempts and prior response history, which mirrors what survey paradata actually show.
To generate the dataset:
cd examples/chapter-01
python 01_generate_survey_data.py
The script prints the response rate and summary statistics. Inspect those numbers before fitting any model: they tell you what the data looks like before you ask the model to explain it.
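If you want a feel for how such a generator works, here is a minimal sketch of an ACS-like data-generating process matching the column table above. The distribution parameters and coefficients are assumptions for illustration, not the actual values used by 01_generate_survey_data.py:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1200

# Columns follow the table above; all parameters here are illustrative guesses
state = rng.choice(["IL", "CA", "TX", "NY", "FL"], size=n,
                   p=[0.25, 0.1875, 0.1875, 0.1875, 0.1875])
age = np.clip(rng.normal(42, 12, n), 18, 80)
education_years = rng.choice([9, 12, 14, 16, 18], size=n)
hours_per_week = np.clip(rng.normal(38, 10, n), 0, 80)
urban = rng.binomial(1, 0.72, n)
contact_attempts = np.clip(rng.poisson(2.5, n), 1, 7)
prior_response = rng.binomial(1, 0.68, n)

# Log-normal income driven by education, age, and hours worked
income = np.exp(9.5 + 0.08 * education_years + 0.01 * age
                + 0.01 * hours_per_week + rng.normal(0, 0.4, n))

# Response propensity: logistic in contact attempts and prior response
logit = 0.5 - 0.4 * contact_attempts + 1.2 * prior_response
responded = rng.binomial(1, 1 / (1 + np.exp(-logit)))

df = pd.DataFrame({
    "state": state, "age": age, "education_years": education_years,
    "hours_per_week": hours_per_week, "urban": urban,
    "contact_attempts": contact_attempts, "prior_response": prior_response,
    "income": income, "responded": responded,
})
print(f"Response rate: {df['responded'].mean():.1%}")
print(df["income"].describe().round(0))
```

Because every column comes from a known process, you can later check whether a fitted model recovers the relationships you built in.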
3. The train/test split¶
Before fitting any model, divide the data into a training set (used to fit the model) and a test set (used to estimate how well it generalizes). The test set is data the model has never seen. Reporting performance on training data is not performance evaluation -- it is memorization measurement.
A typical split is 80% training, 20% test. The split must be done before any model fitting. If you use the test set to make any decisions -- adjusting features, tuning parameters, choosing between models -- it is no longer an honest estimate of out-of-sample performance.
from sklearn.model_selection import train_test_split
# 80% train, 20% test; random_state fixes the shuffle for reproducibility
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
X_reg, y_reg, test_size=0.20, random_state=42
)
# For classification, use stratify=y to preserve the response rate in both splits
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(
X_clf, y_clf, test_size=0.20, random_state=42, stratify=y_clf
)4. Regression: predicting income¶
Income is a continuous variable. The goal is to predict it from age, education, hours worked, and urban status. A linear regression model assumes that income is a weighted sum of those features plus an intercept. Each weight (coefficient) represents the estimated change in predicted income per one-unit increase in the corresponding feature, holding all other features constant.
4.1 Fit a linear regression model¶
Fitting the model in scikit-learn requires three lines:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
reg = LinearRegression()
reg.fit(X_reg_train, y_reg_train)
y_reg_pred = reg.predict(X_reg_test)
Running 02_regression_income.py fits this model and prints MAE, MSE, and R².
4.2 Examine the coefficients¶
The coefficient table is the most directly interpretable output of a linear regression. A positive coefficient means the feature is associated with higher predicted income; a negative one means the opposite.
For this synthetic dataset, education has the largest coefficient by design: each additional year of education adds roughly $1,000-2,000 to predicted annual income. Age and hours worked contribute smaller increments. Urban status contributes a positive premium (urban workers in this dataset tend to have higher incomes).
Caution: these are correlations, not causal estimates. Multiple confounders exist, and a real analysis would report coefficients with confidence intervals and discuss limitations explicitly.
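To see how a coefficient table is read in practice, here is a self-contained sketch on toy data with known weights (the feature names match the chapter's dataset, but the dollar values are assumptions chosen for the example):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.normal(42, 12, n),
    "education_years": rng.choice([9, 12, 14, 16, 18], n).astype(float),
    "hours_per_week": rng.normal(38, 10, n),
    "urban": rng.binomial(1, 0.72, n).astype(float),
})
# Toy target with known weights (assumed values, not the chapter's dataset):
# $1,500 per education year, $200 per year of age, $300 per weekly hour
y = (5_000 + 1_500 * X["education_years"] + 200 * X["age"]
     + 300 * X["hours_per_week"] + 4_000 * X["urban"]
     + rng.normal(0, 8_000, n))

reg = LinearRegression().fit(X, y)
coef_table = pd.Series(reg.coef_, index=X.columns)
print(coef_table.round(0))   # recovers roughly the weights used above
```

Because the true weights are known here, you can verify that each fitted coefficient lands near the value used to generate the data -- the kind of sanity check a synthetic dataset makes possible.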
4.3 Diagnostic plots¶
Two plots tell you whether the linear model fits the data reasonably:
Residual plot: Plot residuals (true minus predicted) on the vertical axis against predicted values on the horizontal axis. A well-fitting model shows points scattered randomly around zero. A fan shape (residuals grow larger at higher predicted values) signals that the model’s errors are larger for high-income individuals -- common with right-skewed distributions. A curve signals a non-linear relationship the model is missing.
Parity plot: Plot true values on the horizontal axis and predicted values on the vertical axis. A perfect model’s points would lie on the 45-degree diagonal. Systematic offset above or below the diagonal indicates bias. Wide scatter indicates high variance.
Both plots are produced by 02_regression_income.py and saved as PNG files.
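If you want the same diagnostics in numbers rather than pictures, the following sketch (illustrative stand-in data, not the chapter's predictions) computes what the two plots would show:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in truths and predictions for a right-skewed target (illustrative only)
y_true = np.exp(rng.normal(10.8, 0.5, 300))
y_pred = y_true * (1 + rng.normal(0, 0.2, 300))

residuals = y_true - y_pred
# Residual-plot check in numbers: if |residual| grows with the prediction,
# the scatter would fan out on the plot (heteroscedastic errors)
fan = np.corrcoef(y_pred, np.abs(residuals))[0, 1]
print(f"corr(|residual|, prediction) = {fan:.2f}")

# Parity-plot check in numbers: mean offset from the 45-degree diagonal
bias = float(np.mean(y_pred - y_true))
print(f"mean (predicted - true) = {bias:,.0f}")
```

A clearly positive correlation between |residual| and prediction is the numeric signature of the fan shape described above; a mean offset far from zero is the numeric signature of systematic bias on the parity plot.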
4.4 How split choice affects accuracy¶
The metrics you report depend partly on which records ended up in the test set. This is not a flaw -- it is a property of finite data. Running the split with 30 different random seeds and measuring the variability of MAE shows you how much your reported number could shift under a different split.
The sensitivity analysis in 02_regression_income.py sweeps across test sizes of 10%, 20%, and 30% and reports mean and standard deviation of MAE across 30 seeds. Smaller test sets produce noisier estimates because you are averaging over fewer predictions.
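The same sensitivity check can be sketched in a few lines. This version uses toy data rather than the survey file, but the structure -- sweep test sizes, re-split with 30 seeds, report mean and standard deviation of MAE -- mirrors what the script does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1200
X = rng.normal(size=(n, 4))                       # toy features
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 1.0, n)

for test_size in (0.10, 0.20, 0.30):
    maes = []
    for seed in range(30):                        # 30 different splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
        maes.append(mean_absolute_error(y_te, pred))
    print(f"test_size={test_size:.0%}: "
          f"MAE {np.mean(maes):.3f} +/- {np.std(maes):.3f}")
```

The spread across seeds is the honest error bar on any single reported MAE.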
4.5 Regularization: Ridge and Lasso¶
When features correlate with each other, a plain linear regression can place unreasonably large weights on individual features. Regularization adds a penalty to the loss function that shrinks coefficients toward zero.
In practice:
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=100).fit(X_train, y_train)
lasso = Lasso(alpha=50, max_iter=10000).fit(X_train, y_train)
The alpha parameter controls how much shrinkage is applied. Larger alpha means more shrinkage. 02_regression_income.py compares all three models side by side.
What to look for in a regression report¶
When an analyst or vendor presents regression results, work through this checklist before accepting the findings:
What features were used? Are any of them proxies for protected characteristics (race, sex, national origin)?
What is the R²? A high R² on training data means nothing; R² on the held-out test set is what matters.
Were residuals checked? A fan-shaped residual plot signals the model is systematically worse for some income ranges.
Was the split stratified or grouped appropriately for the population structure (households, clusters)?
Were coefficients reported with confidence intervals, or only point estimates?
Were subgroup results reported? A model with a good overall MAE may still be biased for rural, low-income, or minority subgroups.
Are the coefficient signs plausible? If education has a negative income coefficient, something is wrong.
5. Classification: predicting nonresponse¶
Whether a sampled unit responds is a binary outcome. Predicting it in advance allows field staff to prioritize follow-up contacts. High contact attempts and absent prior response history predict nonresponse; urban residents and habitual respondents are more likely to respond.
5.1 Fit a logistic regression model¶
Despite its name, logistic regression is a classification model. It predicts the probability of the positive class (responded = 1). A threshold -- default 0.5 -- converts that probability into a class label.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=500)
clf.fit(X_clf_train, y_clf_train)
y_clf_pred = clf.predict(X_clf_test) # class labels (0 or 1)
y_clf_proba = clf.predict_proba(X_clf_test)[:, 1] # probability of responding
5.2 Classification metrics¶
Five metrics characterize a binary classifier. Each tells a different story:
Accuracy: share of all predictions that are correct.
Precision: among records predicted positive, the share that truly are positive.
Recall: among truly positive records, the share the model catches.
F1: the harmonic mean of precision and recall.
AUC: the probability that the model ranks a random positive above a random negative.
For imbalanced outcomes -- which is typical of nonresponse -- accuracy is the least informative metric. A model that predicts “everyone responds” achieves high accuracy but zero recall for nonrespondents. F1 and AUC are better starting points.
Running 03_classification_nonresponse.py prints all five metrics for the default threshold and produces confusion matrix and ROC curve figures.
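A minimal sketch of all five metrics on a toy imbalanced outcome shows why accuracy alone misleads (the data here is invented, not the chapter's test set):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Toy imbalanced outcome: 80% responders
y_true = np.array([1] * 80 + [0] * 20)

# The lazy baseline: predict "everyone responds"
y_lazy = np.ones(100, dtype=int)
print("lazy accuracy:", accuracy_score(y_true, y_lazy))            # 0.8
print("lazy recall, nonresponse class:",
      recall_score(y_true, y_lazy, pos_label=0))                   # 0.0

# A model with probability scores, so AUC is defined
rng = np.random.default_rng(0)
y_proba = np.clip(0.35 * y_true + rng.uniform(0, 0.5, 100), 0, 1)
y_pred = (y_proba >= 0.5).astype(int)
for name, value in [("accuracy", accuracy_score(y_true, y_pred)),
                    ("precision", precision_score(y_true, y_pred)),
                    ("recall", recall_score(y_true, y_pred)),
                    ("f1", f1_score(y_true, y_pred)),
                    ("auc", roc_auc_score(y_true, y_proba))]:
    print(f"{name}: {value:.3f}")
```

The lazy baseline scores 80% accuracy while catching zero nonrespondents -- exactly the failure mode the paragraph above warns about.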
5.3 Confusion matrix and ROC curve¶
The confusion matrix shows where the model makes each type of error:
| | Predicted 0 | Predicted 1 |
|---|---|---|
| True 0 (nonresponse) | True Negative (TN) | False Positive (FP) |
| True 1 (response) | False Negative (FN) | True Positive (TP) |
In a nonresponse targeting system, False Positives -- predicted to respond but actually did not -- mean missed field contacts: these households needed follow-up that was never scheduled. False Negatives -- predicted not to respond but would have responded anyway -- waste field budget on unnecessary follow-up. The right trade-off depends on the relative cost of each error type.
The ROC curve plots the true positive rate (recall) against the false positive rate at every possible threshold. It summarizes the trade-off between catching true positives and raising false alarms in a single curve. A model no better than random produces the 45-degree diagonal. AUC is the area under that curve.
5.4 Threshold sensitivity¶
The default threshold of 0.5 is rarely optimal. Lowering the threshold to 0.35 flags more records as likely nonrespondents (higher recall) but also generates more false alarms (lower precision). The threshold choice is ultimately a budget and policy decision.
The pre-computed results below show how metrics shift across thresholds for the synthetic dataset:
threshold accuracy precision recall f1
0.30 0.612 0.768 0.726 0.746
0.35 0.643 0.782 0.710 0.744
0.40 0.674 0.802 0.688 0.741
0.45 0.693 0.820 0.669 0.737
0.50 0.710 0.838 0.651 0.732
0.55 0.719 0.858 0.627 0.725
0.60 0.725 0.882 0.601 0.715
0.65 0.720 0.904 0.565 0.696
0.70 0.708 0.928 0.519 0.666
As the threshold rises, precision improves and recall falls. The F1 score peaks somewhere in the middle. Your agency’s field operations staff should weigh in on what matters more: catching more nonrespondents at the cost of unnecessary contacts, or conserving field budget at the cost of missed follow-ups.
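Regenerating a table like this from predicted probabilities takes one loop. The sketch below uses invented scores, so its numbers will not match the pre-computed table, but the recall column will fall and the flagging pattern will shift the same way as the threshold rises:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
# Stand-in scores: responders get systematically higher probabilities
y_true = rng.binomial(1, 0.7, 240)
y_proba = np.clip(rng.normal(0.35 + 0.30 * y_true, 0.20, 240), 0, 1)

print("threshold accuracy precision recall f1")
for t in [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]:
    y_pred = (y_proba >= t).astype(int)   # convert probability to class label
    print(f"{t:.2f}      {accuracy_score(y_true, y_pred):.3f}    "
          f"{precision_score(y_true, y_pred, zero_division=0):.3f}     "
          f"{recall_score(y_true, y_pred):.3f}  "
          f"{f1_score(y_true, y_pred, zero_division=0):.3f}")
```

Raising the threshold can only remove predicted positives, so recall is guaranteed to be non-increasing down the table -- a useful consistency check when reviewing a vendor's threshold report.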
5.5 Interpreting logistic regression coefficients¶
Logistic regression coefficients are on a log-odds scale. The odds ratio -- the exponent of the coefficient -- is easier to interpret:
Odds ratio > 1 means the feature increases the probability of the positive class (responded).
Odds ratio < 1 means it decreases it.
For example, an odds ratio of roughly 2.0 for prior_response means people who responded in the prior cycle are about twice as likely (in odds terms) to respond again, all else equal. Survey researchers call this “habitual respondents,” and it is one of the most reliable predictors in any nonresponse model (Groves & Couper, 1998).
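The coefficient-to-odds-ratio conversion is one line of numpy. This sketch builds toy paradata where prior respondents have a true log-odds bump of 0.7 (an assumed value, chosen so the odds ratio lands near the 2.0 discussed above) and checks that the fitted model recovers it:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
prior_response = rng.binomial(1, 0.68, n)
contact_attempts = np.clip(rng.poisson(2.5, n), 1, 7)

# True log-odds: +0.7 for prior respondents => odds ratio ~ e^0.7 ~ 2.0
logit = 0.3 + 0.7 * prior_response - 0.3 * contact_attempts
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = pd.DataFrame({"prior_response": prior_response,
                  "contact_attempts": contact_attempts})
clf = LogisticRegression(max_iter=500).fit(X, y)

# Odds ratio = exp(coefficient); >1 raises response probability, <1 lowers it
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=X.columns)
print(odds_ratios.round(2))
```

The contact_attempts odds ratio comes out below 1, matching the section's point that more contact attempts signal lower response propensity.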
What to look for in a classification report¶
When reviewing a classification model for nonresponse or any binary survey outcome:
Is the metric appropriate for class imbalance? Accuracy alone is not sufficient.
What threshold was used, and how was it chosen? If it was not reported, ask.
How do precision and recall trade off at the chosen threshold? Which direction of error is more costly?
Was the split stratified to preserve class balance in both halves?
Were subgroup results reported? A model that works well on average may miss high-nonresponse subgroups (rural areas, non-English speakers, hard-to-reach demographics).
Does the confusion matrix pattern make sense? Off-diagonal errors should be interpretable in operational terms.
Were any features used that could create disparate impact on protected groups?
6. From regression to categories: income brackets¶
Sometimes a continuous prediction is binned into categories for reporting or program eligibility. Income brackets -- low, middle, high -- appear throughout federal statistics for exactly this reason.
Binning continuous income into three categories and treating the result as a multi-class classification problem illustrates both the mechanics of multi-class modeling and the trade-offs involved in discretization.
The synthetic dataset bins income as follows:
| Bracket | Range | Approximate share |
|---|---|---|
| Low | < $40,000 | ~25% |
| Middle | $40,000 -- $90,000 | ~50% |
| High | > $90,000 | ~25% |
A multi-class logistic regression fits one set of coefficients per class and predicts the bracket with the highest probability.
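A minimal sketch of the bin-then-classify workflow, using toy incomes (the generator here is an assumption for illustration) and the bracket boundaries from the table above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1200
education_years = rng.choice([9, 12, 14, 16, 18], n)
# Toy incomes driven by education (assumed generator, not the chapter's script)
income = np.exp(10.0 + 0.08 * education_years + rng.normal(0, 0.4, n))

# Bin continuous income into the policy brackets from the table above
bracket = pd.cut(income, bins=[-np.inf, 40_000, 90_000, np.inf],
                 labels=["low", "middle", "high"]).astype(str)

X = education_years.reshape(-1, 1).astype(float)
clf = LogisticRegression(max_iter=1000).fit(X, bracket)
print(clf.classes_)                    # one coefficient vector per class
print(clf.predict([[9.0], [18.0]]))   # predicted bracket = highest probability
```

Note that `pd.cut` does the discretization step explicitly, which makes the boundary values auditable -- a point the next subsection returns to.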
When discretization is appropriate -- and when it loses information¶
Discretizing a continuous variable is appropriate when:
Output categories are required by policy or reporting standards (poverty status, program eligibility thresholds)
Stakeholders need categorical labels for operational decisions
You are comparing subgroup membership across populations and bracket assignment is the unit of analysis
Discretization loses information when:
The continuous variable has meaningful variation within a bracket (a household at $39,000 and a household at $41,000 receive opposite labels despite near-identical incomes)
Downstream analysis needs the continuous value (household income totals, inequality metrics)
The bracket boundaries are arbitrary and could reasonably be drawn differently
The general rule: keep continuous variables continuous as long as possible and discretize only at the reporting or decision-making step. When brackets are policy-defined, document the boundary values explicitly and flag cases near the boundary.
6.1 Multi-class metrics: macro vs. weighted averaging¶
When a target has more than two classes, precision, recall, and F1 must be averaged across classes. Two averaging strategies exist:
Macro averaging computes the metric separately for each class and takes the unweighted mean, so small classes count as much as large ones and poor performance on a minority bracket shows up clearly.
Weighted averaging weights each class's metric by its share of records, so the dominant class drives the result.
The confusion matrix for three classes is a 3x3 grid. The diagonal cells are correct predictions; off-diagonal cells are errors. Looking at which adjacent bracket gets confused most often (low vs. middle, not low vs. high) reveals whether the model is making plausible near-boundary errors or systematic misclassification.
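The gap between the two averaging strategies is easy to demonstrate on toy labels shaped like the ~25/50/25 bracket shares (invented data, not the chapter's model):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Imbalanced 3-class toy: middle dominates, like the bracket shares above
y_true = np.array(["low"] * 60 + ["middle"] * 120 + ["high"] * 60)
y_pred = y_true.copy()
y_pred[:30] = "middle"        # confuse half of "low" with the adjacent "middle"

print("macro F1:   ", round(f1_score(y_true, y_pred, average="macro"), 3))
print("weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 3))
# 3x3 grid: diagonal = correct; the only off-diagonal mass is low -> middle,
# a plausible near-boundary error
print(confusion_matrix(y_true, y_pred, labels=["low", "middle", "high"]))
```

Macro F1 comes out lower than weighted F1 because the error is concentrated in the small "low" class -- exactly the kind of subgroup weakness that a weighted average can hide.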
7. Quick reference¶
8. Evaluate this output¶
These exercises ask you to interpret pre-computed results, not to run code. Work through the questions first; then run examples/chapter-01/05_exercises.py to verify your reasoning.
8.1 Interpret a regression report¶
A colleague fits a linear regression to predict income using only age and hours worked (two features, 90/10 split). They report:
MAE: $18,420
R²: 0.138
Questions:
Is R² = 0.138 good or bad? What would R² = 0 mean?
The four-feature model in Section 4 achieves R² around 0.38. What does the gap tell you?
Why might a 90/10 split produce noisier metrics than an 80/20 split?
If you were reviewing this model for production use in income imputation, what would you want to know before approving it?
Discussion
R² = 0.138 means the model explains about 14% of income variance -- well below the four-feature model. The missing features (education and urban status) carry substantial predictive power. A 90/10 split produces a smaller test set, so each individual prediction error has more influence on the reported metric. Before approving for production: check residuals for patterns, compare performance across education and urban subgroups, ask whether the two included features are proxies for anything protected.
8.2 Interpret a threshold decision¶
A field operations supervisor is reviewing the nonresponse model. The model team has run the threshold sensitivity analysis and reported this table:
threshold precision recall f1
0.35 0.782 0.710 0.744
0.50 0.838 0.651 0.732
0.65 0.904 0.565 0.696
The supervisor has a field budget that allows for 400 follow-up contacts in the 240-record test set.
Questions:
At threshold 0.35, is the model flagging too many, too few, or about the right number of records for 400 contacts?
The supervisor says “I want to catch as many nonrespondents as possible.” Which threshold would you recommend?
The budget analyst says “minimize unnecessary contacts.” Which threshold would you recommend?
What would you tell both of them about the trade-off?
Discussion
At threshold 0.35 the model flags more records (higher recall), likely exceeding 400 contacts; at 0.65 it flags fewer. The supervisor maximizing recall wants 0.35. The budget analyst minimizing unnecessary contacts wants 0.65. The right answer is a conversation about the cost ratio: how many unnecessary contacts is one missed nonrespondent worth? That is an operational judgment, not a statistical one. The model team’s job is to surface the trade-off clearly, not to make the policy decision.
8.3 Spot the error in a confusion matrix narrative¶
A report states: “Our nonresponse model achieved 91% accuracy on the test set. We are confident it correctly identifies nonrespondents.”
The confusion matrix for the test set (240 records) is:
Predicted: Did not respond Predicted: Responded
True: Did not respond 18 22
True: Responded 1 199
Questions:
Calculate the accuracy from the confusion matrix. Does it match 91%?
Among true nonrespondents (40 records), how many did the model correctly flag?
What is the recall for the nonresponse class?
Why is the accuracy number misleading here?
What would you write in your review memo?
Discussion
Accuracy = (18 + 199) / 240 = 90.4%, close to the reported 91%. But recall for nonrespondents = 18 / 40 = 45%. The model misses more than half of true nonrespondents. Accuracy is high because 200 of 240 test records are responders; a model that predicted “everyone responds” would achieve 83% accuracy with zero nonrespondent recall. The review memo should flag that recall for the target class (nonrespondents) is the operationally relevant metric, and that 45% recall is not sufficient for a follow-up targeting system.
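A quick sketch to verify the arithmetic in this discussion directly from the four cells of the matrix:

```python
# Cells from the confusion matrix above (positive class = responded)
tn, fp = 18, 22     # true nonrespondents: correctly flagged vs. missed
fn, tp = 1, 199     # true respondents
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
nonresponse_recall = tn / (tn + fp)           # recall for the nonresponse class
always_respond_baseline = (fn + tp) / total   # accuracy of "everyone responds"

print(f"accuracy: {accuracy:.1%}")                          # 90.4%
print(f"nonresponse recall: {nonresponse_recall:.0%}")      # 45%
print(f"baseline accuracy: {always_respond_baseline:.1%}")  # 83.3%
```

Three lines of arithmetic are enough to show that the headline accuracy sits only seven points above a model that does nothing at all.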
8.4 Optional: run the code¶
If you want to verify the numbers above or extend any analysis, all scripts are in examples/chapter-01/. Run them in order (01 generates the data; 02, 03, 04 analyze it; 05 contains exercise solutions).