
Chapter 1 - Regression and Classification for Survey Data

Predict continuous outcomes and categorical labels from demographic features, and learn to evaluate both honestly.

Full runnable code for all examples is in examples/chapter-01/.

Learning goals


1. What is supervised learning

Supervised learning is the branch of machine learning where you train a model on labeled examples and then ask it to predict the label for new, unlabeled records.

The rule of thumb is simple: if y is real-valued and measured on a continuous scale, reach for regression. If y is a discrete label -- responded / did not respond, low / medium / high income bracket, paper / web / phone mode of response -- reach for classification.

Federal statistics produces both kinds of targets constantly. Income, hours worked, and count-based outcomes (number of children, number of jobs) are continuous. Response propensity, data quality flags, and program eligibility categories are discrete. Knowing which family of methods applies before you open a dataset is the first judgment call a reviewer should make.

Evaluation mindset

Your job is not to build these models. Your job is to evaluate whether they were built correctly and whether the results can be trusted for the purpose claimed. When a vendor, contractor, or internal analyst hands you a model report, the following questions should guide your review:

The rest of this chapter builds your technical vocabulary for each of these questions.


2. Data: synthetic ACS-like survey records

The examples in this chapter use a synthetic dataset that mimics an ACS-style person record file. Every record is generated by a known statistical process, not drawn from any real respondent. This makes the dataset portable -- no downloads or API keys required -- and lets you compare model output to the known data-generating process.

The dataset contains 1,200 person records with these columns:

| Column | Description |
|---|---|
| state | One of five states, with Illinois overrepresented (25%) |
| age | Drawn from a normal distribution centered at 42, clipped to 18--80 |
| education_years | Discrete: 9, 12, 14, 16, or 18 years |
| hours_per_week | Hours worked, centered at 38, clipped to 0--80 |
| urban | 1 = urban tract (72% of records) |
| contact_attempts | Poisson-distributed contact attempts, clipped to 1--7 |
| prior_response | 1 if the person responded in the prior survey cycle (68%) |
| income | Log-normal, driven by education, age, and hours worked |
| responded | Binary: 1 = responded to this survey |

Income is deliberately log-normal because wage distributions are right-skewed: most earners cluster at lower values, but a long tail of high earners pulls the mean upward. Nonresponse probability is a logistic function of contact attempts and prior response history, which mirrors what survey paradata actually show.

To generate the dataset:

cd examples/chapter-01
python 01_generate_survey_data.py

The script prints the response rate and summary statistics. Inspect those numbers before fitting any model: they tell you what the data looks like before you ask the model to explain it.
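The generator script itself is not reproduced here, but a minimal sketch of the same kind of process -- variable names and parameters are illustrative, not copied from 01_generate_survey_data.py -- shows why the income column comes out right-skewed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_200

# Illustrative feature draws, mirroring the table above
age = np.clip(rng.normal(42, 12, n), 18, 80)
education_years = rng.choice([9, 12, 14, 16, 18], size=n)
hours = np.clip(rng.normal(38, 10, n), 0, 80)

# Log-normal income: a linear model on the log scale, exponentiated.
# The exponential stretches the upper tail, so mean > median.
log_income = 9.0 + 0.10 * education_years + 0.005 * age + 0.01 * hours
income = np.exp(log_income + rng.normal(0, 0.3, n))

df = pd.DataFrame({"age": age, "education_years": education_years,
                   "hours_per_week": hours, "income": income})
print(df["income"].describe())  # right-skewed: mean exceeds median
```

Inspecting the summary before modeling, as the text recommends, makes the skew visible immediately.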


3. The train/test split

Before fitting any model, divide the data into a training set (used to fit the model) and a test set (used to estimate how well it generalizes). The test set is data the model has never seen. Reporting performance on training data is not performance evaluation -- it is memorization measurement.

A typical split is 80% training, 20% test. The split must be done before any model fitting. If you use the test set to make any decisions -- adjusting features, tuning parameters, choosing between models -- it is no longer an honest estimate of out-of-sample performance.

from sklearn.model_selection import train_test_split

# 80% train, 20% test; random_state fixes the shuffle for reproducibility
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.20, random_state=42
)

# For classification, use stratify=y to preserve the response rate in both splits
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(
    X_clf, y_clf, test_size=0.20, random_state=42, stratify=y_clf
)

4. Regression: predicting income

Income is a continuous variable. The goal is to predict it from age, education, hours worked, and urban status. A linear regression model assumes that income is a weighted sum of those features plus an intercept. Each weight (coefficient) represents the estimated change in predicted income per one-unit increase in the corresponding feature, holding all other features constant.

4.1 Fit a linear regression model

Fitting the model in scikit-learn requires three lines:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

reg = LinearRegression()
reg.fit(X_reg_train, y_reg_train)
y_reg_pred = reg.predict(X_reg_test)

Running 02_regression_income.py fits this model and prints MAE, MSE, and R².
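Computing those three metrics takes one call each. A self-contained sketch on toy data (stand-ins for the chapter's X_reg and y_reg):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Toy data with a known linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

mae = mean_absolute_error(y_te, pred)  # average absolute error, same units as y
mse = mean_squared_error(y_te, pred)   # penalizes large errors quadratically
r2 = r2_score(y_te, pred)              # share of variance explained
print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")
```

All three are computed on the test set only, per the split discipline in section 3.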

4.2 Examine the coefficients

The coefficient table is the most directly interpretable output of a linear regression. A positive coefficient means the feature is associated with higher predicted income; a negative one means the opposite.

For this synthetic dataset, education has the largest coefficient by design: each additional year of education adds roughly $1,000-2,000 to predicted annual income. Age and hours worked contribute smaller increments. Urban status contributes a positive premium (urban workers in this dataset tend to have higher incomes).

Caution: these are correlations, not causal estimates. Multiple confounders exist, and a real analysis would report coefficients with confidence intervals and discuss limitations explicitly.
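Pairing each coefficient with its feature name makes the table readable at a glance. A sketch on simulated data constructed so that education dominates, echoing the chapter's setup (all coefficients and names here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "age": rng.uniform(18, 80, n),
    "education_years": rng.choice([9, 12, 14, 16, 18], n),
    "hours_per_week": rng.uniform(0, 80, n),
    "urban": rng.integers(0, 2, n),
})
# True generating process: education has the largest weight by construction
y = (1500 * X["education_years"] + 200 * X["age"]
     + 300 * X["hours_per_week"] + 800 * X["urban"]
     + rng.normal(0, 2000, n))

reg = LinearRegression().fit(X, y)
coefs = pd.Series(reg.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)  # education_years should carry the largest weight
```

Sorting by absolute value surfaces the dominant features first; the caution above still applies -- these are associations, not causal effects.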

4.3 Diagnostic plots

Two plots tell you whether the linear model fits the data reasonably:

Residual plot: Plot residuals (true minus predicted) on the vertical axis against predicted values on the horizontal axis. A well-fitting model shows points scattered randomly around zero. A fan shape (residuals grow larger at higher predicted values) signals that the model’s errors are larger for high-income individuals -- common with right-skewed distributions. A curve signals a non-linear relationship the model is missing.

Parity plot: Plot true values on the horizontal axis and predicted values on the vertical axis. A perfect model’s points would lie on the 45-degree diagonal. Systematic offset above or below the diagonal indicates bias. Wide scatter indicates high variance.

Both plots are produced by 02_regression_income.py and saved as PNG files.
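A minimal matplotlib sketch of both plots, on toy data (the chapter's script may differ in details):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=1.0, size=300)
pred = LinearRegression().fit(X, y).predict(X)
resid = y - pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: random scatter around zero is the healthy pattern
ax1.scatter(pred, resid, s=8)
ax1.axhline(0, color="red", lw=1)
ax1.set(xlabel="Predicted", ylabel="Residual (true - predicted)",
        title="Residual plot")

# Parity plot: points on the 45-degree line would be perfect predictions
ax2.scatter(y, pred, s=8)
lims = [y.min(), y.max()]
ax2.plot(lims, lims, color="red", lw=1)
ax2.set(xlabel="True", ylabel="Predicted", title="Parity plot")

fig.savefig("diagnostics.png", dpi=100)
```

Fans, curves, or off-diagonal drift in these two panels are the first things to look for in any regression report.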

4.4 How split choice affects accuracy

The metrics you report depend partly on which records ended up in the test set. This is not a flaw -- it is a property of finite data. Running the split with 30 different random seeds and measuring the variability of MAE shows you how much your reported number could shift under a different split.

The sensitivity analysis in 02_regression_income.py sweeps across test sizes of 10%, 20%, and 30% and reports mean and standard deviation of MAE across 30 seeds. Smaller test sets produce noisier estimates because you are averaging over fewer predictions.
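The sweep can be sketched in a few lines; the data here are a synthetic stand-in, not the chapter's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.0, size=600)

results = {}
for test_size in (0.10, 0.20, 0.30):
    maes = []
    for seed in range(30):  # 30 different shuffles of the same data
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
        maes.append(mean_absolute_error(y_te, pred))
    results[test_size] = (np.mean(maes), np.std(maes))
    print(f"test_size={test_size:.2f}  "
          f"MAE mean={results[test_size][0]:.3f}  sd={results[test_size][1]:.3f}")
```

Reporting the standard deviation across seeds alongside the point estimate is a cheap way to show how split-dependent a headline MAE really is.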

4.5 Regularization: Ridge and Lasso

When features correlate with each other, a plain linear regression can place unreasonably large weights on individual features. Regularization adds a penalty to the loss function that shrinks coefficients toward zero.

In practice:

from sklearn.linear_model import Ridge, Lasso

# alpha sets the penalty strength; larger values shrink coefficients harder
ridge = Ridge(alpha=100).fit(X_train, y_train)
lasso = Lasso(alpha=50, max_iter=10000).fit(X_train, y_train)

The alpha parameter controls how much shrinkage is applied. Larger alpha means more shrinkage. 02_regression_income.py compares all three models side by side.

What to look for in a regression report

When an analyst or vendor presents regression results, work through this checklist before accepting the findings:


5. Classification: predicting nonresponse

Whether a sampled unit responds is a binary outcome. Predicting it in advance allows field staff to prioritize follow-up contacts. High contact attempts and absent prior response history predict nonresponse; urban residents and habitual respondents are more likely to respond.

5.1 Fit a logistic regression model

Despite its name, logistic regression is a classification model. It predicts the probability of the positive class (responded = 1). A threshold -- default 0.5 -- converts that probability into a class label.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=500)
clf.fit(X_clf_train, y_clf_train)

y_clf_pred  = clf.predict(X_clf_test)          # class labels (0 or 1)
y_clf_proba = clf.predict_proba(X_clf_test)[:, 1]  # probability of responding

5.2 Classification metrics

Five metrics characterize a binary classifier. Each tells a different story:

For imbalanced outcomes -- which is typical of nonresponse -- accuracy is the least informative metric. A model that predicts “everyone responds” achieves high accuracy but zero recall for nonrespondents. F1 and AUC are better starting points.

Running 03_classification_nonresponse.py prints all five metrics for the default threshold and produces confusion matrix and ROC curve figures.
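A self-contained sketch of all five metrics on an imbalanced toy problem (make_classification stands in for the survey data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced binary problem: ~75% positive, like a high-response survey
X, y = make_classification(n_samples=800, weights=[0.25, 0.75], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=42, stratify=y)
clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "auc": roc_auc_score(y_te, proba),  # uses probabilities, not hard labels
}
print(metrics)
```

Note that AUC is the only one of the five computed from probabilities rather than thresholded labels, which is why it is threshold-independent.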

5.3 Confusion matrix and ROC curve

The confusion matrix shows where the model makes each type of error:

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| True 0 (nonresponse) | True Negative (TN) | False Positive (FP) |
| True 1 (response) | False Negative (FN) | True Positive (TP) |

In a nonresponse targeting system, False Positives -- records predicted to respond that actually did not -- mean missed field contacts: these households needed follow-up that was never scheduled. False Negatives -- records flagged for follow-up that would have responded anyway -- waste field budget. The right trade-off depends on the relative cost of each error type.

The ROC curve plots the true positive rate (recall) against the false positive rate at every possible threshold, summarizing the trade-off between catching positives and raising false alarms in a single curve. A model no better than random produces the 45-degree diagonal. AUC is the area under that curve.
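scikit-learn's roc_curve returns one (FPR, TPR) point per candidate threshold; a small hand-built example makes the mechanics concrete:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hand-built example: true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_proba)  # one point per threshold
auc = roc_auc_score(y_true, y_proba)
print(f"AUC = {auc:.3f}")  # 0.5 would be the diagonal; 1.0 a perfect ranking
```

AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why it measures ranking quality rather than any single threshold's performance.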

5.4 Threshold sensitivity

The default threshold of 0.5 is rarely optimal. Because the model scores the probability of responding, lowering the threshold to 0.35 labels more records as likely respondents (higher recall for the positive class, lower precision), while raising it flags more records for nonresponse follow-up. The threshold choice is ultimately a budget and policy decision.

The pre-computed results below show how metrics shift across thresholds for the synthetic dataset:

threshold  accuracy  precision  recall    f1
0.30       0.612     0.768      0.726     0.746
0.35       0.643     0.782      0.710     0.744
0.40       0.674     0.802      0.688     0.741
0.45       0.693     0.820      0.669     0.737
0.50       0.710     0.838      0.651     0.732
0.55       0.719     0.858      0.627     0.725
0.60       0.725     0.882      0.601     0.715
0.65       0.720     0.904      0.565     0.696
0.70       0.708     0.928      0.519     0.666

As the threshold rises, precision improves and recall falls. The F1 score peaks somewhere in the middle. Your agency’s field operations staff should weigh in on what matters more: catching more nonrespondents at the cost of unnecessary contacts, or conserving field budget at the cost of missed follow-ups.
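A sweep like the table above can be reproduced by thresholding predict_proba output directly; the data below are a synthetic stand-in for the chapter's dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, weights=[0.3, 0.7], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=42, stratify=y)
proba = LogisticRegression(max_iter=500).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

rows = []
for t in np.arange(0.30, 0.71, 0.05):
    pred = (proba >= t).astype(int)  # threshold applied to P(positive class)
    rows.append((t, precision_score(y_te, pred, zero_division=0),
                 recall_score(y_te, pred)))

for t, p, r in rows:
    print(f"{t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

Recall is guaranteed to be non-increasing as the threshold rises (raising the cutoff can only remove predicted positives); precision usually rises but is not guaranteed to be monotone.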

5.5 Interpreting logistic regression coefficients

Logistic regression coefficients are on a log-odds scale. The odds ratio -- the exponentiated coefficient -- is easier to interpret:

For example, an odds ratio of roughly 2.0 for prior_response means people who responded in the prior cycle are about twice as likely (in odds terms) to respond again, all else equal. Survey researchers call this “habitual respondents,” and it is one of the most reliable predictors in any nonresponse model (Groves & Couper, 1998).
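Converting fitted coefficients to odds ratios is one line of NumPy; the data and feature names here are placeholders, not the chapter's model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=4, random_state=2)
clf = LogisticRegression(max_iter=500).fit(X, y)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
# An odds ratio above 1 raises the odds of the positive class; below 1 lowers them.
odds_ratios = np.exp(clf.coef_[0])
for name, or_ in zip([f"feature_{i}" for i in range(4)], odds_ratios):
    print(f"{name}: odds ratio = {or_:.2f}")
```

An odds ratio of exactly 1.0 (coefficient 0) means the feature has no estimated association with the outcome, which makes 1.0 the natural reference line in any odds-ratio table.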

What to look for in a classification report

When reviewing a classification model for nonresponse or any binary survey outcome:


6. From regression to categories: income brackets

Sometimes a continuous prediction is binned into categories for reporting or program eligibility. Income brackets -- low, middle, high -- appear throughout federal statistics for exactly this reason.

Binning continuous income into three categories and treating the result as a multi-class classification problem illustrates both the mechanics of multi-class modeling and the trade-offs involved in discretization.

The synthetic dataset bins income as follows:

| Bracket | Range | Approximate share |
|---|---|---|
| Low | < $40,000 | ~25% |
| Middle | $40,000--$90,000 | ~50% |
| High | > $90,000 | ~25% |

A multi-class logistic regression fits one set of coefficients per class and predicts the bracket with the highest probability.
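A minimal sketch with simulated income and the bracket boundaries above (the features and parameters are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
income = rng.lognormal(mean=11, sigma=0.5, size=500)
# Two toy features: a noisy income proxy and pure noise
X = np.column_stack([np.log(income) + rng.normal(0, 0.2, 500),
                     rng.normal(size=500)])

# Bin continuous income into three brackets: 0=low, 1=middle, 2=high
bracket = np.digitize(income, bins=[40_000, 90_000])

clf = LogisticRegression(max_iter=1000).fit(X, bracket)
print(clf.classes_)      # three classes
print(clf.coef_.shape)   # one coefficient vector per class
```

clf.predict then returns, for each record, the class whose fitted probability is highest -- the "highest probability" rule in the sentence above.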

When discretization is appropriate -- and when it loses information

Discretizing a continuous variable is appropriate when:

Discretization loses information when:

The general rule: keep continuous variables continuous as long as possible and discretize only at the reporting or decision-making step. When brackets are policy-defined, document the boundary values explicitly and flag cases near the boundary.
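With pandas, the boundary values can be written down explicitly at the binning step; the incomes here are illustrative:

```python
import pandas as pd

income = pd.Series([12_000, 38_000, 55_000, 88_000, 120_000, 90_000])

# Policy-defined boundaries, documented explicitly in the bins list.
# right=False makes bins left-inclusive, so exactly $90,000 falls in "high".
brackets = pd.cut(income, bins=[0, 40_000, 90_000, float("inf")],
                  labels=["low", "middle", "high"], right=False)
print(brackets.tolist())
```

Writing the bins inline, rather than hard-coding comparisons, makes the boundary convention (which side is inclusive) auditable -- exactly the documentation the rule above calls for.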

6.1 Multi-class metrics: macro vs. weighted averaging

When a target has more than two classes, precision, recall, and F1 must be averaged across classes. Two averaging strategies exist:

The confusion matrix for three classes is a 3x3 grid. The diagonal cells are correct predictions; off-diagonal cells are errors. Looking at which adjacent bracket gets confused most often (low vs. middle, not low vs. high) reveals whether the model is making plausible near-boundary errors or systematic misclassification.
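The difference between the two averaging strategies shows up clearly on a small hand-built example where the rare class is predicted worst:

```python
from sklearn.metrics import f1_score

# Three-class example: class 0 common, class 2 rare
y_true = [0] * 10 + [1] * 6 + [2] * 2
y_pred = [0] * 10 + [1, 1, 1, 1, 0, 0] + [2, 0]  # rare class half missed

# macro: unweighted mean over classes -- the rare class counts as much
# as the common one. weighted: mean weighted by class support -- dominated
# by the common classes.
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(f"macro F1={f1_macro:.3f}  weighted F1={f1_weighted:.3f}")
```

Because the rare class performs worst here, the macro average comes out lower than the weighted one; a large gap between the two is itself a diagnostic that minority-class performance is lagging.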


7. Quick reference


8. Evaluate this output

These exercises ask you to interpret pre-computed results, not to run code. Work through the questions first; then run examples/chapter-01/05_exercises.py to verify your reasoning.

8.1 Interpret a regression report

A colleague fits a linear regression to predict income using only age and hours worked (two features, 90/10 split). They report:

MAE: $18,420
R²:  0.138

Questions:

Discussion

R² = 0.138 means the model explains about 14% of income variance -- well below the four-feature model. The missing features (education and urban status) carry substantial predictive power. A 90/10 split produces a smaller test set, so each individual prediction error has more influence on the reported metric. Before approving for production: check residuals for patterns, compare performance across education and urban subgroups, ask whether the two included features are proxies for anything protected.


8.2 Interpret a threshold decision

A field operations supervisor is reviewing the nonresponse model. The model team has run the threshold sensitivity analysis and reported this table:

threshold  precision  recall    f1
0.35       0.782      0.710     0.744
0.50       0.838      0.651     0.732
0.65       0.904      0.565     0.696

The supervisor has a field budget that allows for 400 follow-up contacts in the 240-record test set.

Questions:

Discussion

At threshold 0.65 more records fall below the response-probability cutoff, so more are flagged for follow-up; at 0.35 fewer are flagged. The supervisor who wants to catch as many true nonrespondents as possible prefers the higher threshold, at the cost of more unnecessary contacts; the budget analyst minimizing those contacts prefers the lower one. The right answer is a conversation about the cost ratio: how many unnecessary contacts is one missed nonrespondent worth? That is an operational judgment, not a statistical one. The model team's job is to surface the trade-off clearly, not to make the policy decision.


8.3 Spot the error in a confusion matrix narrative

A report states: “Our nonresponse model achieved 91% accuracy on the test set. We are confident it correctly identifies nonrespondents.”

The confusion matrix for the test set (240 records) is:

                Predicted: Did not respond   Predicted: Responded
True: Did not respond           18                  22
True: Responded                  1                 199

Questions:

Discussion

Accuracy = (18 + 199) / 240 = 90.4%, close to the reported 91%. But recall for nonrespondents = 18 / 40 = 45%. The model misses more than half of true nonrespondents. Accuracy is high because 200 of 240 test records are responders; a model that predicted “everyone responds” would achieve 83% accuracy with zero nonrespondent recall. The review memo should flag that recall for the target class (nonrespondents) is the operationally relevant metric, and that 45% recall is not sufficient for a follow-up targeting system.
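The arithmetic in the discussion can be checked directly from the four cells:

```python
# Cells from the exercise's confusion matrix (240 test records)
tn, fp = 18, 22   # true nonrespondents: predicted nonresponse / predicted response
fn, tp = 1, 199   # true respondents:    predicted nonresponse / predicted response

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_nonresponse = tn / (tn + fp)  # coverage of the operationally relevant class
baseline = (fn + tp) / total         # accuracy of predicting "everyone responds"

print(f"accuracy={accuracy:.3f}  nonrespondent recall={recall_nonresponse:.3f}  "
      f"baseline={baseline:.3f}")
```

Three lines of arithmetic are enough to show that the headline accuracy is only seven points above the trivial baseline while nonrespondent recall sits at 45%.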


8.4 Optional: run the code

If you want to verify the numbers above or extend any analysis, all scripts are in examples/chapter-01/. Run them in order (01 generates the data; 02, 03, 04 analyze it; 05 contains exercise solutions).


Key takeaways for survey methodology