
Chapter 4 - Neural Networks Basics

The building blocks of deep learning, without the hype. You will understand what a neural network does, when to use one, and when a Random Forest is the better choice.

Full runnable code for all examples is in examples/chapter-04/.

Learning goals

- Explain what a single neuron computes and how layers of neurons form a multilayer perceptron.
- Read a training curve and diagnose undertraining and an overly large learning rate.
- Train MLP classifiers and regressors in scikit-learn, using early stopping and target standardization.
- Compare the MLP against logistic regression, decision trees, and Random Forests on the same survey data.
- Weigh the interpretability, deployment, and approval costs of neural networks in federal environments.

1. What is a neural network?

1.1 The neuron

A single artificial neuron (Rosenblatt, 1958) takes a weighted sum of its inputs and passes it through a nonlinear function called an activation function:

$$a = \sigma\!\left(\sum_{j} w_j x_j + b\right)$$

A single neuron with a sigmoid activation is logistic regression. The power of neural networks comes from stacking many neurons in layers.
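The equation above can be sketched in a few lines of NumPy; the input, weight, and bias values below are illustrative, not taken from the chapter's dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of inputs, then the nonlinear activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # three illustrative inputs
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # bias

a = neuron(x, w, b)              # a single activation in (0, 1)
```

With a sigmoid activation, `a` is a probability-like value in (0, 1) — exactly what logistic regression produces for the same inputs.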

1.2 Layers

A multilayer perceptron (MLP) stacks neurons into three types of layers: an input layer that receives the features, one or more hidden layers that transform them, and an output layer that produces the prediction.

The architecture diagram in examples/chapter-04/01_dataset_setup.py visualises a network with 5 inputs, two hidden layers of 6 units each, and one binary output. Every node in each layer connects to every node in the next layer — a fully connected network. Those connections carry the learned weights.

1.3 Activation functions

Without nonlinear activations, stacking layers is mathematically equivalent to a single linear transformation — no more powerful than logistic regression. Activation functions break the linearity, allowing the network to learn complex patterns.

The choice of activation function rarely shifts results by more than a few tenths of a percentage point on tabular survey data. ReLU is the right default.
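A quick NumPy comparison of the three common activations on a few sample inputs (illustrative values only):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # pre-activation values

relu = np.maximum(0.0, z)           # zeroes negatives, identity on positives
tanh = np.tanh(z)                   # squashes to (-1, 1)
sigmoid = 1.0 / (1.0 + np.exp(-z))  # squashes to (0, 1)
```

ReLU is also the cheapest to differentiate (its gradient is 0 or 1), which is part of why it is scikit-learn's default hidden-layer activation.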

1.4 How training works

Training adjusts all the weights $w_j$ to minimize a loss function (prediction error). The process:

  1. Forward pass: feed a batch of training records through the network, compute predicted outputs.

  2. Loss: compute the error between predictions and labels. Classification uses cross-entropy loss. Regression uses mean squared error.

  3. Backpropagation (Rumelhart, Hinton & Williams, 1986): compute the gradient of the loss with respect to every weight using the chain rule of calculus. This tells each weight “if you increase, does the loss go up or down?”

  4. Gradient descent: nudge every weight slightly in the direction that reduces loss. The learning rate controls how big each nudge is.

  5. Repeat for many passes through the training data (epochs).
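The loop above can be made concrete with a toy one-dimensional example; the quadratic loss below is a stand-in for a real network's loss surface:

```python
# Toy gradient descent on loss(w) = (w - 3)^2, whose minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw, computed analytically here; backpropagation computes
    # the same quantity for every weight via the chain rule
    return 2.0 * (w - 3.0)

w = 0.0     # initial weight
lr = 0.1    # learning rate: the size of each nudge
for epoch in range(100):
    w -= lr * grad(w)   # step in the direction that reduces the loss
```

After 100 epochs `w` has converged to the minimum. Setting `lr` above 1.0 in this toy makes every step overshoot the minimum, so the iterates oscillate and diverge instead of settling.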

You do not need to implement backpropagation. Scikit-learn does it automatically. What you do need to understand is the training curve: a plot of loss vs. epoch. A well-behaved curve shows loss decreasing and then flattening as the model converges. Loss still falling steeply at the last epoch means the model is undertrained (increase max_iter). Loss bouncing erratically means the learning rate is too high.
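A minimal sketch of inspecting that curve programmatically; `make_classification` here is a synthetic stand-in for the chapter's survey data:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, just to illustrate the diagnostic
X, y = make_classification(n_samples=1200, n_features=5, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
mlp.fit(X, y)

# loss_curve_ holds the training loss at each epoch; plot it, or at
# minimum check that training stopped before hitting the iteration limit
first, last = mlp.loss_curve_[0], mlp.loss_curve_[-1]
converged_before_limit = mlp.n_iter_ < mlp.max_iter
```

If `converged_before_limit` is False and the tail of `loss_curve_` is still dropping steeply, raise `max_iter` and retrain.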

The gradient descent illustration in examples/chapter-04/01_dataset_setup.py shows a toy one-dimensional loss surface. Notice that a learning rate that is too large causes the optimizer to overshoot the minimum and oscillate — the exact pattern you see when an MLP training curve never settles.


2. Setup: the same survey dataset

We use the same synthetic dataset from Chapters 1-3. This is the final model comparison point for Part I.

The dataset has n=1,200 synthetic survey respondents with five classification features (age, education_years, urban, contact_attempts, prior_response) and a binary nonresponse outcome, plus four regression features for income prediction.


3. MLP for classification: nonresponse prediction

The MLP classifier in examples/chapter-04/02_mlp_classification.py uses two hidden layers of 100 and 50 units, ReLU activations, the Adam optimizer (Kingma & Ba, 2015), and early stopping. Early stopping monitors the held-out validation loss and halts training when it stops improving — the simplest overfitting defense in scikit-learn’s MLP.
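The configuration described above can be sketched as follows; the dataset is a synthetic stand-in (`make_classification`) rather than the chapter's survey file, and the `StandardScaler` step reflects the standard practice of scaling inputs before an MLP:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the chapter's survey data
X, y = make_classification(n_samples=1200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),                   # MLPs need scaled inputs
    MLPClassifier(
        hidden_layer_sizes=(100, 50),   # two hidden layers, as in the text
        activation="relu",
        solver="adam",
        early_stopping=True,            # holds out 10% of training data as validation
        random_state=0,
    ),
)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

With `early_stopping=True`, training halts once the held-out validation score stops improving for `n_iter_no_change` consecutive epochs.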

The training curve is the primary convergence diagnostic. Read it before trusting any metrics. A curve that is still falling steeply when training ends means the model was stopped too early; a curve that oscillates without settling means the learning rate is too high.


4. MLP for regression: income prediction

examples/chapter-04/03_mlp_regression.py demonstrates regression on the income target. Two additions beyond the classification setup:

  1. The target is standardized before training (mean 0, std 1). Without this, the output layer weights must span a 10,000–250,000 range, which conflicts with the small random values used at initialization.

  2. Predictions are de-standardized after training before computing MAE and R².
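One way to get both steps without hand-rolled bookkeeping is scikit-learn's `TransformedTargetRegressor`, which standardizes the target at fit time and inverts the transform automatically at predict time. The income-like data below is synthetic:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in target, rescaled to an income-like dollar range
X, y = make_regression(n_samples=1200, n_features=4, noise=10.0, random_state=0)
y = 50_000 + 8_000 * (y - y.mean()) / y.std()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
    ),
    transformer=StandardScaler(),  # standardizes y for training, inverts predictions
)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))  # MAE in dollars
```

Because the inverse transform is applied inside `predict`, the MAE is computed directly on the original dollar scale.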

The parity plot (actual vs. predicted income) is the standard regression diagnostic. Systematic below-diagonal bias in the high-income range would indicate the model is not capturing the upper tail — a common pattern when a log-income distribution is modeled on limited data.


5. Hyperparameters: what to tune

Neural networks have more tunable hyperparameters than logistic regression or decision trees. The key ones for scikit-learn’s MLPClassifier:

| Parameter | What it controls | Typical range |
| --- | --- | --- |
| hidden_layer_sizes | Network depth and width | (64,), (64, 64), (128, 64) |
| activation | Nonlinearity in hidden layers | "relu" (default), "tanh" |
| learning_rate_init | Step size for gradient descent | 0.0001 to 0.01 |
| alpha | L2 regularization (controls overfitting) | 0.0001 to 1.0 |
| max_iter | Maximum training epochs | 200 to 1,000 |
| early_stopping | Stop when validation loss plateaus | True (recommended) |

examples/chapter-04/04_architecture_search.py benchmarks six configurations from a single hidden layer of 50 units to a three-layer pyramid (100, 50, 25). The result on n=1,200 survey records is almost always the same: the AUC spread across all configurations is less than one percentage point. Larger architectures have more parameters, take longer to converge, and are more sensitive to the learning rate — without offering a measurable accuracy advantage.
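A trimmed-down version of that benchmark (three configurations instead of six, on synthetic stand-in data) shows the structure of the search:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = {}
for sizes in [(50,), (64, 64), (100, 50, 25)]:
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=sizes, early_stopping=True, random_state=0),
    )
    clf.fit(X_tr, y_tr)
    aucs[sizes] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# The spread across architectures is the number to watch
spread = max(aucs.values()) - min(aucs.values())
```

On a dataset of this size, `spread` is typically a small fraction of an AUC point, which is the chapter's argument in miniature.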

The lesson is not to find the optimal architecture. The lesson is that architecture tuning is largely wasted effort on modest tabular datasets. If the architecture spread is 0.5 AUC points and the RF-to-MLP gap in Section 6 is also 0.5 AUC points, the “neural network improvement” disappears into tuning noise.

5.2 Regularization with alpha (L2 penalty)

examples/chapter-04/05_regularization.py sweeps alpha across five orders of magnitude. The diagnostic output is the train-test AUC gap: a large gap signals overfitting (raise alpha), while a near-zero gap alongside weak test AUC signals underfitting (lower alpha).
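A sketch of that sweep on synthetic stand-in data; the per-alpha gap is the quantity the chapter's script reports:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for alpha in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:   # five orders of magnitude
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), alpha=alpha,
                      max_iter=500, random_state=0),
    )
    clf.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    gaps[alpha] = train_auc - test_auc   # large gap = overfitting
```

Plotting `gaps` against alpha on a log scale gives the characteristic U-shaped trade-off between over- and underfitting.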


6. The full four-model comparison

examples/chapter-04/06_four_model_comparison.py refits all four model families on the same training split and evaluates them on the same held-out test set. The results below are representative of what this script produces on the n=1,200 synthetic dataset:

| Model | Accuracy | Precision | Recall | F1 | AUC |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.754 | 0.771 | 0.881 | 0.822 | 0.782 |
| Decision Tree (depth 3) | 0.742 | 0.758 | 0.878 | 0.814 | 0.759 |
| Random Forest (200 trees) | 0.771 | 0.784 | 0.893 | 0.835 | 0.813 |
| MLP (100, 50) | 0.768 | 0.781 | 0.887 | 0.831 | 0.809 |
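The comparison script's structure, condensed to AUC only, looks roughly like this (synthetic stand-in data, so the numbers will not match the table above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# All four families, fit on the same split, scored on the same test set
models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression()),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "mlp": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(100, 50),
                                       early_stopping=True, random_state=0)),
}
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
```

Holding the split and metric fixed across all four models is the point: any comparison with mismatched splits is not a comparison.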

7. The interpretability cost

examples/chapter-04/07_interpretability.py demonstrates the contrast directly. The output below is representative:

```
Logistic Regression — interpretable coefficients:
  prior_response:    -1.18  (strongest negative predictor of nonresponse)
  contact_attempts:  +0.24  (more attempts → more likely to not respond)
  urban:             -0.29  (urban respondents slightly more likely to respond)
  age:               +0.01  (small positive effect)
  education_years:   -0.02  (negligible)

Neural Network (100,50) — weight matrix shapes:
  Layer 1 weights: (5, 100)   — 500 parameters
  Layer 1 bias:    (100,)     — 100 parameters
  Layer 2 weights: (100, 50)  — 5,000 parameters
  Layer 2 bias:    (50,)      — 50 parameters
  Output weights:  (50, 1)    — 50 parameters
  Total trainable parameters: 5,700
  → No coefficient you can print and explain
```

The logistic regression output is a methodology table. The MLP output is a weight matrix — 5,700 numbers that do not translate into decision rules or odds ratios.
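Those shapes come straight from the fitted estimator's `coefs_` and `intercepts_` attributes; the sketch below counts every weight and bias (note that scikit-learn also stores one bias for the output unit):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(100, 50),
                    max_iter=100, random_state=0).fit(X, y)

shapes = [w.shape for w in mlp.coefs_]   # one weight matrix per layer
n_params = (sum(w.size for w in mlp.coefs_)
            + sum(b.size for b in mlp.intercepts_))  # includes the output bias
```

For a 5-feature binary problem with hidden layers (100, 50), `shapes` is `[(5, 100), (100, 50), (50, 1)]` — thousands of parameters, none of which reads as an odds ratio.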

Partial dependence plots (PDPs; Friedman, 2001) provide the best available aggregate explanation for a neural network. They show the marginal effect of one feature on predictions, averaged over all other features. The PDP for prior_response will tell you “as prior_response increases from 0 to 1, the predicted nonresponse probability drops by X points.” That is useful. But a PDP cannot tell you why the model predicted 0.72 nonresponse probability for household 1042 specifically. For individual-case explanation, you need SHAP values (covered in later chapters) or a simpler model.

7.1 The complexity burden

Interpretability is one dimension of a broader cost structure. In federal environments, each additional dimension matters:

Interpretability cost: cannot print decision rules; hard to explain a specific decision in response to a respondent inquiry or IG question.

Deployment cost: scikit-learn’s MLPClassifier runs on CPU and requires only numpy and scipy — it will work in Colab or a standard agency Python environment. PyTorch and TensorFlow may require additional IT approval, GPU allocation, or cloud infrastructure that is not in the current ATO.

Maintenance cost: retraining pipelines, monitoring for distribution shift, version control for model weights. A logistic regression retrained quarterly is a spreadsheet operation. An MLP retraining pipeline is a software engineering project.

Approval cost: OMB review for new statistical methodology, ATO process for new infrastructure, and potentially a Data Governance Board sign-off. The ATO timeline for new infrastructure can exceed the shelf life of the model.

Auditability cost: harder to defend in response to FOIA requests, IG audits, or congressional inquiries. “The algorithm determined it” is not an answer when the agency is required to explain individual decisions under the Privacy Act or agency program rules.

These are real costs in federal environments. A one-percent AUC improvement rarely covers them.


8. When to use a neural network

The model selection guide below summarises the decision for federal survey work. References to “Chapter 11” (Transformers) and “Chapter 12” (LLMs and language models) indicate where unstructured-data applications are covered.

| Situation | Recommended model | Reason |
| --- | --- | --- |
| Small tabular dataset (< 10K records) | Logistic regression or Random Forest | NN overfits easily; simpler models generalize better |
| Medium tabular dataset (10K–1M records) | Random Forest or gradient boosting | Strong performance; interpretable feature importance |
| Large tabular dataset (> 1M records) | Neural network or gradient boosting | NN can learn complex interactions at scale |
| Text data (survey open-ends) | Fine-tuned language model (Chapter 12) | NNs dominate unstructured text |
| Image data (form processing) | CNN (covered in later chapters on language models) | Spatial hierarchy requires NNs |
| Need printable decision rules | Decision tree (shallow) | Rules are auditable and attachable to methodology reports |
| Need coefficients for methodology | Logistic regression | Direct odds-ratio interpretation |
| Constrained IT environment / no GPU | Logistic regression or Random Forest | sklearn MLP uses CPU; PyTorch/TF may require ATO |

8.1 When neural networks earn their keep

To be specific about the cases where the complexity is justified:

- Very large tabular datasets (more than roughly one million records), where complex interactions exist and simpler models plateau.
- Unstructured text, such as survey open-ends, where fine-tuned language models dominate.
- Image data, such as scanned form processing, where the spatial hierarchy requires neural architectures.
- Tasks where a well-designed benchmark shows simpler models demonstrably and substantially underperforming.

8.2 Questions to ask when a vendor proposes a neural network

Before accepting a vendor’s claim that a neural network outperforms existing methods on your data, ask these seven questions:

  1. How much training data was used? On typical tabular datasets, tree-based models consistently match or outperform neural networks, especially at sample sizes under 10,000 (Grinsztajn, Oyallon & Varoquaux, 2022). Neural networks begin to close the gap only on larger datasets. Under 10,000 records with tabular features, the neural network is likely overfit.

  2. What is the baseline comparison? Did they compare to a Random Forest on the same data, with the same train-test split, evaluated on the same metric?

  3. What is the performance gap? If the improvement is less than one to two AUC points, is the additional complexity justified by the agency’s actual decision requirements?

  4. How is the model explained? PDPs? SHAP? Or “trust the system”? For federal programs, “trust the system” is not an acceptable methodology defense.

  5. What is the deployment environment? GPU required? Cloud dependency? Is it on the approved software list? What does the ATO timeline look like?

  6. What is the retraining cadence? Neural networks can degrade when the data distribution shifts (survey population changes, operational procedure changes), and standard MLPs on tabular data have shown larger robustness gaps than well-tuned tree-based models in benchmark comparisons (Grinsztajn, Oyallon & Varoquaux, 2022). Who owns the retraining pipeline?

  7. What happens if the model fails? Is there a fallback strategy? Can the agency revert to a rule-based system or a logistic regression while the neural network is retrained or audited?


9. In-class activity

You are evaluating four modeling approaches for a nonresponse prediction task at a regional office. Your office has the same 300-tract dataset used throughout Part I. The following pre-computed results table represents what a full comparison produces on this dataset:

| Model | Accuracy | F1 | AUC-ROC |
| --- | --- | --- | --- |
| Logistic Regression | (run the script) | (run the script) | (run the script) |
| Decision Tree (depth 3) | | | |
| Random Forest (100 trees) | | | |
| MLP (64, 64) | | | |

Exercise questions:

  1. Run examples/chapter-04/08_exercises.py and record the results table. Which model would you recommend deploying at this regional office? Write a one-paragraph justification that cites specific evidence from the metrics.

  2. A vendor proposes replacing all four models with a deep neural network. Using the checklist in Section 8.2, write out all seven questions as they apply to this specific tract-level prediction task.

  3. The IT department says PyTorch is not on the approved software list. What are your options? (Hint: scikit-learn’s MLPClassifier uses numpy and scipy, not PyTorch. What does that tell you about the approval question for the sklearn MLP specifically? What are the remaining questions you would still need to answer?)

  4. If the MLP achieves 0.815 AUC vs. the Random Forest’s 0.813 AUC on the tract dataset, would you recommend the switch? Identify at least three factors from Section 7.1 (The complexity burden) that govern the answer.

  5. Optional: modify the solution in 08_exercises.py to add a fifth model (gradient boosting via sklearn.ensemble.GradientBoostingClassifier). Does it change the recommendation?


Key takeaways for survey methodology

- A single sigmoid neuron is logistic regression; depth plus nonlinear activations is what adds expressive power.
- Read the training curve before trusting any metric: a still-falling curve means undertraining, an oscillating curve means the learning rate is too high.
- On modest tabular survey data, architecture tuning moves AUC by less than the RF-to-MLP gap, so the "neural network improvement" disappears into tuning noise.
- Interpretability, deployment, maintenance, approval, and auditability costs are real in federal environments; a one-percent AUC improvement rarely covers them.


Part I summary

You have now seen four model families applied to the same federal survey prediction task. For most survey prediction tasks on tabular data, logistic regression or a Random Forest is the right starting point. Use a decision tree when you need printable rules that can be attached to a methodology report. Consider a neural network only when the data is very large, unstructured, or when simpler models demonstrably and substantially underperform on a well-designed benchmark.

Part II introduces methods for specific federal data challenges: record linkage, dimension reduction, and imputation — problems where the right algorithm choice depends on data structure and agency context, not on maximizing AUC on a standard benchmark.