Chapter - Model Evaluation

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program and is not derived from Automate the Boring Stuff with Python; it is written in a similar tone for continuity with the other chapters.

About this chapter

You have trained a model. Now the real question: is it any good, and how do you know? Training a model is easy; trusting one takes evidence. This chapter covers how to measure model performance, why a single number like accuracy can mislead, and how to investigate where a model fails — the Tier 2–4 capability the maturity model calls Algorithm Evaluation.

We will keep the running example from the rest of the program: a model that inspects weld images and flags defects.

Evaluate on data the model has not seen

A model's score on its training data tells you how well it memorized, not how well it will work. Always report performance on a held-out test set (and tune on a separate validation set). If you train and score on the same rows, the score will look better than the model really is, because the model has already seen the answers.
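Here is a minimal sketch of a three-way split using only the standard library. The helper name split_dataset, the file names, and the 70/15/15 proportions are assumptions made for illustration; they are not part of the program's pipeline.

```python
import random

def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows once, then cut them into train, validation, and test lists."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical data: 100 weld images, every tenth one labeled as a defect.
rows = [(f"weld_{i:03d}.png", 1 if i % 10 == 0 else 0) for i in range(100)]
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Train on train, tune thresholds and settings on val, and touch test only once, for the number you report. With a class as rare as weld defects, a plain random split can leave very few defects in the test set, so check the label counts in each piece before trusting it.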

Classification metrics

Most inspection problems are classification: defect or no defect. Start from the four outcomes a binary classifier can produce.

	Predicted defect	Predicted OK
Actually defect	True Positive (TP)	False Negative (FN)
Actually OK	False Positive (FP)	True Negative (TN)

A confusion matrix is just this table filled with counts. Almost every metric below is a ratio of these four cells.

The diagonal cells (TP, TN) are the correct predictions; the off-diagonal cells (FP, FN) are the two kinds of error. For weld inspection, a False Negative (a missed defect) is usually far more costly than a False Positive.
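Counting the four cells takes only a few lines. This is a small sketch that assumes labels are coded 1 for defect and 0 for OK; the function name confusion_counts and the ten example welds are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 = defect, 0 = OK."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1      # real defect, flagged
        elif actual == 0 and predicted == 1:
            fp += 1      # good weld, flagged anyway (false alarm)
        elif actual == 1 and predicted == 0:
            fn += 1      # real defect, missed
        else:
            tn += 1      # good weld, passed
    return tp, fp, fn, tn

# Made-up labels for ten welds (1 = defect, 0 = OK).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=1 TN=6
```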

Accuracy, and why it misleads

Accuracy is the fraction of predictions that were correct:

accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is intuitive but dangerous on imbalanced data. If only 2% of welds are defective, a model that predicts "OK" for everything scores 98% accuracy while catching zero defects. The number looks great and the model is useless.
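You can check that arithmetic yourself. The sketch below builds a hypothetical batch of 1,000 welds with 2% defects and scores a "model" that calls every weld OK; the batch size is an assumption chosen to make the numbers round.

```python
# Hypothetical batch: 1,000 welds, 2% defective (1 = defect, 0 = OK).
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000   # a "model" that calls every weld OK

correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
accuracy = correct / len(y_true)
defects_caught = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)

print(f"accuracy = {accuracy:.0%}")          # accuracy = 98%
print(f"defects caught = {defects_caught}")  # defects caught = 0
```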

Precision and recall

When one class is rare and important, use precision and recall.

Precision: of the welds we flagged, how many were really defective? Precision = TP / (TP + FP). High precision means few false alarms.
Recall: of the welds that were defective, how many did we catch? Recall = TP / (TP + FN). High recall means few misses.

There is usually a trade-off. Flag more aggressively and recall rises but precision falls. For safety inspection you typically favor recall — missing a real defect is the expensive error.
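The trade-off is easy to see if you score the same predictions at several thresholds. This sketch assumes the model outputs a defect score between 0 and 1 and that a weld is flagged when its score is at or above the threshold; the function name and the ten scores are invented for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Flag a defect when score >= threshold, then compute precision and recall."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # convention when there are no defects
    return precision, recall

# Made-up data: four real defects and six good welds, with model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05, 0.02]

for threshold in (0.7, 0.5, 0.25):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.7 to 0.25 raises recall from 0.50 to 1.00 while precision drops from 1.00 to about 0.67.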

F1 score

The F1 score is the harmonic mean of precision and recall, a single number that rewards a model only when both are high:

F1 = 2 * (precision * recall) / (precision + recall)
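To see why the harmonic mean is used, compare it with an ordinary average on a lopsided model. The precision and recall values below are made up; returning 0.0 when both are zero is a common convention, assumed here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A lopsided model: almost no false alarms, but it misses most defects.
precision, recall = 0.95, 0.10
print(f"arithmetic mean = {(precision + recall) / 2:.3f}")        # 0.525
print(f"F1              = {f1_score(precision, recall):.3f}")     # 0.181
```

The ordinary average makes this model look middling. F1 stays close to the weaker of the two numbers, which is the behavior you want when a low recall should not be hidden by a high precision.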

The ROC curve and AUC

Many classifiers output a probability, and you choose a threshold above which you flag a defect. The ROC curve plots recall (true-positive rate) against the false-positive rate as you sweep that threshold. AUC (area under the curve) summarizes it: 1.0 is perfect, 0.5 is random. AUC is useful for comparing models independent of any single threshold.
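The threshold sweep can be written out directly. This is a plain-Python sketch, not a library implementation: it assumes both classes are present in y_true, flags a weld when its score is at or above the threshold, and measures the area with the trapezoid rule. The function names and the data are invented for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    n_pos = sum(y_true)               # number of real defects (must be > 0)
    n_neg = len(y_true) - n_pos       # number of good welds (must be > 0)
    points = [(0.0, 0.0)]             # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC points using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up data: four real defects and six good welds, with model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05, 0.02]

points = roc_points(y_true, scores)
print(f"AUC = {auc(points):.3f}")  # AUC = 0.875
```

Another way to read the result: 0.875 is the share of (defect, good weld) pairs in which the defect received the higher score, here 21 of 24 pairs.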

Regression metrics

If the target is a number (for example, predicted weld strength in MPa), use:

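MAE (mean absolute error): the average size of the error, in the original units. Easy to explain.
RMSE (root mean squared error): like MAE, but errors are squared before averaging, so a few large misses raise it much more than many small ones. Also in the original units.
R²: the fraction of variance explained; 1.0 is perfect, 0 is no better than predicting the mean, and a negative value means the model is worse than predicting the mean.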
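All three metrics come from the same list of errors. The sketch below uses made-up weld strengths in which one prediction is badly off, to show how RMSE reacts more strongly than MAE. The function name is hypothetical, and the code assumes the true values are not all identical (otherwise R² is undefined).

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                  # squared error left unexplained
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # squared spread around the mean
    r2 = 1 - ss_res / ss_tot                             # assumes ss_tot > 0
    return mae, rmse, r2

# Made-up weld strengths in MPa; the last prediction misses by 30 MPa.
y_true = [400.0, 420.0, 410.0, 430.0, 440.0]
y_pred = [405.0, 415.0, 412.0, 428.0, 410.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE  = {mae:.2f} MPa")   # MAE  = 8.80 MPa
print(f"RMSE = {rmse:.2f} MPa")  # RMSE = 13.84 MPa
print(f"R^2  = {r2:.3f}")        # R^2  = 0.042
```

Four of the five predictions are within 5 MPa, yet the single 30 MPa miss pulls RMSE well above MAE and drags R² close to zero.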

Look past the single number: error analysis

A score tells you how much the model is wrong, not where. Tier 3–4 work is the investigation:

Pull the misclassified examples and look for patterns — one lighting condition, one defect type, one camera.
Slice metrics by group (per defect class, per production line). An overall recall of 90% can hide a 40% recall on the rarest, most critical defect.
Check calibration: when the model says "90% sure," is it right about 90% of the time?
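Two of these checks, slicing and calibration, can be sketched in a few lines. Everything here is illustrative: the helper names, the defect types, the counts, and the choice of five equal-width probability bins are assumptions, not the program's tooling.

```python
from collections import defaultdict

def recall_by_group(records):
    """records: (group, actual, predicted) tuples with 1 = defect, 0 = OK."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 1:                 # recall only looks at real defects
            total[group] += 1
            if predicted == 1:
                caught[group] += 1
    return {group: caught[group] / total[group] for group in total}

def calibration_table(y_true, probs, n_bins=5):
    """Per bin: (mean predicted probability, observed defect rate, count)."""
    bins = defaultdict(list)
    for y, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    table = []
    for index in sorted(bins):
        ys = [y for y, _ in bins[index]]
        ps = [p for _, p in bins[index]]
        table.append((sum(ps) / len(ps), sum(ys) / len(ys), len(ys)))
    return table

# Slicing: hypothetical defect types, porosity common and cracks rare.
records = ([("porosity", 1, 1)] * 47 + [("porosity", 1, 0)] * 3
           + [("crack", 1, 1)] * 2 + [("crack", 1, 0)] * 3)
overall = sum(p for _, _, p in records) / len(records)  # every record is a real defect
print(f"overall recall = {overall:.0%}")
for group, value in recall_by_group(records).items():
    print(f"  {group}: {value:.0%}")

# Calibration: the model says 0.9 ten times but only six are real defects.
y_true = [1] * 6 + [0] * 4 + [0] * 10
probs = [0.9] * 10 + [0.1] * 10
for mean_p, rate, count in calibration_table(y_true, probs):
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  (n={count})")
```

In this toy data the overall recall is about 89%, but recall on cracks is only 40%. The calibration rows show a model that is overconfident in its high-probability bin: it predicts 0.90 where the observed defect rate is 0.60.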

Interpretability

For image models, Grad-CAM highlights the pixels that drove a prediction. It is how you confirm the model is looking at the weld bead and not at a label or a shadow in the background. A model that is right for the wrong reason will fail in production. (See the Grad-CAM computer-vision lab.)

A short evaluation checklist

Score on a held-out test set, never the training set.
For classification, report a confusion matrix plus precision, recall, and F1 — not accuracy alone.
Favor the metric that matches the cost of each error (usually recall for safety).
Slice metrics by class and group to find hidden weak spots.
Use interpretability to confirm the model looks at the right thing.

Practice Questions

Why is it wrong to report a model's accuracy on its training data?
Define true positive, false positive, false negative, and true negative for weld inspection.
A defect detector reaches 98% accuracy on a dataset that is 2% defective. Why might it still be useless?
In your own words, what is the difference between precision and recall?
For safety-critical inspection, which error is usually worse: a false positive or a false negative? Why?
What does the F1 score combine, and why use the harmonic mean?
What does AUC measure, and what value means "random guessing"?
Name two regression metrics and what each tells you.
Give one example of error analysis that a single accuracy number would hide.
How does Grad-CAM help you trust an image classifier?

What you'll be able to do

Key terms in this chapter