What you'll be able to do

  • Measure classification performance with precision, recall, and F1
  • Read a confusion matrix and pick the metric that matches the cost of each error
  • Investigate where a model fails through error analysis and interpretability
Competencies you'll build
  • Build and interpret a confusion matrix
  • Explain why accuracy misleads on imbalanced data
  • Use Grad-CAM to confirm an image model looks at the right region

Chapter - Model Evaluation

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program and is not derived from Automate the Boring Stuff with Python; it is written in a similar tone for continuity with the other chapters.

About this chapter

You have trained a model. Now the real question: is it any good, and how do you know? Training a model is easy; trusting one takes evidence. This chapter covers how to measure model performance, why a single number like accuracy can mislead, and how to investigate where a model fails — the Tier 2–4 capability the maturity model calls Algorithm Evaluation.

We will keep the running example from the rest of the program: a model that inspects weld images and flags defects.

Evaluate on data the model has not seen

A model's score on its training data tells you how well it memorized, not how well it will work. Always report performance on a held-out test set (and tune on a separate validation set). If you train and score on the same rows, you will be fooled every time.
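One way to picture the split is the minimal sketch below. It uses only the standard library; the function name `split_indices` and the 70/15/15 proportions are assumptions made for illustration, not values from this chapter. In practice, scikit-learn's `train_test_split` is a common way to do the same job.

```python
import random

def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices once, then cut them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Train on `train_idx`, tune on `val_idx`, and touch `test_idx` only once, for the final score.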

Classification metrics

Most inspection problems are classification: defect or no defect. Start from the four outcomes a binary classifier can produce.

                   Predicted defect       Predicted OK
Actually defect    True Positive (TP)     False Negative (FN)
Actually OK        False Positive (FP)    True Negative (TN)

A confusion matrix is just this table filled with counts. Almost every metric below is a ratio of these four cells.

[Figure: confusion matrix with annotated cells. True Positive: caught the defect. False Negative: missed defect (worst). False Positive: false alarm. True Negative: correctly cleared.]
The diagonal (TP, TN) is correct; the off-diagonal is the two kinds of error. For weld inspection a False Negative — a missed defect — is usually far more costly than a False Positive.
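Counting the four cells by hand is a good way to make the table concrete. The sketch below assumes labels are coded as 1 for defect and 0 for OK; the ten labels and the helper name `confusion_counts` are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels where 1 = defect and 0 = OK."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # caught the defect
        elif actual == 1 and predicted == 0:
            fn += 1  # missed defect
        elif actual == 0 and predicted == 1:
            fp += 1  # false alarm
        else:
            tn += 1  # correctly cleared
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Made-up labels for ten welds.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 6}
```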

Accuracy, and why it misleads

Accuracy is the fraction of predictions that were correct:

accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is intuitive but dangerous on imbalanced data. If only 2% of welds are defective, a model that predicts "OK" for everything scores 98% accuracy while catching zero defects. The number looks great and the model is useless.
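The 2% scenario is easy to reproduce. This sketch assumes a batch of 1,000 welds (a number chosen only for illustration) and a "model" that predicts OK for every one of them.

```python
n_welds = 1000
n_defective = 20  # 2% of the batch is defective

# 1 = defect, 0 = OK. The lazy model predicts OK every time.
y_true = [1] * n_defective + [0] * (n_welds - n_defective)
y_pred = [0] * n_welds

correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / n_welds
defects_caught = sum(1 for actual, predicted in zip(y_true, y_pred)
                     if actual == 1 and predicted == 1)

print(accuracy, defects_caught)  # 0.98 0
```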

Precision and recall

When one class is rare and important, use precision and recall.

  • Precision = of the welds we flagged, how many were really defective? TP / (TP + FP). High precision means few false alarms.
  • Recall = of the welds that were defective, how many did we catch? TP / (TP + FN). High recall means few misses.

There is usually a trade-off. Flag more aggressively and recall rises but precision falls. For safety inspection you typically favor recall — missing a real defect is the expensive error.
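To see the trade-off in numbers, the sketch below scores ten welds at a strict and a lenient threshold. The scores, labels, and thresholds are all made up, and returning 0.0 when nothing is flagged is a convention assumed here to avoid dividing by zero.

```python
def precision_recall(labels, scores, threshold):
    """Flag a weld as defective when its score is at or above the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up model scores for ten welds; 1 = real defect, 0 = OK.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]

strict = precision_recall(labels, scores, threshold=0.85)
lenient = precision_recall(labels, scores, threshold=0.35)
print("strict  (precision, recall):", strict)
print("lenient (precision, recall):", lenient)
```

At the strict threshold every flagged weld is a real defect, but half the defects slip through. At the lenient threshold every defect is caught, at the price of two false alarms.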

F1 score

The F1 score is the harmonic mean of precision and recall, a single number that rewards a model only when both are high:

F1 = 2 * (precision * recall) / (precision + recall)
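A quick comparison shows why the harmonic mean is the stricter choice. The precision and recall values below are invented to represent a model that flags almost nothing; the zero-denominator guard is an assumption added for safety.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# A model that is never wrong when it flags, but catches only 10% of defects.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)
print(round(arithmetic_mean, 3), round(f1, 3))  # 0.55 0.182
```

The plain average makes this model look mediocre; F1 makes it look as weak as its recall says it is.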

The ROC curve and AUC

Many classifiers output a probability, and you choose a threshold above which you flag a defect. The ROC curve plots recall (true-positive rate) against the false-positive rate as you sweep that threshold. AUC (area under the curve) summarizes it: 1.0 is perfect, 0.5 is random. AUC is useful for comparing models independent of any single threshold.
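The threshold sweep can be sketched in plain Python. The helper names `roc_points` and `auc_trapezoid` and the ten scores are made up for illustration; scikit-learn's `roc_curve` and `roc_auc_score` are the usual tools for real work.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up model scores for ten welds; 1 = real defect, 0 = OK.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]

points = roc_points(labels, scores)
auc = auc_trapezoid(points)
print(round(auc, 3))  # 0.875
```

An AUC of 0.875 means that in 87.5% of (defect, OK) pairs the model gives the defective weld the higher score.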

Regression metrics

If the target is a number (for example, predicted weld strength in MPa), use:

  • MAE (mean absolute error) — average size of the error, in the original units. Easy to explain.
  • RMSE (root mean squared error) — like MAE but punishes large errors more.
  • R² — the fraction of variance explained; 1.0 is perfect, 0 is no better than predicting the mean.
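All three metrics fit in a few lines. The five weld strengths below are invented for illustration; scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` compute the same quantities.

```python
import math

# Made-up measured and predicted weld strengths, in MPa.
y_true = [400, 420, 450, 480, 500]
y_pred = [410, 415, 440, 490, 520]

errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n              # average error size, in MPa
rmse = math.sqrt(sum(e * e for e in errors) / n)   # squaring weights big misses more

mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)                       # model's squared error
ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)  # error of predicting the mean
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 2), round(r2, 3))  # 11.0 12.04 0.893
```

RMSE comes out larger than MAE here because the single 20 MPa miss counts for more once it is squared.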

Look past the single number: error analysis

A score tells you how much the model is wrong, not where. Tier 3–4 work is the investigation:

  • Pull the misclassified examples and look for patterns — one lighting condition, one defect type, one camera.
  • Slice metrics by group (per defect class, per production line). A 90% overall recall can hide a 40% recall on the rarest, most critical defect.
  • Check calibration: when the model says "90% sure," is it right about 90% of the time?
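Slicing is mostly bookkeeping. The sketch below computes recall per defect type; the defect names, the counts, and the helper `recall_by_group` are all hypothetical.

```python
from collections import defaultdict

def recall_by_group(records):
    """records: (group, actual, predicted) tuples with 1 = defect, 0 = OK."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 1:           # recall only looks at real defects
            total[group] += 1
            if predicted == 1:
                caught[group] += 1
    return {group: caught[group] / total[group] for group in total}

# Made-up results: porosity defects are common, cracks are rare.
records = (
    [("porosity", 1, 1)] * 9 + [("porosity", 1, 0)] * 1
    + [("crack", 1, 1)] * 2 + [("crack", 1, 0)] * 3
)

per_group = recall_by_group(records)
overall = sum(1 for _, a, p in records if a == 1 and p == 1) / len(records)
print(per_group)          # {'porosity': 0.9, 'crack': 0.4}
print(round(overall, 3))  # 0.733
```

The overall recall looks tolerable, yet the model misses three of every five cracks. Only the per-group view shows it.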

Interpretability

For image models, Grad-CAM highlights the pixels that drove a prediction. It is how you confirm the model is looking at the weld bead and not at a label or a shadow in the background. A model that is right for the wrong reason will fail in production. (See the Grad-CAM computer-vision lab.)
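The core Grad-CAM arithmetic is short once you have two arrays from the network: the feature maps of a convolutional layer and the gradients of the class score with respect to those maps. The sketch below assumes both have already been extracted (that step depends on your framework and is not shown) and uses a tiny made-up example. NumPy is assumed to be installed.

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Both inputs have shape (channels, height, width)."""
    # One weight per channel: the average gradient over the spatial grid.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, giving a (height, width) map.
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only regions that push the class score up.
    cam = np.maximum(cam, 0)
    # Scale to 0..1 so it can be overlaid on the image.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 2 channels on a 2x2 grid (values are made up).
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 2.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])

heatmap = grad_cam_heatmap(activations, gradients)
print(heatmap)
```

In the toy example only the top-left cell lights up: the second channel is active in the bottom-right cell, but its negative gradient means it argues against the class, so it is clipped to zero.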

A short evaluation checklist

  1. Score on a held-out test set, never the training set.
  2. For classification, report a confusion matrix plus precision, recall, and F1 — not accuracy alone.
  3. Favor the metric that matches the cost of each error (usually recall for safety).
  4. Slice metrics by class and group to find hidden weak spots.
  5. Use interpretability to confirm the model looks at the right thing.
Practice Questions

  1. Why is it wrong to report a model's accuracy on its training data?
  2. Define true positive, false positive, false negative, and true negative for weld inspection.
  3. A defect detector reaches 98% accuracy on a dataset that is 2% defective. Why might it still be useless?
  4. In your own words, what is the difference between precision and recall?
  5. For safety-critical inspection, which error is usually worse: a false positive or a false negative? Why?
  6. What does the F1 score combine, and why use the harmonic mean?
  7. What does AUC measure, and what value means "random guessing"?
  8. Name two regression metrics and what each tells you.
  9. Give one example of error analysis that a single accuracy number would hide.
  10. How does Grad-CAM help you trust an image classifier?

Try it: the precision–recall trade-off

Each dot is a weld; filled dots are real defects. Drag the threshold — welds scoring at or above it are flagged as defects. Watch recall and precision move in opposite directions.

[Interactive demo readout at the current threshold: 6 true positives, 2 false negatives, 2 false positives, 10 true negatives. Precision 75%, recall 75%, F1 75%, accuracy 80%. ROC plot axes: false-positive rate (x) and true-positive rate (y).]

The dot is your current threshold on the ROC curve. Sweeping the threshold traces the whole curve; the closer it hugs the top-left corner, the better the model separates the classes.

Lower the threshold and recall climbs (you catch more defects) but precision falls (more false alarms). For safety inspection you usually accept false alarms to keep recall high.
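You can check the demo's readout by hand. This short sketch plugs the four counts shown at the demo's starting threshold (6 true positives, 2 false negatives, 2 false positives, 10 true negatives) into the formulas from earlier in the chapter.

```python
# Counts from the demo's starting threshold.
tp, fn, fp, tn = 6, 2, 2, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)  # 0.75 0.75 0.75 0.8
```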


Check your understanding

Tier 3 depth · Design & algorithm reasoning

  1. A weld-defect detector scores 98% accuracy on data that is 2% defective. What should you suspect?

  2. "Of the welds the model flagged as defective, how many were actually defective?" Which metric answers this question?

  3. For safety-critical weld inspection, which error is usually the most costly?

  4. What does the F1 score combine?

  5. Why use Grad-CAM when evaluating an image classifier?
