What you'll be able to do

  • Distinguish regression from classification, and supervised from unsupervised learning
  • Match a problem to an appropriate algorithm family
  • Reason about when a model is over- or under-fitting

Competencies you'll build

  • Identify the learning framework for a given target variable
  • Explain why accuracy can mislead on imbalanced data
  • Name a suitable baseline model for a new problem

Chapter - Algorithm Families

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program and is not derived from Automate the Boring Stuff with Python; it is written in a similar tone for continuity with the other chapters.

About this chapter

So far, you have learned how to write Python scripts, summarize datasets, visualize patterns, and prepare features for modeling. Those skills help you ask: What information do I have, and is it ready for a model?

This chapter focuses on the next question: What kind of model should I try?

Machine learning algorithms can be grouped into algorithm families. A family is a group of methods that solve similar problems or learn in similar ways. Some algorithms predict numbers. Some predict categories. Some find groups without being told the right answer. Some learn from feedback over time.

The examples in this chapter use manufacturing, inspection, and welding language, but the ideas apply to many kinds of data.

Start with the prediction target

Before choosing an algorithm, identify the target.

The target is the value you want the model to predict, explain, or discover.

Ask:

  • Is the target a number?
  • Is the target a category?
  • Is there a target at all?
  • Is the goal to take actions and learn from rewards?

These questions point you toward the right learning framework and algorithm family.
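The questions above can be sketched as a small check on the target column. This is only an illustration: `suggest_problem_type` is a hypothetical helper, not part of pandas or scikit-learn, and it assumes the target is either a pandas Series or `None` when no target exists. It will mislabel categories stored as numeric codes (such as 0 and 1), so treat its answer as a prompt for thought rather than a rule.

```python
import pandas as pd


def suggest_problem_type(target):
    """Suggest a learning setup from a target column (hypothetical helper)."""
    if target is None:
        return 'unsupervised (no target)'
    # Booleans count as numeric in pandas, so check for them first.
    if pd.api.types.is_bool_dtype(target):
        return 'classification'
    if pd.api.types.is_numeric_dtype(target):
        return 'regression'
    return 'classification'


# Made-up example targets
strength = pd.Series([485.2, 502.7, 461.9], name='tensile_strength_mpa')
result = pd.Series(['pass', 'fail', 'pass'], name='inspection_result')

print(suggest_problem_type(strength))
print(suggest_problem_type(result))
print(suggest_problem_type(None))
```

Reward-based problems are left out on purpose: whether a task calls for reinforcement learning depends on how the problem is set up, not on the type of a column.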

Regression vs. classification

Two of the most common supervised learning problem types are regression and classification.

[Figure: two panels. Left: a scatter of points with a fitted straight line predicting tensile strength from weld length. Right: two colored classes of points separated by a learned boundary.]
Regression fits a line (or curve) through the data to predict a continuous number. Classification learns a boundary that separates the classes so new points can be labeled.

Code that generated this plot:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.2))

# Regression: fit a line through points
rng = np.random.default_rng(1)
x = np.linspace(20, 60, 30)
y = 1.8 * x + 380 + rng.normal(0, 12, size=30)
ax1.scatter(x, y, color='#1f2937', s=25)
ax1.plot(x, np.polyval(np.polyfit(x, y, 1), x), color='#2457c5', linewidth=2.5)
ax1.set_title('Regression — predict a number')

# Classification: learn a boundary between two classes
X, yc = make_blobs(n_samples=50, centers=[[-2, -2], [2, -1]], cluster_std=1.1, random_state=3)
clf = LogisticRegression().fit(X, yc)
xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 200),
                     np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax2.contourf(xx, yy, Z, alpha=0.6)
ax2.scatter(X[:, 0], X[:, 1], c=yc, edgecolor='white', s=35)
ax2.set_title('Classification — find a boundary')
plt.tight_layout()
plt.show()

Problem type   | What the model predicts | Example target
Regression     | A numeric value         | Weld strength, defect size, remaining useful life
Classification | A category or class     | pass / fail, defect type, material group

The difference is not the input data. The same input features might be used for either problem.

For example, these features:

  • weld_length_mm
  • voltage
  • travel_speed
  • mean_pixel
  • material

could be used to predict a numeric target:

text
tensile_strength_mpa = 485.2

That is a regression problem.

The same features could also be used to predict a category:

text
inspection_result = 'fail'

That is a classification problem.

When to use regression

Use regression when the thing you want to predict is numeric and ordered.

Common regression targets include:

  • Temperature.
  • Pressure.
  • Time until failure.
  • Defect length.
  • Weld strength.
  • Cost.
  • Probability-like scores, when treated as continuous values.

For example:

python
target = 'defect_length_mm'

If the model predicts 2.5, 3.0, or 7.8, those values have numeric meaning. A prediction of 7.8 mm is larger than 3.0 mm.

Regression models are usually evaluated with metrics such as:

  • Mean absolute error.
  • Mean squared error.
  • Root mean squared error.
  • R-squared.

The exact metric should match the practical question. If being wrong by 2 mm is easy to understand, mean absolute error may be useful.
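As a minimal sketch, the four metrics can be computed with scikit-learn on a handful of made-up defect lengths. The numbers below are invented for illustration and do not come from any real inspection data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted defect lengths, in mm
y_true = np.array([2.0, 3.0, 5.0, 8.0])
y_pred = np.array([2.5, 3.0, 4.0, 7.5])

mae = mean_absolute_error(y_true, y_pred)   # average size of the error, in mm
mse = mean_squared_error(y_true, y_pred)    # squares the errors, so large misses count more
rmse = np.sqrt(mse)                         # back in mm, still sensitive to large misses
r2 = r2_score(y_true, y_pred)               # share of target variance explained

print(f'MAE  = {mae:.3f} mm')
print(f'MSE  = {mse:.3f} mm^2')
print(f'RMSE = {rmse:.3f} mm')
print(f'R^2  = {r2:.3f}')
```

Here the mean absolute error is 0.5 mm, which reads directly as "the predictions are off by half a millimeter on average."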

When to use classification

Use classification when the thing you want to predict is a category.

Common classification targets include:

  • pass or fail.
  • defective or not_defective.
  • porosity, crack, undercut, or no_defect.
  • low_risk, medium_risk, or high_risk.
  • Material type.

For example:

python
target = 'defect_class'

If the model predicts crack, that is a class label. It is not larger or smaller than porosity. It is simply a different category.

Classification models are usually evaluated with metrics such as:

  • Accuracy.
  • Precision.
  • Recall.
  • F1 score.
  • Confusion matrix.
  • ROC-AUC, especially for binary classification.

The metric matters. In a rare-defect problem, high accuracy can be misleading if the model simply predicts no_defect for almost everything.
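A small sketch makes the problem concrete. It assumes a made-up batch of 100 welds with 5 real defects and a "model" that predicts no defect every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 1 = defect, 0 = no defect (5 defects out of 100 welds)
y_true = np.array([1] * 5 + [0] * 95)

# A lazy "model" that always predicts no defect
y_pred = np.zeros(100, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # share of real defects that were caught

print(f'accuracy = {accuracy:.2f}')
print(f'recall   = {recall:.2f}')
```

Accuracy is 0.95 even though the model catches none of the five defects, so recall is 0.0. On imbalanced data, recall, precision, or a confusion matrix shows what accuracy hides.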

A quick decision checklist

Use this checklist before choosing an algorithm:

  1. Name the target. Write down exactly what the model should predict.
  2. Check the target type. Is it numeric, categorical, absent (no labels), or reward-based?
  3. Check when the prediction is needed. Do not use features that appear only after the answer is known.
  4. Check the cost of mistakes. Is a false negative worse than a false positive?
  5. Start simple. Use a baseline model before trying a complex model.
  6. Compare fairly. Train and test models using the same data split and metric.

Learning frameworks

A learning framework describes how the model receives information.

Framework                | What the model learns from          | Common goal
Supervised learning      | Examples with known answers         | Predict a target
Unsupervised learning    | Examples without known answers      | Find structure
Self-supervised learning | Labels created from the data itself | Learn useful representations
Reinforcement learning   | Actions, rewards, and feedback      | Learn a policy for decisions

These frameworks are not competing answers to every problem. They solve different kinds of problems.

Supervised learning

In supervised learning, the training data includes inputs and known answers.

For example:

weld_id | voltage | travel_speed | defect_class
W001    | 22.1    | 4.8          | no_defect
W002    | 19.5    | 5.5          | porosity
W003    | 24.3    | 4.1          | crack

The features are voltage and travel_speed. The label is defect_class.

Supervised learning is useful when:

  • You have reliable labels.
  • You know what you want to predict.
  • Future examples will look similar to training examples.
  • You can measure performance against known answers.

Supervised learning has limitations:

  • Labels can be expensive or slow to create.
  • Bad labels teach the model bad patterns.
  • The model may fail when future data differs from training data.
  • A model can learn shortcuts if leakage features are present.

Most regression and classification algorithms are supervised learning methods.

Unsupervised learning

In unsupervised learning, the data does not include known answers.

For example, you may have measurements from many welds but no defect labels:

weld_id | voltage | travel_speed | mean_pixel
W001    | 22.1    | 4.8          | 82.4
W002    | 19.5    | 5.5          | 61.7
W003    | 24.3    | 4.1          | 91.2

An unsupervised method might look for groups, unusual examples, or hidden structure.

Unsupervised learning is useful when:

  • Labels are not available.
  • You want to explore a dataset.
  • You want to find clusters or outliers.
  • You want to reduce dimensionality before visualization.

Unsupervised learning has limitations:

  • There may be no single correct answer.
  • Clusters may not match real-world categories.
  • Results can be hard to validate.
  • Parameter choices can strongly affect the result.

Unsupervised learning is often used early in a project to understand data before building a supervised model.

Self-supervised learning

In self-supervised learning, the model creates a training task from the data itself.

For example, a model might learn from weld images by trying to:

  • Predict a missing part of an image.
  • Tell whether two image crops came from the same original image.
  • Reconstruct a noisy image.
  • Predict the next value in a sensor sequence.

The goal is often to learn a useful internal representation before training on a smaller labeled dataset.

Self-supervised learning is useful when:

  • You have lots of unlabeled data.
  • Labels are expensive.
  • You want to pretrain a model before fine-tuning.
  • The raw data has structure, such as images, text, audio, or time series.

Self-supervised learning has limitations:

  • It can require large datasets and significant computing power.
  • The pretraining task may not match the final task.
  • It can be harder to explain than simpler supervised methods.
  • It usually requires more machine learning engineering.

Self-supervised learning is common in modern computer vision, language models, and sensor-sequence modeling.
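To show how labels can come from the data itself, here is a minimal sketch of the "predict the next value" task. The helper `make_next_value_pairs` and the voltage readings are hypothetical. Real self-supervised pipelines are far larger, but the idea is the same: each training target is simply the value that follows a window of earlier values.

```python
import numpy as np


def make_next_value_pairs(sequence, window):
    """Turn an unlabeled 1-D sequence into (inputs, targets) pairs.

    Each input row holds `window` consecutive readings; the target is
    the reading that comes immediately after them.
    """
    sequence = np.asarray(sequence, dtype=float)
    n_pairs = len(sequence) - window
    inputs = np.array([sequence[i:i + window] for i in range(n_pairs)])
    targets = sequence[window:]
    return inputs, targets


# Made-up voltage readings with no human-provided labels
voltage = [22.1, 22.3, 22.0, 21.8, 22.4, 22.6]
X, y = make_next_value_pairs(voltage, window=3)

print(X)
print(y)
```

No one labeled these readings, yet the result is an ordinary supervised dataset that any regression model could train on.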

Reinforcement learning

In reinforcement learning, an agent learns by taking actions and receiving rewards or penalties.

The model does not simply predict a label. It learns a policy, which is a strategy for choosing actions.

For example, a reinforcement learning setup might involve:

  • An agent controlling a simulated robot.
  • Actions that move the robot arm.
  • Rewards for completing a task safely and accurately.
  • Penalties for collisions, delays, or wasted motion.

Reinforcement learning is useful when:

  • The problem involves sequences of decisions.
  • Actions affect future states.
  • A simulator or safe training environment is available.
  • The goal can be expressed as a reward.

Reinforcement learning has limitations:

  • It can require many trials.
  • It can be unstable and hard to debug.
  • Poor reward design can teach unwanted behavior.
  • It is often risky to train directly on real equipment.

For many manufacturing analytics problems, supervised or unsupervised learning is the better starting point. Reinforcement learning becomes more relevant when the task is about control, planning, or sequential decision-making.
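To make agent, action, reward, and policy concrete, here is a toy sketch using tabular Q-learning, one common reinforcement learning algorithm that this chapter does not otherwise cover. The environment is hypothetical: a five-position corridor where the agent starts at position 0 and is rewarded for reaching position 4. The learning rate, discount, exploration rate, and reward values are assumptions chosen for illustration.

```python
import numpy as np

N_STATES = 5        # positions 0..4; position 4 is the goal
ACTIONS = [-1, +1]  # action 0 = move left, action 1 = move right
GOAL = N_STATES - 1

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))  # estimated value of each action in each state
alpha, gamma, epsilon = 0.5, 0.9, 0.2         # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    for step in range(100):  # cap the episode length
        # Explore occasionally; otherwise take the best-known action
        if rng.random() < epsilon:
            action = int(rng.integers(len(ACTIONS)))
        else:
            action = int(np.argmax(q_table[state]))

        next_state = min(max(state + ACTIONS[action], 0), GOAL)
        done = next_state == GOAL
        reward = 1.0 if done else -0.01  # small penalty for wasted motion

        # Q-learning update: move the estimate toward reward + discounted future value
        target = reward if done else reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])

        state = next_state
        if done:
            break

# The learned policy: the preferred action in each non-goal position
policy = ['left' if a == 0 else 'right' for a in np.argmax(q_table[:-1], axis=1)]
print(policy)
```

The agent is never told that "right" is correct. It discovers that by trial and reward, which is why reinforcement learning needs many trials and a safe place to make mistakes.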

Algorithm families

An algorithm is a specific method for learning from data. An algorithm family is a group of related methods.

This chapter introduces these common families:

  • Linear Regression.
  • Decision Trees.
  • Random Forest.
  • Logistic Regression.
  • Support Vector Machines.
  • Naive Bayes.
  • Neural Networks.
  • Gradient Boosting.
  • Clustering.

No algorithm is best for every problem. The right choice depends on the target, data size, feature types, noise, interpretability needs, and deployment constraints.

Linear Regression

Linear Regression predicts a numeric value by fitting a straight-line relationship between features and the target.

For one feature, the idea looks like this:

text
predicted_strength = intercept + slope * weld_length_mm

With several features, the model assigns a weight to each feature:

text
predicted_strength = intercept
                     + weight_1 * voltage
                     + weight_2 * travel_speed
                     + weight_3 * heat_input

Linear Regression is usually used for regression problems.

Example:

python
from sklearn.linear_model import LinearRegression

# df is assumed to be a pandas DataFrame of weld records prepared in earlier chapters
X = df[['voltage', 'travel_speed', 'heat_input']]
y = df['tensile_strength_mpa']

model = LinearRegression()
model.fit(X, y)

predictions = model.predict(X)

[Figure: a scatter of tensile strength versus weld length with a straight fitted regression line rising through the points.]
Linear Regression fits the straight line that best predicts a numeric target; here, tensile strength from weld length.

Code that generated this plot:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
weld_length_mm = np.linspace(20, 60, 40)
tensile = 1.8 * weld_length_mm + 380 + rng.normal(0, 14, size=40)

model = LinearRegression().fit(weld_length_mm.reshape(-1, 1), tensile)
line = model.predict(weld_length_mm.reshape(-1, 1))

plt.scatter(weld_length_mm, tensile, color='#1f2937', s=25, label='data')
plt.plot(weld_length_mm, line, color='#2457c5', linewidth=2.5, label='fitted line')
plt.xlabel('weld length (mm)')
plt.ylabel('tensile strength (MPa)')
plt.legend()
plt.show()

Linear Regression is useful when:

  • The target is numeric.
  • You need a simple baseline.
  • Relationships are roughly linear.
  • Interpretability matters.
  • The patterns in the data are not highly complex; dataset size alone is rarely a problem for a linear model.

Linear Regression has limitations:

  • It struggles with nonlinear patterns unless you create useful features.
  • Outliers can strongly affect the fitted line.
  • It assumes the target changes in a smooth additive way.
  • It may underfit complex data such as images.

Linear Regression is often a good first model for numeric prediction because it is simple and easy to explain.
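Part of what makes the model easy to explain is that its learned weights can be read directly. The sketch below generates synthetic data from an assumed relationship (the 380, 1.8, and 2.5 values are invented for this example) and checks that the fitted coefficients land close to them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weld_length_mm = rng.uniform(20, 60, size=200)
voltage = rng.uniform(18, 26, size=200)

# Assumed relationship, used only to generate synthetic data
strength = 380 + 1.8 * weld_length_mm + 2.5 * voltage + rng.normal(0, 5, size=200)

X = np.column_stack([weld_length_mm, voltage])
model = LinearRegression().fit(X, strength)

# Each coefficient is the predicted change in strength per one-unit change in that feature
for name, weight in zip(['weld_length_mm', 'voltage'], model.coef_):
    print(f'{name}: {weight:.2f} MPa per unit')
print(f'intercept: {model.intercept_:.1f} MPa')
```

On real data the true weights are unknown, so the coefficients describe the fitted model, not necessarily the physical process.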

Logistic Regression

Logistic Regression is used for classification, even though its name includes the word regression.

It takes input features, calculates a score, and passes that score through a sigmoid function to produce a value between 0 and 1.

For binary classification:

text
probability_of_defect = sigmoid(score)

Then a threshold is used to choose a class:

text
if probability_of_defect >= 0.5:
    predict 'defective'
else:
    predict 'not_defective'

Example:

python
from sklearn.linear_model import LogisticRegression

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['is_defective']

model = LogisticRegression()
model.fit(X, y)

predicted_classes = model.predict(X)
predicted_probabilities = model.predict_proba(X)

[Figure: a sigmoid probability curve rising from 0 to 1 across mean pixel value, with a dashed 0.5 threshold line and the two classes of points at the bottom and top.]
Logistic Regression passes a score through a sigmoid to output a probability; the 0.5 line is the default class boundary.

Code that generated this plot:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
mean_pixel = np.concatenate([rng.normal(70, 8, 40), rng.normal(100, 8, 40)])
is_defective = np.concatenate([np.ones(40), np.zeros(40)])  # darker -> defect

model = LogisticRegression().fit(mean_pixel.reshape(-1, 1), is_defective)
grid = np.linspace(mean_pixel.min(), mean_pixel.max(), 300).reshape(-1, 1)
prob = model.predict_proba(grid)[:, 1]

plt.scatter(mean_pixel, is_defective, c=is_defective, edgecolor='white', s=30, alpha=0.8)
plt.plot(grid, prob, color='#2457c5', linewidth=2.5, label='P(defect)')
plt.axhline(0.5, color='#94a3b8', linestyle='--', label='0.5 threshold')
plt.xlabel('mean pixel value')
plt.ylabel('probability of defect')
plt.legend()
plt.show()

Logistic Regression is useful when:

  • The target is categorical.
  • You need a strong simple baseline.
  • You want probability-like outputs.
  • Interpretability matters.
  • Features are already meaningful and well prepared.

Logistic Regression has limitations:

  • It draws mostly linear decision boundaries unless features are transformed.
  • It may struggle with complex image, text, or sensor patterns.
  • It can be sensitive to feature scaling.
  • It can perform poorly if classes are not separable using the available features.

Logistic Regression is a common starting point for binary classification.
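The sigmoid-and-threshold step described earlier in this section can be sketched in a few lines of NumPy. The scores below are made up, and the 0.2 threshold is an assumed alternative for a case where missing a defect costs more than a false alarm.

```python
import numpy as np


def sigmoid(score):
    """Squash any real-valued score into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-score))


# Made-up scores for four welds
scores = np.array([-2.0, -1.0, 0.5, 3.0])
probabilities = sigmoid(scores)

default_labels = probabilities >= 0.5   # the usual threshold
cautious_labels = probabilities >= 0.2  # flags more welds as defective

print(np.round(probabilities, 3))
print(default_labels)
print(cautious_labels)
```

The second weld has a defect probability of about 0.27. It passes under the default threshold but is flagged under the lower one, which shows that the threshold is a decision about the cost of mistakes and not a fixed property of the model.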

Decision Trees

A Decision Tree makes predictions by asking a sequence of yes-or-no questions.

For example:

text
Is mean_pixel < 70?
    yes -> Is defect_area > 12?
        yes -> predict 'defective'
        no  -> predict 'not_defective'
    no  -> predict 'not_defective'

Decision Trees can be used for classification or regression.

Example:

python
from sklearn.tree import DecisionTreeClassifier

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['defect_class']

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X, y)

predictions = model.predict(X)

[Figure: three colored classes of points with rectangular, axis-aligned decision regions filling the background.]
A Decision Tree carves the feature space into axis-aligned rectangles, one region per sequence of yes/no splits.

Code that generated this plot:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=60, centers=[[-2, -2], [2, -1], [0, 2.5]],
                  cluster_std=1.1, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 300),
                     np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.7)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='white', s=40)
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.title('Decision Tree — axis-aligned regions')
plt.show()

Decision Trees are useful when:

  • You want a model that is easy to visualize.
  • The data has nonlinear rules.
  • Features include mixed scales.
  • You want to understand decision paths.

Decision Trees have limitations:

  • A deep tree can overfit the training data.
  • Small data changes can create a different tree.
  • Single trees are often less accurate than ensemble methods.
  • Very large trees can become hard to explain.

Decision Trees are helpful teaching models because they match how people often describe rules.
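Tree depth is a convenient way to see under- and over-fitting side by side. The sketch below uses a synthetic dataset with deliberately noisy labels (an assumption for illustration) and compares training and test accuracy at three depths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y randomly reassigns a share of labels to act as noise
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for depth in [1, 3, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f'max_depth={depth}: train={scores[depth][0]:.2f}, test={scores[depth][1]:.2f}')
```

The unlimited tree memorizes the training set, noise included, so its training accuracy is perfect while its test accuracy is lower. A large gap between the two is the usual sign of over-fitting; low accuracy on both suggests under-fitting.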

Random Forest

A Random Forest is an ensemble of many Decision Trees.

An ensemble combines multiple models to produce a stronger result. A Random Forest trains each tree on a different bootstrap sample of the rows (sampled with replacement) and considers only a random subset of features at each split. The final prediction is based on voting for classification or averaging for regression.

For classification:

text
tree_1 predicts 'porosity'
tree_2 predicts 'crack'
tree_3 predicts 'porosity'

forest prediction = 'porosity'

Example:

python
from sklearn.ensemble import RandomForestClassifier

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['defect_class']

model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
)
model.fit(X, y)

predictions = model.predict(X)

[Figure: the same three classes of points as the decision tree, but with smoother, less blocky decision regions in the background.]
A Random Forest votes across many trees, producing smoother boundaries than the blocky regions of a single tree.

Code that generated this plot:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=60, centers=[[-2, -2], [2, -1], [0, 2.5]],
                  cluster_std=1.1, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 300),
                     np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.7)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='white', s=40)
plt.title('Random Forest — smoother regions')
plt.show()

Random Forest is useful when:

  • You want strong performance with limited tuning.
  • The data has nonlinear patterns.
  • You have tabular data with many feature types.
  • You want feature importance estimates.
  • You want a model less fragile than one Decision Tree.

Random Forest has limitations:

  • It is less interpretable than a single tree.
  • Large forests can be slower and use more memory.
  • Regression predictions are step-like rather than smooth, and they cannot extrapolate beyond the range of target values seen in training.
  • It may not perform as well as gradient boosting on some structured datasets.

Random Forest is a strong general-purpose choice for many tabular classification and regression problems.
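Feature importance estimates can be read from a fitted forest through its `feature_importances_` attribute. In this sketch the data is synthetic and the label depends only on `defect_area` by construction, so that feature should dominate. The feature names and the 12-unit cutoff are assumptions borrowed from the chapter's examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
defect_area = rng.uniform(0, 20, n)
mean_pixel = rng.uniform(50, 120, n)  # unrelated to the label in this synthetic data
noise_feature = rng.normal(size=n)    # pure noise

# By construction, the label depends only on defect_area
y = (defect_area > 12).astype(int)

feature_names = ['defect_area', 'mean_pixel', 'noise_feature']
X = np.column_stack([defect_area, mean_pixel, noise_feature])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(feature_names, model.feature_importances_))

for name, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f'{name}: {value:.3f}')
```

The importances sum to 1 and describe how much the forest relied on each feature. They do not prove that a feature causes the outcome, and they can be misleading when features are strongly correlated.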

Support Vector Machines

Support Vector Machines, or SVMs, try to find a boundary that separates classes with the widest possible margin.

For a binary classification problem, imagine separating two groups of points with a line. The SVM tries to place the line so that it leaves as much space as possible between the classes.

SVMs can also use kernels, which allow them to create nonlinear boundaries.

Example:

python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['is_defective']

model = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf')
)
model.fit(X, y)

predictions = model.predict(X)

[Figure: two classes of points separated by a solid boundary line with dashed margin lines on either side; the points lying on or inside the margins are circled as support vectors.]
An SVM places the boundary (solid) to maximize the margin (dashed). Only the circled support vectors, the points on or inside the margin, determine where it goes.

Code that generated this plot:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=[[-2, -2], [2, -1]], cluster_std=1.1, random_state=5)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='white', s=40)
ax = plt.gca()
xx, yy = np.meshgrid(np.linspace(*ax.get_xlim(), 200), np.linspace(*ax.get_ylim(), 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='#1f2937',
           linestyles=['--', '-', '--'])
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=160, facecolors='none', edgecolors='#2f7d32', linewidths=2)
plt.title('SVM — margin & support vectors')
plt.show()

SVMs are useful when:

  • The dataset is small or medium-sized.
  • The classes have a clear boundary.
  • Feature scaling is handled carefully.
  • You want a strong classifier without using a neural network.
  • Nonlinear kernels may help separate classes.

SVMs have limitations:

  • They can be slow on large datasets.
  • They are sensitive to scaling and parameter choices.
  • Results can be harder to explain than trees or linear models.
  • Probability estimates require extra calibration.

SVMs can work well, but they usually require more care with scaling and tuning than simpler baselines.
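One way to handle scaling carefully is to put the scaler and the SVM in a single pipeline and cross-validate the pipeline, so the scaler is refit on the training part of each fold. The sketch below uses synthetic data and inflates one feature by a factor of 1000 (an arbitrary assumption) to mimic features measured in very different units.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[:, 0] *= 1000  # exaggerate the scale of one feature

# Same model with and without scaling, scored on the same five folds
unscaled_scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=5)
scaled_scores = cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel='rbf')), X, y, cv=5)

print(f'without scaling: {unscaled_scores.mean():.2f}')
print(f'with scaling:    {scaled_scores.mean():.2f}')
```

Scaling often helps an RBF-kernel SVM because the kernel works on distances, and an unscaled large-valued feature can dominate them. How much it helps depends on the data, so compare both on your own dataset.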

Naive Bayes

Naive Bayes is a classification family based on probability.

It estimates how likely each class is given the observed features, then predicts the most likely class.

It is called "naive" because it assumes features are conditionally independent given the class. In real datasets, this assumption is often not perfectly true, but the method can still work surprisingly well.

For example, a text classifier might look at words in an inspection note:

text
"visible crack near edge"

Words such as crack and edge may increase the probability of the crack class.

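Example (GaussianNB suits continuous measurements such as these; for word counts from text, MultinomialNB is the usual variant):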

python
from sklearn.naive_bayes import GaussianNB

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['defect_class']

model = GaussianNB()
model.fit(X, y)

predictions = model.predict(X)

Naive Bayes is useful when:

  • You need a fast classification baseline.
  • The dataset is small.
  • The features are counts, words, or simple measurements.
  • You want a model that trains quickly.
  • You are working with text classification.

Naive Bayes has limitations:

  • The independence assumption may be unrealistic.
  • It may underperform when feature interactions matter.
  • Probability estimates can be poorly calibrated.
  • It is usually not the best choice for complex image patterns.

Naive Bayes is especially useful as a quick baseline and for some text problems.
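For the inspection-note case, a minimal text baseline might combine word counts with `MultinomialNB`. The six notes and their labels below are invented for illustration; a real baseline would need far more examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up inspection notes and labels
notes = [
    'visible crack near edge',
    'hairline crack along seam',
    'small pores on surface',
    'gas pores clustered at start',
    'clean weld no issues',
    'smooth bead clean surface',
]
labels = ['crack', 'crack', 'porosity', 'porosity', 'no_defect', 'no_defect']

# CountVectorizer turns each note into word counts; MultinomialNB learns
# how often each word appears in each class
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(notes, labels)

prediction = model.predict(['crack near the edge'])[0]
print(prediction)
```

The words crack, near, and edge all appear in the crack notes and nowhere else, so the model predicts crack. The word "the" never appeared in training and is ignored.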

Neural Networks

Neural Networks are models made of connected layers of simple mathematical units.

Each layer transforms the data into a new representation. With enough data and careful training, neural networks can learn complex patterns in images, text, audio, and sensor sequences.

A simple neural network might take tabular features:

text
voltage, travel_speed, heat_input -> hidden layers -> defect probability

A convolutional neural network might take image pixels:

text
weld image -> image filters -> learned features -> defect class

Example:

python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['defect_class']

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), random_state=42)
)
model.fit(X, y)

predictions = model.predict(X)

Neural Networks are useful when:

  • The data has complex nonlinear patterns.
  • You have many examples.
  • You are working with images, text, audio, or sequences.
  • Feature engineering by hand is difficult.
  • Predictive performance is more important than simple explanation.

Neural Networks have limitations:

  • They often require more data than simpler models.
  • They can require significant computing power.
  • They can be hard to interpret.
  • Training involves many choices, such as architecture and learning rate.
  • They can overfit if the dataset is small.

Neural Networks are powerful, but they are not automatically the best choice for every dataset.
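To see what "each layer transforms the data" means, here is a sketch of a single forward pass through a tiny network written in plain NumPy. The weights are random and untrained, and the input values are made up and assumed to be already scaled, so the output probabilities are meaningless. The point is only the shape of the computation.

```python
import numpy as np


def relu(z):
    """Hidden-layer activation: keep positive values, zero out the rest."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Output activation: squash a score into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# Random, untrained weights: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

# Two made-up welds: scaled voltage, travel_speed, heat_input
features = np.array([[0.5, -0.2, 0.1],
                     [-1.0, 0.8, -0.3]])

hidden = relu(features @ W1 + b1)               # first transformation
defect_probability = sigmoid(hidden @ W2 + b2)  # second transformation

print(hidden.shape)
print(defect_probability.ravel())
```

Training consists of adjusting `W1`, `b1`, `W2`, and `b2` so that the outputs match known labels. Libraries such as scikit-learn's MLPClassifier handle that step.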

Gradient Boosting

Gradient Boosting is an ensemble method that builds models one after another.

Each new model tries to correct mistakes made by the previous models. On many tabular datasets, gradient boosting methods are among the strongest traditional machine learning approaches.

The basic idea:

text
model_1 makes initial predictions
model_2 learns from model_1's errors
model_3 learns from remaining errors
final prediction combines all models

Example:

python
from sklearn.ensemble import GradientBoostingClassifier

X = df[['mean_pixel', 'std_pixel', 'defect_area']]
y = df['defect_class']

model = GradientBoostingClassifier(random_state=42)
model.fit(X, y)

predictions = model.predict(X)

Gradient Boosting is useful when:

  • You want strong performance on tabular data.
  • The data has nonlinear relationships.
  • You can spend time tuning parameters.
  • You want an ensemble that focuses on hard examples.
  • You need classification or regression.

Gradient Boosting has limitations:

  • It can overfit if not tuned carefully.
  • It can be slower to train than simpler models.
  • It has more parameters to understand.
  • It is usually less interpretable than a single tree.

Gradient Boosting is often worth trying after simple baselines and Random Forest.
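The "learn from the remaining errors" idea can be sketched by hand for a regression target. This is a simplified illustration, not how GradientBoostingClassifier is implemented internally: it assumes squared error as the loss, shallow trees as the stage models, a learning rate of 0.5, and a synthetic sine-shaped target.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # synthetic nonlinear target

learning_rate = 0.5

# Stage 0: start from a constant prediction, the mean of the target
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for stage in range(10):
    residuals = y - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add a damped correction
    mse_history.append(np.mean((y - prediction) ** 2))

print([round(m, 4) for m in mse_history])
```

The training error shrinks at every stage because each tree targets what is left over. That is also why boosting can overfit: with enough stages it starts fitting the noise, so the number of stages and the learning rate should be checked on validation data.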

Clustering

Clustering means grouping examples that are similar to each other.

Unlike classification, clustering usually does not use known labels. It is an unsupervised learning method.

For example, clustering might group welds based on sensor measurements:

text
cluster 0 -> stable process measurements
cluster 1 -> unusually high heat input
cluster 2 -> low brightness images

The cluster numbers do not automatically have real-world meaning. A person still needs to inspect and interpret them.

Example:

python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = df[['voltage', 'travel_speed', 'mean_pixel']]

model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=42)
)
cluster_ids = model.fit_predict(X)

df['cluster_id'] = cluster_ids

[Figure: three colored groups of points with a large black X marking the centroid of each group.]
K-Means groups unlabeled points into clusters and marks each cluster's centroid. The cluster numbers carry no meaning until a person interprets them.

Code that generated this plot:
python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=60, centers=[[-2, -2], [2, -1], [0, 2.5]],
                  cluster_std=1.1, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, edgecolor='white', s=40)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker='X', s=220, c='black', label='centroids')
plt.legend()
plt.title('K-Means — clusters and centroids')
plt.show()

Clustering is useful when:

  • You do not have labels.
  • You want to explore natural groups.
  • You want to find unusual examples.
  • You want to create rough segments for review.
  • You want to compare machine-discovered groups with domain knowledge.

Clustering has limitations:

  • The number of clusters may be unclear.
  • Different algorithms can produce different groups.
  • Scaling can strongly affect results.
  • Clusters may not match useful business or engineering categories.
  • A cluster label is not the same as a known defect label.

Clustering is best used as an exploration tool unless it has been carefully validated.
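When the number of clusters is unclear, one common check is the silhouette score, which measures how well each point sits inside its own cluster compared with the nearest other cluster. The sketch below tries several values of k on synthetic, well-separated blobs. Real data is rarely this clean, so treat the score as one piece of evidence.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=[[-5, -5], [5, -5], [0, 5]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1; higher is better
    print(f'k={k}: silhouette={scores[k]:.3f}')

best_k = max(scores, key=scores.get)
print('best k by silhouette:', best_k)
```

On this data the score peaks at k = 3, matching how the blobs were generated. A high score still does not mean the clusters correspond to real defect types; that takes review by someone who knows the process.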

Comparing common algorithms

Algorithm | Common problem type | Strengths | Weaknesses
Linear Regression | Regression | Simple, interpretable, good baseline | Struggles with complex nonlinear patterns
Logistic Regression | Classification | Simple, fast, probability-like outputs | Mostly linear boundaries unless features are transformed
Decision Trees | Classification or regression | Easy to explain, handles nonlinear rules | Can overfit and be unstable
Random Forest | Classification or regression | Strong general-purpose tabular model | Less interpretable and larger than one tree
SVMs | Mostly classification | Strong margins, useful on small or medium data | Sensitive to scaling and tuning; slow on large data
Naive Bayes | Classification | Very fast, good text baseline | Assumes feature independence
Neural Networks | Classification or regression | Learns complex patterns, strong for images and sequences | Needs data, compute, and careful tuning
Gradient Boosting | Classification or regression | Often excellent on tabular data | More tuning; can overfit
Clustering | Unsupervised grouping | Finds structure without labels | Results can be hard to validate

This table is a starting point, not a rulebook. Always compare models using the same validation method and the metric that matches the real goal.

Choosing a reasonable first model

For many projects, start with a simple baseline before trying complex methods.

Situation                                         | Reasonable starting point
Numeric target with tabular features              | Linear Regression
Binary or multi-class label with tabular features | Logistic Regression
Nonlinear tabular problem                         | Decision Tree or Random Forest
Strong tabular performance needed                 | Random Forest or Gradient Boosting
Small scaled dataset with clear class boundary    | SVM
Text classification baseline                      | Naive Bayes
Image classification with enough data             | Neural Network
No labels and you want groups                     | Clustering

Starting simple gives you a baseline. A complex model should earn its place by improving performance, reliability, or usefulness.
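An even simpler reference point than a linear model is a predictor that ignores the features entirely. The sketch below uses scikit-learn's `DummyRegressor`, which always predicts the training mean, and compares it with Linear Regression on synthetic weld data. The 380 and 1.8 values are assumptions used only to generate the data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
weld_length_mm = rng.uniform(20, 60, size=200)
strength = 380 + 1.8 * weld_length_mm + rng.normal(0, 10, size=200)  # synthetic target
X = weld_length_mm.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, strength, test_size=0.3, random_state=0)

# The dummy model ignores X and always predicts the mean training strength
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f'baseline MAE: {baseline_mae:.1f} MPa')
print(f'linear MAE:   {model_mae:.1f} MPa')
```

If a model cannot beat the dummy baseline on held-out data, the features carry little usable signal or something is wrong in the pipeline. For classification, `DummyClassifier` plays the same role.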

Interpreting strengths and weaknesses

When comparing algorithms, do not only ask which one has the highest score.

Also ask:

  • Can the model be explained to the people who will use it?
  • Does it require features that will be available at prediction time?
  • Does it perform well on rare but important cases?
  • Is it stable when the data changes slightly?
  • How much data does it need?
  • How much computing power does it need?
  • How difficult will it be to maintain?

For example, a neural network may produce the highest accuracy on weld images, but on a small tabular dataset a Random Forest may be easier to explain. Likewise, a Logistic Regression model is less flexible, but it may be enough if the decision boundary is simple.

A practical model-selection workflow

When choosing an algorithm family, use a repeatable workflow.

  1. Define the target. Decide whether the task is regression, classification, clustering, or something else.
  2. Inspect the data. Check class counts, missing values, outliers, feature types, and leakage risks.
  3. Choose a metric. Pick a metric that matches the practical cost of mistakes.
  4. Build a simple baseline. Use Linear Regression for numeric targets or Logistic Regression for classification when appropriate.
  5. Try one or two stronger models. For tabular data, Random Forest and Gradient Boosting are common next steps.
  6. Compare on validation data. Do not judge models only by training performance.
  7. Inspect errors. Look at examples the model gets wrong.
  8. Consider explainability and deployment. A slightly less accurate model may be better if it is easier to trust and maintain.
  9. Document the decision. Record why the selected model was chosen and which alternatives were tested.
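
Steps 1 and 4 of the workflow above can be sketched as a small helper. The function name `suggest_task`, the `BASELINES` mapping, and the example values are hypothetical and exist only for illustration. The rule is a rough heuristic: integer-coded class labels would be misread as a numeric target, so treat the result as a prompt for thinking, not a decision.

```python
# Baseline suggestions taken from the "reasonable first model" table.
BASELINES = {
    "regression": "Linear Regression",
    "classification": "Logistic Regression",
    "clustering": "Clustering",
}


def suggest_task(target_values):
    """Guess the task type from a few example target values (heuristic)."""
    if target_values is None:
        # No target at all: look for structure instead of predicting.
        return "clustering"
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in target_values
    )
    return "regression" if is_numeric else "classification"


# Hypothetical example values for two targets used in this chapter.
weld_strength_mpa = [412.5, 398.0, 405.2]
defect_class = ["crack", "porosity", "no_defect"]

for name, values in [
    ("weld_strength_mpa", weld_strength_mpa),
    ("defect_class", defect_class),
    ("no target", None),
]:
    task = suggest_task(values)
    print(f"{name}: {task} -> start with {BASELINES[task]}")
```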

Common mistakes

Here are a few traps to avoid:

  • Using classification for a numeric target. If the target is defect_length_mm, regression is usually more natural than turning lengths into arbitrary bins.
  • Using regression for a category. If the target is crack, porosity, or no_defect, the model should treat these as classes, not ordered numbers.
  • Starting with the most complex model. A neural network may not be needed for a small tabular dataset.
  • Ignoring class imbalance. A classifier can look accurate while missing rare defects.
  • Comparing models using different data splits. Use the same train, validation, and test split for fair comparison.
  • Trusting training performance. A model can memorize training data and fail on new examples.
  • Assuming clusters are labels. Cluster 0 is not automatically a defect type.
  • Forgetting feature scaling. Algorithms such as SVMs, Logistic Regression, and Neural Networks often need scaled numeric features.
  • Choosing only by accuracy. The best model also needs to fit the cost of mistakes, explainability needs, and deployment constraints.
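
The class-imbalance trap is easy to check with arithmetic. The sketch below uses a hypothetical batch of 1,000 welds in which 2% are defective and scores a "model" that always predicts no defect. It uses plain Python so the calculation stays visible.

```python
# Hypothetical inspection batch: 1,000 welds, 20 of them defective (2%).
# 1 = defect, 0 = no defect.
y_true = [1] * 20 + [0] * 980

# A "model" that always predicts no defect.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the defect class: fraction of real defects that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")     # accuracy = 0.98
print(f"defect recall = {recall:.2f}")  # defect recall = 0.00
```

The accuracy looks strong, yet every defective weld is missed, which is why a metric such as recall on the defect class is worth reporting alongside accuracy.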

Summary

Algorithm selection begins with the target. If the target is numeric, the problem is usually regression. If the target is a category, the problem is usually classification. If there are no labels and the goal is to find structure, unsupervised learning methods such as clustering may be appropriate.

Supervised learning uses examples with known answers. Unsupervised learning finds structure without known answers. Self-supervised learning creates training tasks from the data itself, often to learn useful representations. Reinforcement learning learns actions from rewards and is most useful for sequential decision-making and control problems.

Common algorithm families include Linear Regression, Logistic Regression, Decision Trees, Random Forest, SVMs, Naive Bayes, Neural Networks, Gradient Boosting, and Clustering. Each has strengths and weaknesses. Good model selection means starting with the problem, building a baseline, comparing fairly, inspecting errors, and documenting the decision.

Topic | Key ideas
Regression | Predicts numeric values
Classification | Predicts categories or classes
Supervised learning | Learns from labeled examples
Unsupervised learning | Finds structure without labels
Self-supervised learning | Creates learning signals from the data itself
Reinforcement learning | Learns actions from rewards
Linear Regression | Simple numeric prediction baseline
Logistic Regression | Simple classification baseline
Decision Trees | Rule-like model for classification or regression
Random Forest | Ensemble of trees; strong tabular model
SVMs | Margin-based classifier; sensitive to scaling
Naive Bayes | Fast probability-based classifier
Neural Networks | Flexible models for complex patterns
Gradient Boosting | Sequential ensemble; often strong on tabular data
Clustering | Groups similar examples without labels

Practice Questions

  1. In your own words, what is an algorithm family?
  2. What is the difference between regression and classification?
  3. If the target is weld_strength_mpa, is the problem regression or classification?
  4. If the target is defect_class, is the problem regression or classification?
  5. Why can accuracy be misleading in a rare-defect classification problem?
  6. What learning framework uses labeled examples?
  7. What learning framework finds structure without known answers?
  8. What is one reason self-supervised learning can be useful?
  9. What kind of problem is reinforcement learning designed for?
  10. What does Linear Regression predict?
  11. Why is Logistic Regression used for classification even though it includes the word regression?
  12. What is one strength and one weakness of a Decision Tree?
  13. How is a Random Forest different from a single Decision Tree?
  14. Why do SVMs often require feature scaling?
  15. What does the "naive" assumption mean in Naive Bayes?
  16. Name one situation where a Neural Network may be a good choice.
  17. Name one reason a Neural Network may not be a good first model.
  18. How does Gradient Boosting build an ensemble?
  19. What is clustering used for?
  20. Why should cluster IDs not automatically be treated as real defect labels?
  21. Choose a reasonable first model for a numeric tabular target and explain why.
  22. Choose a reasonable first model for a binary classification target and explain why.
  23. Name two questions you should ask before choosing a model metric.
  24. Why should models be compared using the same validation split?
  25. Write a short checklist you would use before selecting an algorithm family for a new welding dataset.

Try it: how the boundary changes

[Interactive figure: decision regions for a k-nearest neighbors classifier compared with a simple decision tree. The blue region is where the model predicts class A and the coral region is where it predicts class B, with the training points overlaid. A moderate k balances smoothness with local detail.]

Check your understanding

Tier 1 depth · Concepts & interpretation

  1. A model predicts weld_strength_mpa, a continuous number. What kind of problem is this?

  2. A model predicts defect_class (crack, porosity, or none). What kind of problem is this?

  3. Which learning framework trains on inputs paired with known answers?

  4. Only 2% of welds are defective. A model that always predicts 'no defect' scores 98% accuracy. Why is accuracy misleading here?

  5. You have many weld images but no labels, and you want to group visually similar ones. Which approach fits best?
