What you'll be able to do

  • Create new features from raw columns (ratios, aggregates, date parts, derived measures)
  • Encode categorical variables and scale numeric features for modeling
  • Select informative features and avoid data leakage
Competencies you'll build
  • Apply one-hot/ordinal encoding and standard/min-max scaling
  • Build features with pandas transformations
  • Use a train/validation split correctly to prevent leakage

Key terms in this chapter

Chapter - Feature Engineering

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program and is not derived from Automate the Boring Stuff with Python; it is written in a similar tone for continuity with the other chapters.

About this chapter

So far, you have learned how to write Python scripts, load files, organize data in tables, and summarize datasets. Those skills help you answer an important question before modeling: What is in my data?

This chapter focuses on the next question: How should the data be prepared so a model can use it?

Feature engineering is the process of creating, transforming, selecting, and organizing input data for analysis or machine learning. A feature may be a column in a spreadsheet, a measurement from a sensor, a pixel statistic from an image, or a value calculated from other fields.

The examples in this chapter use manufacturing, inspection, and welding language, but the ideas apply to many types of data.

What is a feature?

A feature is an input value used by a model.

For example, imagine a table where each row describes one weld inspection:

weld_id material weld_length_mm operator_birthdate defect_class
W001 steel 42.5 1988-04-12 no_defect
W002 alloy 38.0 1976-09-03 porosity
W003 steel 51.2 1992-01-25 crack

Possible features include:

  • material
  • weld_length_mm
  • operator_birthdate
  • A new column calculated from operator_birthdate, such as operator_age

The label, or target, is usually the thing you want the model to predict. In this example, defect_class is the label.

Why feature engineering matters

Models learn from the information you give them. If important information is missing, poorly formatted, or hidden inside another field, the model may struggle.

Feature engineering helps you:

  • Turn raw fields into useful measurements.
  • Convert text categories into numbers.
  • Scale numeric values so one large column does not dominate another.
  • Put data into the shape expected by a model.
  • Reduce noise by dropping weak or duplicate features.
  • Handle imbalanced classes before training.
  • Explain clearly why certain inputs were used.

Feature engineering does not replace domain knowledge. A welding engineer, inspector, or process owner may know which measurements are meaningful and which are artifacts of the data collection process.

Categorical and continuous variables

Most features can be grouped into two broad types:

Variable type Meaning Example
Categorical Values represent groups, names, labels, or categories Material type, shift, defect class
Continuous Values are numeric measurements that can vary over a range Temperature, weld length, voltage

This distinction matters because the two types are usually prepared differently.

Categorical values often need to be encoded. Continuous values often need to be scaled, transformed, or checked for outliers.

Types of categorical data

Categorical variables describe group membership. There are several common types.

Type Meaning Example
Nominal Categories have no natural order steel, aluminum, titanium
Ordinal Categories have a meaningful order low, medium, high
Binary Only two categories pass, fail
Multi-label One row may belong to more than one category porosity and undercut in the same image

Nominal categories

Nominal categories are names or groups without a built-in ranking.

For example:

python
materials = ['steel', 'aluminum', 'titanium']

It would be misleading to say steel = 1, aluminum = 2, and titanium = 3 if the numbers imply an order. A model might think titanium is "greater than" steel in a mathematical way, even though these are just names.

Ordinal categories

Ordinal categories have a meaningful order.

For example:

python
risk_level = ['low', 'medium', 'high']

In this case, mapping values to numbers may make sense:

python
risk_map = {
    'low': 0,
    'medium': 1,
    'high': 2,
}

The important question is whether the order is real and useful. If the categories are just names, do not treat them like ordered values.

Binary categories

Binary categories have two values:

python
inspection_result = ['pass', 'fail']

These can often be represented with 0 and 1:

python
result_map = {
    'pass': 0,
    'fail': 1,
}

Always document what 0 and 1 mean. Otherwise, another person may not know which value represents the positive class.

Multi-label categories

Sometimes one row can have more than one label.

For example, a weld image might contain both porosity and undercut.

python
labels = ['porosity', 'undercut']

This is different from ordinary multi-class classification, where each row belongs to exactly one class. Multi-label problems usually require a separate yes-or-no column for each possible label.

Types of continuous variables

Continuous variables are numeric measurements. They can take many possible values across a range.

Type Meaning Example
Interval Differences are meaningful, but zero does not mean absence Temperature in Celsius
Ratio Differences and ratios are meaningful; zero means none Length, area, mass, time
Count Whole-number quantity Number of defects in an image
Time-based Dates, timestamps, durations, or repeated readings over time Inspection date, sensor timestamp

Interval variables

Temperature in Celsius is an interval variable. The difference between 20 and 30 degrees is meaningful, but 0 degrees Celsius does not mean "no temperature."

Ratio variables

Weld length is a ratio variable. A weld that is 40 mm long is twice as long as a weld that is 20 mm long. A value of 0 mm means no length.

Counts

Counts are whole numbers:

python
number_of_indications = [0, 1, 2, 3, 4]

Counts are numeric, but they often have skew. Many rows may have 0, while a small number of rows may have large counts.

Time-based variables

Dates and timestamps are common in real datasets:

python
inspection_date = '2026-04-26'

A model usually cannot use a raw date string directly. You may need to extract useful pieces, such as year, month, day of week, shift, or age at inspection time.

Creating new features from existing fields

Raw fields are not always the best model inputs. Sometimes the useful information is hidden inside a field.

For example, operator_birthdate may be less useful than operator_age.

python
import pandas as pd

df = pd.DataFrame({
    'operator_birthdate': ['1988-04-12', '1976-09-03', '1992-01-25'],
    'inspection_date': ['2026-04-26', '2026-04-26', '2026-04-26'],
})

df['operator_birthdate'] = pd.to_datetime(df['operator_birthdate'])
df['inspection_date'] = pd.to_datetime(df['inspection_date'])

age_days = df['inspection_date'] - df['operator_birthdate']
df['operator_age'] = (age_days.dt.days / 365.25).astype(int)

print(df[['operator_birthdate', 'operator_age']])

Output:

text
  operator_birthdate  operator_age
0         1988-04-12            38
1         1976-09-03            49
2         1992-01-25            34

The new operator_age feature is easier for many models to use than the original birthdate.

More examples of derived features

Feature engineering often means turning existing data into more direct measurements.

Existing fields New feature Why it may help
birthdate, inspection_date age_at_inspection Converts date strings into a useful numeric value
start_time, end_time inspection_duration_seconds Captures how long a process took
defect_width, defect_height defect_area Combines two dimensions into size
max_temperature, min_temperature temperature_range Captures process variation
image_pixels mean_pixel, std_pixel Summarizes image brightness and contrast
x_position, y_position distance_from_center Captures location in a part or image

Here is a small example:

python
df = pd.DataFrame({
    'defect_width_mm': [2.0, 5.5, 1.2],
    'defect_height_mm': [1.5, 2.0, 0.8],
    'max_temperature_c': [925, 940, 910],
    'min_temperature_c': [870, 880, 865],
})

df['defect_area_mm2'] = df['defect_width_mm'] * df['defect_height_mm']
df['temperature_range_c'] = df['max_temperature_c'] - df['min_temperature_c']

print(df)

Feature engineering is useful when the new feature has a reason to exist. Do not create hundreds of random combinations just because you can.

Avoiding data leakage

Data leakage happens when a feature gives the model information that would not be available at prediction time.

Train (≈60%) Validation Test fit scaler / encoder apply the same fitted transform (do not re-fit)
Split first, then fit every preprocessing step (scalers, encoders, imputers) on the training set only and apply those fitted transforms to validation and test. Fitting on all the data leaks information from the test set into training.

For example, suppose you are building a model to predict whether a weld will fail inspection before final review. These columns would be suspicious:

  • final_inspection_result
  • repair_required
  • failure_code
  • disposition_notes

Those fields may be created after the inspection outcome is already known. If you use them as inputs, the model may look excellent during testing but fail in real use.

Before using a feature, ask:

  1. Would this value be available when the model makes its prediction?
  2. Was this value created before or after the label?
  3. Is this feature just another version of the answer?

One-hot encoding categorical data

Many models require numeric input. A category like material = 'steel' must be converted into numbers.

One common method is one-hot encoding. It creates one column for each category.

material steel aluminum copper one-hot is_steel is_aluminum is_copper 1 0 0 0 1 0 0 0 1
One-hot encoding replaces one categorical column with one binary column per category. Each row gets a 1 in the column for its value and 0 everywhere else — no false ordering is implied between categories.

Original data:

weld_id material
W001 steel
W002 alloy
W003 steel
W004 titanium

One-hot encoded data:

weld_id material_alloy material_steel material_titanium
W001 0 1 0
W002 1 0 0
W003 0 1 0
W004 0 0 1

In pandas, use get_dummies():

python
import pandas as pd

df = pd.DataFrame({
    'weld_id': ['W001', 'W002', 'W003', 'W004'],
    'material': ['steel', 'alloy', 'steel', 'titanium'],
})

encoded_df = pd.get_dummies(df, columns=['material'], dtype=int)

print(encoded_df)

Output:

text
  weld_id  material_alloy  material_steel  material_titanium
0    W001               0               1                  0
1    W002               1               0                  0
2    W003               0               1                  0
3    W004               0               0                  1
A grid with one row per weld and one column per material (alloy, steel, titanium); each row has a single filled cell marked 1 for its material and 0 elsewhere.
One-hot encoding replaces the material column with one binary column per category — exactly one 1 per row.

One-hot encoding works well for categories with a small number of possible values. It can become difficult when a column has thousands of unique values, such as part serial numbers or free-text operator notes.

Scaling numeric values

Numeric features can have very different ranges:

Feature Example range
weld_length_mm 10 to 100
temperature_c 800 to 1200
defect_area_pixels 0 to 200000

Some models are sensitive to these ranges. A large-valued feature may dominate a smaller-valued feature even if it is not more important.

Scaling changes numeric values to a more consistent range.

Min-max scaling

Min-max scaling moves values into a fixed range, often 0 to 1.

python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({
    'weld_length_mm': [25, 50, 75, 100],
})

scaler = MinMaxScaler()
df['weld_length_scaled'] = scaler.fit_transform(df[['weld_length_mm']])

print(df)

Output:

text
   weld_length_mm  weld_length_scaled
0              25            0.000000
1              50            0.333333
2              75            0.666667
3             100            1.000000

Standard scaling

Standard scaling changes values so they have a mean near 0 and a standard deviation near 1.

python
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'temperature_c': [870, 885, 900, 930, 950],
})

scaler = StandardScaler()
df['temperature_scaled'] = scaler.fit_transform(df[['temperature_c']])

print(df)

This is often useful for models such as logistic regression, support vector machines, k-nearest neighbors, and neural networks.

Robust scaling

Robust scaling uses the median and quartiles, so it is less affected by outliers.

python
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({
    'defect_area_pixels': [12, 15, 18, 20, 5000],
})

scaler = RobustScaler()
df['defect_area_scaled'] = scaler.fit_transform(df[['defect_area_pixels']])

print(df)

Robust scaling can be helpful when a small number of extreme values are real and should not simply be deleted.

Plotting the same feature (with a few extreme outliers) under each scaler shows how they differ — only the shape and axis change, not the relative spacing:

Four histograms of the same feature: raw values, min-max scaled to 0-1, standard scaled to mean 0, and robust scaled using the median and IQR.
The same feature under three scalers. Min-max squeezes into [0, 1], standard centers on 0, and robust resists the outliers.
Show the code that generated this plot
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

rng = np.random.default_rng(0)
# A feature with a few extreme outliers (e.g. defect area in pixels)
values = np.concatenate([rng.normal(50, 10, 200), rng.normal(250, 8, 8)]).reshape(-1, 1)

panels = [
    ('Raw', values.ravel()),
    ('Min-max [0, 1]', MinMaxScaler().fit_transform(values).ravel()),
    ('Standard (mean 0, std 1)', StandardScaler().fit_transform(values).ravel()),
    ('Robust (median, IQR)', RobustScaler().fit_transform(values).ravel()),
]
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, (title, data) in zip(axes.ravel(), panels):
    ax.hist(data, bins=30, color='#2457c5')
    ax.set_title(title)
plt.tight_layout()
plt.show()

Fit scalers only on training data

When building a model, split your data first. Then fit the scaler only on the training set.

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Notice the difference:

  • fit_transform() is used on training data.
  • transform() is used on test data.

This prevents information from the test set from leaking into the training process.

Transforming size, shape, and label type for modeling

Machine learning libraries expect data in specific formats. A common source of errors is giving the model the right data in the wrong shape.

Tabular model input

For many scikit-learn models:

  • X should be a 2D table with shape (number_of_rows, number_of_features).
  • y should be a 1D array with shape (number_of_rows,).
python
import numpy as np

X = np.array([
    [42.5, 880],
    [38.0, 910],
    [51.2, 895],
])

y = np.array(['no_defect', 'porosity', 'crack'])

print(X.shape)
print(y.shape)

Output:

text
(3, 2)
(3,)

The first dimension must match. If X has 3 rows, y must have 3 labels.

Reshaping one feature

If you have one numeric feature, it may start as a 1D array:

python
weld_lengths = np.array([42.5, 38.0, 51.2])
print(weld_lengths.shape)

Output:

text
(3,)

Many scikit-learn tools expect a 2D input:

python
X = weld_lengths.reshape(-1, 1)
print(X.shape)

Output:

text
(3, 1)

The -1 tells NumPy to infer the number of rows.

Image model input

Image models usually expect arrays with a specific height, width, and number of channels.

For example:

python
from PIL import Image
import numpy as np

image = Image.open('weld_xray.png').convert('L')
image = image.resize((224, 224))

pixels = np.array(image)
pixels = pixels / 255.0

print(pixels.shape)

Output:

text
(224, 224)

Some neural networks expect a channel dimension:

python
pixels = pixels.reshape(224, 224, 1)
print(pixels.shape)

Output:

text
(224, 224, 1)

If you have many images, the batch may have shape:

text
(number_of_images, height, width, channels)

For example:

text
(1000, 224, 224, 1)

Label types

Labels also need the correct format.

For binary classification, labels are often represented as 0 and 1:

python
y = np.array([0, 1, 0, 1])

For multi-class classification, labels may be integer class IDs:

python
y = np.array([0, 2, 1, 0])

Or one-hot encoded labels:

python
from sklearn.preprocessing import LabelBinarizer

labels = ['no_defect', 'porosity', 'crack', 'no_defect']

encoder = LabelBinarizer()
y_one_hot = encoder.fit_transform(labels)

print(y_one_hot)

The correct label format depends on the model and loss function. Always check the documentation for the library you are using.

Rebalancing classes

A dataset is imbalanced when one class has many more examples than another.

For example:

defect_class count
no_defect 950
porosity 35
crack 15

A model trained on this dataset may learn to predict no_defect most of the time. Accuracy may look high, but the model may miss rare defects.

Checking class balance

python
class_counts = df['defect_class'].value_counts()
class_percentages = df['defect_class'].value_counts(normalize=True) * 100

print(class_counts)
print(class_percentages)

Class balance should be checked before training and again after splitting into train and test sets.

Strategy 1: Collect more data

The best solution is often to collect more examples of the rare classes.

Pros:

  • Adds real information.
  • Usually improves model reliability.
  • Helps rare classes represent real variation.

Cons:

  • May be expensive or slow.
  • Rare defects may not occur often.
  • Requires labeling effort.

Strategy 2: Oversampling

Oversampling increases the number of minority-class examples in the training set. The simplest version duplicates existing rows.

python
from sklearn.utils import resample

majority = df[df['defect_class'] == 'no_defect']
minority = df[df['defect_class'] == 'crack']

minority_oversampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

balanced_df = pd.concat([majority, minority_oversampled])

Pros:

  • Simple to understand.
  • Keeps all majority-class examples.
  • Useful when the minority class is small.

Cons:

  • Duplicates can cause overfitting.
  • Does not add truly new information.
  • Training may take longer because the dataset grows.

Strategy 3: Undersampling

Undersampling reduces the number of majority-class examples.

python
majority_undersampled = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=42,
)

balanced_df = pd.concat([majority_undersampled, minority])

Pros:

  • Simple and fast.
  • Reduces training time.
  • Can help when the majority class is very large.

Cons:

  • Throws away data.
  • May remove useful variation from the majority class.
  • Can make the model less reliable if too much data is removed.

Oversampling and undersampling both end at balanced class counts, but they get there differently — one grows the minority class, the other shrinks the majority:

Three bar charts of class counts: the original is dominated by no_defect (950); after oversampling all three classes are equal and large; after undersampling all three are equal and small.
The same classes before and after balancing. Oversampling levels up the rare classes; undersampling levels down the common one.
Show the code that generated this plot
python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import resample

df = pd.DataFrame({'defect_class':
                   ['no_defect'] * 950 + ['porosity'] * 35 + ['crack'] * 15})
majority = df[df['defect_class'] == 'no_defect']
crack = df[df['defect_class'] == 'crack']

over = pd.concat([df, resample(crack, replace=True,
                  n_samples=len(majority) - len(crack), random_state=42)])
under = pd.concat([resample(majority, replace=False, n_samples=15, random_state=42),
                   df[df['defect_class'] != 'no_defect']])

order = ['no_defect', 'porosity', 'crack']
fig, axes = plt.subplots(1, 3, figsize=(12, 3.8))
for ax, (title, d) in zip(axes, [('Original (imbalanced)', df),
                                 ('After oversampling', over),
                                 ('After undersampling', under)]):
    counts = d['defect_class'].value_counts()
    ax.bar(order, [counts.get(c, 0) for c in order], color='#2457c5')
    ax.set_title(title)
plt.tight_layout()
plt.show()

Strategy 4: Synthetic examples

Some methods create synthetic minority-class examples. One common tabular method is SMOTE, which creates new points between existing minority examples.

python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Pros:

  • Adds more minority-class training examples.
  • Can reduce overfitting compared with simple duplication.
  • Useful for some tabular datasets.

Cons:

  • Synthetic examples may not be physically realistic.
  • Can create confusing samples near class boundaries.
  • Requires careful validation.

For image datasets, augmentation is a common way to create variation:

  • Rotate slightly.
  • Crop or shift.
  • Adjust brightness or contrast.
  • Flip only when the flip makes physical sense.

Do not use an augmentation that changes the meaning of the label.

Strategy 5: Class weights

Some models allow you to give rare classes more weight during training.

python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

Pros:

  • Does not duplicate or delete examples.
  • Easy to apply in many libraries.
  • Keeps the original training dataset size.

Cons:

  • May not be enough for severe imbalance.
  • Can increase false positives.
  • Still requires good examples of each class.

Strategy 6: Threshold adjustment

For binary classification, a model may output a probability. The default decision threshold is often 0.5, but that may not be best.

python
probabilities = model.predict_proba(X_test)[:, 1]
predictions = probabilities >= 0.30

Lowering the threshold may catch more defects, but it may also create more false alarms.

Pros:

  • Useful when the cost of missing a defect is high.
  • Does not require retraining the model.
  • Lets stakeholders choose a tradeoff.

Cons:

  • More false positives may burden inspectors.
  • The right threshold depends on business risk.
  • Must be evaluated with precision, recall, and confusion matrices.

Rebalancing only the training data

Rebalancing should usually be applied only to the training set.

Do not oversample or undersample the test set. The test set should represent real-world data as honestly as possible.

Identifying important features

Not every feature helps. Some features are noisy, redundant, misleading, or unavailable in real use.

Feature importance can come from several sources:

  • Domain knowledge.
  • Correlation with the label.
  • Model-based importance.
  • Permutation importance.
  • Error analysis.
  • Stakeholder review.

Start with domain knowledge

A domain expert may know that some values are physically meaningful:

  • Heat input
  • Travel speed
  • Material type
  • Inspection angle
  • Exposure settings
  • Defect location

They may also know that some values are not meaningful:

  • Random file order
  • Internal database row ID
  • Operator notes created after final disposition
  • A serial number that leaks production batch information

Good feature engineering combines data evidence with subject matter expertise.

Correlation and simple checks

For numeric features, correlation can reveal relationships.

python
numeric_columns = ['weld_length_mm', 'temperature_c', 'defect_area_mm2']

print(df[numeric_columns].corr())

Correlation is not proof that a feature is useful, but it can identify duplicate or highly related columns.

Model-based feature importance

Some models can estimate feature importance.

python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)

print(importance_df)

A horizontal bar chart makes the ranking easy to read at a glance:

A horizontal bar chart ranking six features by Random Forest importance, with a few features clearly more important than the rest.
Random Forest feature importances, sorted. A short bar means the model barely relied on that feature.
Show the code that generated this plot
python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

feature_names = ['mean_pixel', 'std_pixel', 'defect_area',
                 'weld_length', 'temperature', 'operator_id']
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=1, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X, y)
imp = (pd.DataFrame({'feature': feature_names, 'importance': model.feature_importances_})
       .sort_values('importance'))

plt.barh(imp['feature'], imp['importance'], color='#2457c5')
plt.xlabel('importance')
plt.title('Random Forest feature importances')
plt.show()

Feature importance should be treated as evidence, not final truth. Some models give high importance to features with many possible values. Correlated features can also split importance between them.

Permutation importance

Permutation importance asks: How much worse does the model get if this feature is randomly shuffled?

python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model,
    X_test,
    y_test,
    n_repeats=10,
    random_state=42,
)

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
}).sort_values('importance', ascending=False)

print(importance_df)

If shuffling a feature hurts performance, the model was relying on that feature.

Dropping unimportant features

Dropping features can make a model simpler, faster, and easier to explain.

Reasons to drop a feature include:

  • It is not available at prediction time.
  • It duplicates another feature.
  • It contains many missing values.
  • It has the same value for almost every row.
  • It appears unrelated to the label.
  • It creates privacy, security, or compliance concerns.
  • It makes the model less stable on new data.

For example:

python
columns_to_drop = [
    'database_row_id',
    'final_disposition_notes',
    'operator_birthdate',
]

X = df.drop(columns=columns_to_drop + ['defect_class'])
y = df['defect_class']

In this example, operator_birthdate might be dropped after creating operator_age. The original date may be more sensitive and less directly useful than the derived age feature.

The iterative nature of feature selection

Feature selection is rarely a one-time step. It is usually iterative.

A practical workflow:

  1. Start with a reasonable set of features.
  2. Train a baseline model.
  3. Measure performance with appropriate metrics.
  4. Inspect feature importance and model errors.
  5. Remove weak, duplicate, leaking, or impractical features.
  6. Add better derived features based on what you learned.
  7. Retrain and compare against the baseline.
  8. Repeat until changes stop helping or the model is simple enough.

Feature selection is not just about getting the highest score. It is also about building a model that can be trusted, maintained, and explained.

Communicating feature selection to stakeholders

Stakeholders may not care about every technical detail, but they do need to understand why certain inputs are used.

When explaining feature selection, focus on plain language:

  • What problem are we trying to solve?
  • Which inputs are used?
  • Why are those inputs available and reliable?
  • Which inputs were removed?
  • Why were they removed?
  • What tradeoffs did we accept?
  • How did performance change?

Instead of saying:

We dropped collinear features after permutation importance showed low marginal utility.

You might say:

We removed duplicate measurements that were telling the model the same thing. The model became simpler, and test performance stayed about the same.

Defending feature selection decisions

To defend feature selection clearly, keep evidence.

Useful evidence includes:

  • A list of selected features.
  • A list of removed features.
  • The reason each feature was removed.
  • Before-and-after model performance.
  • Feature importance plots or tables.
  • Notes from domain experts.
  • Known limitations.

For example:

Feature Decision Reason
weld_length_mm Keep Physically meaningful and available before inspection
database_row_id Drop Identifier with no process meaning
operator_birthdate Drop Replaced with operator_age; reduces sensitivity
final_disposition_notes Drop Created after the outcome; leakage risk

Good communication builds confidence. It also makes the project easier to audit later.

The curse of dimensionality

Dimensionality means the number of features.

If a dataset has 5 input columns, it has 5 dimensions. If it has 10,000 input columns, it has 10,000 dimensions.

The curse of dimensionality describes problems that appear when data has too many dimensions compared with the number of examples.

As dimensions increase:

  • Data becomes sparse.
  • Distances between points become less meaningful.
  • Models may overfit.
  • Training may become slower.
  • More examples are needed to cover the feature space.
  • Visualization becomes difficult.

For example, one-hot encoding a high-cardinality column can create thousands of columns. Image data can also have many dimensions. A 224 x 224 grayscale image has 50,176 pixel values before any additional features are added.

High dimensionality is not always bad. Sometimes many features are necessary. The risk is having many weak or noisy features without enough examples to support them.

Dimensionality reduction

Dimensionality reduction means representing data with fewer features while preserving useful information.

Dimensionality reduction can help with:

  • Visualization.
  • Noise reduction.
  • Faster training.
  • Compressing features.
  • Finding structure in high-dimensional data.

It can also hide information, so it should be used thoughtfully.

PCA

Principal Component Analysis, or PCA, is a common dimensionality reduction method for numeric data.

PCA creates new features called principal components. These components are combinations of the original features. The first component captures as much variation as possible. The second captures the next most variation, and so on.

python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_)

Output:

text
(100, 2)
[0.42 0.21]

In this example, the first two principal components explain about 63% of the variance.

PCA is useful when:

  • Features are numeric.
  • Features are correlated.
  • You want a smaller set of components.
  • You need a repeatable transformation for training and future data.

PCA has limitations:

  • Components can be hard to explain.
  • It captures linear patterns better than nonlinear ones.
  • Scaling matters.
  • A component is not the same as an original physical measurement.

t-SNE

t-SNE stands for t-distributed Stochastic Neighbor Embedding. It is often used to visualize high-dimensional data in two dimensions.

python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
)

X_tsne = tsne.fit_transform(X_scaled)
print(X_tsne.shape)

t-SNE is useful for exploring whether examples form clusters. For example, weld images from different defect classes may group together.

t-SNE has limitations:

  • It is mainly a visualization tool.
  • Distances between far-apart clusters can be misleading.
  • Results depend on parameters such as perplexity.
  • It can be slow on large datasets.
  • It does not naturally provide a simple transformation for future data in the same way PCA does.

Use t-SNE for exploration, not as the only proof that classes are separable.

Run on the same 10-dimensional data, PCA and t-SNE both reveal the three classes in 2D, but with different geometry — PCA preserves global structure, t-SNE emphasizes local clusters:

Two scatter plots of the same 10-dimensional, three-class dataset reduced to 2D: PCA on the left and t-SNE on the right, each showing three colored groups.
The same high-dimensional data projected to 2D by PCA (linear) and t-SNE (non-linear), colored by class. UMAP gives a similar cluster view.
Show the code that generated this plot
python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, n_features=10, centers=3,
                  cluster_std=2.5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.6))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolor='white', s=25)
ax1.set_title('PCA (linear)')
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, edgecolor='white', s=25)
ax2.set_title('t-SNE (non-linear)')
plt.show()

UMAP

UMAP stands for Uniform Manifold Approximation and Projection. Like t-SNE, it is often used for visualization, but it can also be useful for feature compression in some workflows.

python
import umap

reducer = umap.UMAP(
    n_components=2,
    random_state=42,
)

X_umap = reducer.fit_transform(X_scaled)
print(X_umap.shape)

UMAP is useful when:

  • You want to visualize high-dimensional data.
  • You want a method that is often faster than t-SNE.
  • You want to preserve some local structure in the data.
  • You want to transform future data after fitting the reducer.

UMAP has limitations:

  • Results depend on parameters such as n_neighbors and min_dist.
  • Clusters in a plot can look more certain than they really are.
  • It may require an extra package.
  • It should still be validated with real model performance and domain review.

Comparing PCA, t-SNE, and UMAP

Method Common use Strength Caution
PCA Compression and visualization Fast, repeatable, good for linear structure Components may be hard to explain
t-SNE Visualization Good at showing local clusters Mainly exploratory; distances can mislead
UMAP Visualization and some compression Often fast; can preserve local structure Parameter choices affect the plot

Dimensionality reduction plots are helpful conversation starters. They are not final proof that a model will work.

A practical feature engineering workflow

When preparing a dataset for modeling, use a repeatable checklist.

  1. Identify the label. Know exactly what the model is trying to predict.
  2. Separate feature types. Mark categorical, continuous, date/time, image, and text fields.
  3. Check availability. Remove fields that would not be available at prediction time.
  4. Create derived features. Calculate useful values such as age, duration, area, ratios, or pixel summaries.
  5. Encode categories. Use one-hot encoding for nominal categories and ordered mappings for true ordinal categories.
  6. Scale numeric values. Use min-max, standard, or robust scaling when appropriate.
  7. Fix shape and label format. Make sure X and y match what the model expects.
  8. Split the data. Create train, validation, and test sets before fitting transformations.
  9. Rebalance training data. Use collection, oversampling, undersampling, synthetic examples, class weights, or threshold tuning when needed.
  10. Select features. Keep useful, available, explainable features and drop weak or risky ones.
  11. Evaluate and iterate. Compare each feature change against a baseline.
  12. Document decisions. Record what changed and why.

Common mistakes

Here are a few traps to avoid:

  • Encoding categories as numbers when there is no order. This can create a fake ranking.
  • Scaling before splitting the data. This can leak test-set information into training.
  • Balancing the test set. The test set should reflect the real world.
  • Keeping leakage features. A model that sees the answer in disguise will not be useful in production.
  • Creating too many weak features. More columns do not automatically mean more information.
  • Dropping features only because one model ranked them low. Check stability, domain meaning, and error patterns.
  • Trusting a t-SNE or UMAP plot too much. A nice-looking cluster plot is not the same as a validated model.
  • Forgetting to document feature decisions. Future users need to understand why the model uses certain inputs.

Summary

Feature engineering prepares raw data for modeling. Categorical variables describe groups and may be nominal, ordinal, binary, or multi-label. Continuous variables are numeric measurements and may include interval values, ratio values, counts, and time-based fields.

Useful features can be created from existing fields, such as calculating age from birthdate, duration from timestamps, area from width and height, or pixel summaries from images. Categorical data can be converted to one-hot encoded columns. Numeric data can be scaled with methods such as min-max scaling, standard scaling, and robust scaling.

Models also require the correct size, shape, and label type. Rebalancing strategies such as collecting more data, oversampling, undersampling, synthetic examples, class weights, and threshold adjustment can help with imbalanced classes, but each has tradeoffs.

Feature selection is iterative. Important features should be identified using domain knowledge, model evidence, error analysis, and stakeholder review. Unimportant or risky features should be dropped when doing so improves simplicity, reliability, or explainability.

High-dimensional data can create problems known as the curse of dimensionality. PCA, t-SNE, and UMAP can reduce dimensionality or help visualize structure, but their results should be interpreted carefully.

Topic Key ideas
Categorical variables Represent groups; include nominal, ordinal, binary, and multi-label data
Continuous variables Numeric measurements; include interval, ratio, count, and time-based data
Derived features Create useful inputs such as age, duration, area, ratio, and pixel summaries
One-hot encoding Converts nominal categories into numeric columns
Scaling Puts numeric values into comparable ranges
Shape and label type Models expect specific input shapes and target formats
Class rebalancing Handles uneven class counts using several strategies with tradeoffs
Feature importance Helps identify useful, weak, duplicate, or risky inputs
Feature selection Iterative process of keeping useful features and dropping unhelpful ones
Stakeholder communication Explains selected and removed features in plain language
Curse of dimensionality Too many dimensions can cause sparsity, overfitting, and slow training
PCA Reduces numeric features into principal components
t-SNE Exploratory visualization of high-dimensional data
UMAP Visualization and possible compression of high-dimensional data
Practice Questions

Practice Questions

  1. In your own words, what is feature engineering?
  2. What is the difference between a feature and a label?
  3. What is the difference between categorical and continuous variables?
  4. Give one example of a nominal categorical variable.
  5. Give one example of an ordinal categorical variable.
  6. Why can it be misleading to encode nominal categories as 1, 2, and 3?
  7. What is a binary categorical variable?
  8. What is a multi-label problem?
  9. Give one example of a ratio variable.
  10. Why might a raw birthdate be converted into age?
  11. Write pandas code that calculates defect_area from defect_width and defect_height.
  12. What does one-hot encoding do?
  13. Use pandas to one-hot encode a column named material.
  14. Why do some models need numeric features to be scaled?
  15. What is the difference between min-max scaling and standard scaling?
  16. Why should a scaler be fit only on the training data?
  17. For scikit-learn, what shape should X usually have?
  18. If a one-feature NumPy array has shape (100,), how can you reshape it to (100, 1)?
  19. Why should class imbalance be checked before training?
  20. Name three strategies for handling imbalanced classes.
  21. What is one advantage and one disadvantage of oversampling?
  22. What is one advantage and one disadvantage of undersampling?
  23. Why should rebalancing usually be applied only to the training set?
  24. Name two reasons to drop a feature.
  25. What is data leakage?
  26. Why is feature selection an iterative process?
  27. How would you explain to a non-technical stakeholder why a feature was removed?
  28. What is the curse of dimensionality?
  29. What does PCA do?
  30. Why should t-SNE and UMAP plots be interpreted carefully?

Try it: why scaling matters for distance

Interactive

Each dot is a weld: OK (outline) vs defect (filled). The star is a new weld. Its nearest neighbor depends on how you scale bead width and IR signal — the same issue k-NN and SVMs face.

Bead width (mm)IR signal (0–1)

Nearest neighbor: defect weld (12.8 mm, IR 0.28)

Bead width is in millimeters; IR signal is 0–1. Width differences swamp signal in raw Euclidean distance — the star matches a defect by width alone.

View the Python sourceStandardScaler in scikit-learn

This is the runnable code behind the visual above. Open it in Google Colab to run and edit it in your browser — no setup, but you'll need a free Google account — or copy it into your own notebook.

Loading source…

Check your understanding

Tier 2 depth · Applied coding

0 / 5 correct
  1. A table has a `material` column with values steel, alloy, and titanium. These categories have no natural order. How should you encode them for most models?

  2. Your data has `weld_length_mm` (range 30–60) and `voltage` (range 0–1 after a sensor scaled it). Why scale features before a distance- or gradient-based model?

  3. The dataset stores `operator_birthdate`. A better feature for predicting defects is usually:

  4. A `severity` column has values low, medium, high. What kind of categorical variable is this, and what does that allow?

  5. Which of these is a feature-engineering decision a domain expert (welding engineer) is best positioned to inform?

Go deeper

More in Additional Resources →
View Code Examples ← Dataset Statistics and Visualization Algorithm Families →