Lab — Heart disease risk
This notebook uses heart.csv.
Models: build and train PyTorch MLPs. Use scikit-learn only for preprocessing, splitting, and metrics.
Important
This lab is for learning machine learning on tabular clinical-style data. It is not a medical device, not validated for real patients, and must not be used for diagnosis or treatment decisions. Real cardiac care requires qualified professionals, regulated systems, and rigorous validation.
Goals
- Explore the heart.csv features and the HeartDisease target.
- Visualize relationships between variables and the outcome.
- Preprocess mixed numeric and categorical columns for neural network training.
- Train and compare several PyTorch MLP architectures on the same split.
- Report classification metrics, ROC/PR curves, and discuss errors and ethics.
from __future__ import annotations
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    RocCurveDisplay,
    PrecisionRecallDisplay,
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    average_precision_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from torch.utils.data import DataLoader, TensorDataset
plt.rcParams['figure.figsize'] = (8, 4)
torch.manual_seed(42)
np.random.seed(42)
DATA_PATH = Path('heart.csv')
df = pd.read_csv(DATA_PATH)
print('Rows, columns:', df.shape)
print('Target distribution:')
print(df['HeartDisease'].value_counts())
df.head()
Task 1 — Explore the dataset
Answer these questions (use code in the next cell):
- What does one row represent?
- What is the target HeartDisease, and what do its values mean?
- Which columns are numeric vs categorical?
- Are there missing values?
- How balanced are the classes? Why might accuracy alone be misleading?
# TODO: Use `info`, `describe`, groupby summaries, and missing-value checks to answer Task 1.
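The checks Task 1 asks for can be sketched roughly as follows. This runs on a tiny made-up frame (toy, a hypothetical stand-in with invented values and no clinical meaning) so that it executes without heart.csv; swap in df from the loading cell to answer the questions for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for heart.csv: invented values, same kinds of columns.
toy = pd.DataFrame({
    "Age": [54, 61, 45, 39, 67, 58],
    "Cholesterol": [230.0, np.nan, 198.0, 210.0, 275.0, 240.0],
    "ChestPainType": ["ASY", "ASY", "NAP", "ATA", "ASY", "NAP"],
    "HeartDisease": [1, 1, 0, 0, 1, 0],
})

# Numeric vs categorical columns, inferred from dtypes (target excluded).
features = toy.drop(columns="HeartDisease")
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Missing values per column.
missing = toy.isna().sum()

# Class balance as proportions. The largest share is the accuracy you would
# get by always predicting the majority class.
class_share = toy["HeartDisease"].value_counts(normalize=True)
majority_baseline = class_share.max()

# Per-class means of the numeric columns (NaN values are skipped by default).
group_means = toy.groupby("HeartDisease")[numeric_cols].mean()

print(numeric_cols, categorical_cols)
print(missing)
print(class_share)
print(group_means)
```

Comparing model accuracy against majority_baseline is one way to see why accuracy alone can mislead on imbalanced classes. It may also be worth checking for placeholder values (for example zeros in a measurement column), which isna() will not flag.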
Task 2 — Visual exploration
Create at least four plots that help you understand risk factors. Ideas (you may choose others):
- Age distribution by HeartDisease label (histogram or KDE).
- Boxplots of Oldpeak, MaxHR, or Cholesterol by label.
- A categorical feature (for example ChestPainType) vs HeartDisease counts or proportions.
- Correlation heatmap for numeric features plus the target.
After plotting, write a short note for each figure: what pattern you see and how strong you think it is.
# TODO: Build at least four informative visualizations for Task 2.
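As a starting point, two of the plot ideas above might look like the sketch below. The frame toy is a hypothetical stand-in with invented values, and the choice of ChestPainType and Oldpeak is only an example; the real figures should use df.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in rows; invented values, not taken from heart.csv.
toy = pd.DataFrame({
    "ChestPainType": ["ASY", "ASY", "ASY", "NAP", "NAP", "ATA", "ATA", "ATA"],
    "Oldpeak": [2.0, 1.5, 0.0, 0.5, 1.0, 0.0, 0.2, 0.0],
    "HeartDisease": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Share of each label within every chest-pain category (each row sums to 1).
props = pd.crosstab(toy["ChestPainType"], toy["HeartDisease"], normalize="index")

fig, (ax_bar, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked bars: proportions are easier to compare than raw counts
# when the categories have different sizes.
props.plot(kind="bar", stacked=True, ax=ax_bar)
ax_bar.set_ylabel("Proportion")
ax_bar.set_title("HeartDisease share by ChestPainType")

# Boxplot of one numeric feature, split by label (groupby sorts labels: 0, then 1).
groups = [g["Oldpeak"].to_numpy() for _, g in toy.groupby("HeartDisease")]
ax_box.boxplot(groups)
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["0", "1"])
ax_box.set_xlabel("HeartDisease")
ax_box.set_ylabel("Oldpeak")

fig.tight_layout()
```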
Task 3 — Preprocess for neural networks
- Split first into train / validation / test with stratification on HeartDisease.
- Fit the preprocessor only on training rows: StandardScaler on numeric columns, OneHotEncoder on categoricals (handle_unknown='ignore', dense output so you get a NumPy array).
- Convert features to float32 NumPy arrays for PyTorch. The binary target should work with BCEWithLogitsLoss using one logit (a single output neuron before sigmoid).
Columns (you can verify from the data): numeric — Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak; categorical — Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope.
TARGET = 'HeartDisease'
# TODO: Build X (features dataframe), y (float32 0/1 array), lists numeric_features / categorical_features.
# TODO: ColumnTransformer with StandardScaler + OneHotEncoder(sparse_output=False, handle_unknown='ignore').
# TODO: Stratified splits → X_train_df, X_val_df, X_test_df and matching y vectors (~64% / 16% / 20%).
# TODO: preprocessor.fit(train only); transform train/val/test → X_train, X_val, X_test as float32.
# TODO: input_dim = X_train.shape[1]; print train/val/test sizes and input_dim.
Task 4 — PyTorch MLPs: three architectures
Train three classifiers and compare them:
| Model | Description |
|---|---|
| MLP_Small | One hidden layer (32 units). Strong baseline for small tabular data. |
| MLP_Deep | Three hidden layers (64→48→24) with dropout between hidden layers. |
| MLP_Wide | Two wide layers (128→128) with dropout. |
Use ReLU hidden activations and nn.BCEWithLogitsLoss. Train with the Adam optimizer and batch size 32, and apply early stopping on validation loss: stop when validation loss has not improved for a set number of epochs (try a patience of about 25).
Implement:
- A small helper to build DataLoaders from NumPy arrays (TensorDataset).
- An MLP module: stack Linear → ReLU (and Dropout where you use it), then a final Linear to one output.
- A training function that records train/validation loss per epoch and restores the best validation weights.
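The module described above could be sketched as follows. The constructor signature (hidden as a tuple of widths, a single dropout rate), the dropout value 0.2, and the placeholder input_dim are assumptions made for illustration. As a simplification, this version places dropout after every hidden layer, including the last one.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Linear -> ReLU (-> Dropout) blocks, then one Linear layer producing a single logit."""

    def __init__(self, input_dim: int, hidden: tuple[int, ...], dropout: float = 0.0):
        super().__init__()
        layers: list[nn.Module] = []
        prev = input_dim
        for width in hidden:
            layers.append(nn.Linear(prev, width))
            layers.append(nn.ReLU())
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            prev = width
        # No sigmoid here: BCEWithLogitsLoss applies it internally.
        layers.append(nn.Linear(prev, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Drop the trailing size-1 dimension so the output has shape (batch,).
        return self.net(x).squeeze(-1)


def count_params(model: nn.Module) -> int:
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


input_dim = 20  # placeholder; in the lab this would be X_train.shape[1]
models = {
    "MLP_Small": MLP(input_dim, (32,)),
    "MLP_Deep": MLP(input_dim, (64, 48, 24), dropout=0.2),  # dropout rate is an assumption
    "MLP_Wide": MLP(input_dim, (128, 128), dropout=0.2),
}
for name, model in models.items():
    print(name, count_params(model))
```

Returning a tensor of shape (batch,) keeps it aligned with a float32 target vector of the same shape, which is what BCEWithLogitsLoss expects.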
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

def make_loaders(X_tr, y_tr, X_va, y_va, batch_size: int = 32):
    tr_ds = TensorDataset(torch.from_numpy(X_tr), torch.from_numpy(y_tr))
    va_ds = TensorDataset(torch.from_numpy(X_va), torch.from_numpy(y_va))
    return (
        DataLoader(tr_ds, batch_size=batch_size, shuffle=True),
        DataLoader(va_ds, batch_size=batch_size, shuffle=False),
    )
# TODO: class MLP(nn.Module): ...
# TODO: def train_mlp(...) -> trained model + history dict with 'train_loss' and 'val_loss' lists
# TODO: Instantiate MLP_Small, MLP_Deep, MLP_Wide; train each; store in `trained` and `histories`
trained: dict[str, nn.Module] = {}
histories: dict[str, dict] = {}
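A training loop with early stopping might look roughly like the sketch below. The function signature, the default learning rate, and the demo model are assumptions; the demo runs on synthetic data so the snippet is self-contained. Batches are moved to whatever device the model's parameters live on.

```python
import copy

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_mlp(model, train_loader, val_loader, max_epochs=200, patience=25, lr=1e-3):
    """Adam + BCEWithLogitsLoss; early-stop on validation loss and restore the best weights."""
    device = next(model.parameters()).device
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = {"train_loss": [], "val_loss": []}
    best_val, stale = float("inf"), 0
    best_state = copy.deepcopy(model.state_dict())

    for _ in range(max_epochs):
        model.train()
        total, seen = 0.0, 0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = criterion(model(xb).reshape(-1), yb)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)
            seen += len(xb)
        history["train_loss"].append(total / seen)

        model.eval()  # disables dropout for evaluation
        total, seen = 0.0, 0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                total += criterion(model(xb).reshape(-1), yb).item() * len(xb)
                seen += len(xb)
        val_loss = total / seen
        history["val_loss"].append(val_loss)

        if val_loss < best_val:
            best_val, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs

    model.load_state_dict(best_state)
    return model, history


# Tiny synthetic demo: random features, label depends on the first feature.
torch.manual_seed(0)
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 5)).astype(np.float32)
y = (X[:, 0] > 0).astype(np.float32)
train_ds = TensorDataset(torch.from_numpy(X[:128]), torch.from_numpy(y[:128]))
val_ds = TensorDataset(torch.from_numpy(X[128:]), torch.from_numpy(y[128:]))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)

demo = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
demo, history = train_mlp(demo, train_loader, val_loader, max_epochs=30, patience=5)
print(len(history["train_loss"]), min(history["val_loss"]))
```

Deep-copying the state dict matters here: state_dict() returns references to the live tensors, so a plain assignment would be overwritten by later optimizer steps.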
Task 5 — Compare validation learning curves
Plot validation loss vs epoch for each model. If training loss keeps falling but validation loss rises, what does that suggest?
# TODO: One figure: validation loss curves for all models in `histories`.
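A single-figure comparison could be sketched like this. The loss values below are invented placeholders (model_a and model_b are hypothetical names) so the snippet runs on its own; in the lab the histories dict from Task 4 would take their place.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder curves with invented numbers, for illustration only.
histories = {
    "model_a": {"train_loss": [0.70, 0.55, 0.45, 0.38, 0.33],
                "val_loss": [0.69, 0.58, 0.52, 0.53, 0.56]},
    "model_b": {"train_loss": [0.71, 0.60, 0.54, 0.50, 0.47],
                "val_loss": [0.70, 0.62, 0.57, 0.54, 0.53]},
}

fig, ax = plt.subplots()
best_epochs = {}
for name, hist in histories.items():
    val = np.asarray(hist["val_loss"])
    epochs = np.arange(1, len(val) + 1)
    (line,) = ax.plot(epochs, val, label=name)
    # Mark the epoch with the lowest validation loss,
    # i.e. the weights that early stopping would restore.
    best = int(val.argmin())
    best_epochs[name] = best + 1
    ax.scatter(epochs[best], val[best], color=line.get_color(), zorder=3)

ax.set_xlabel("Epoch")
ax.set_ylabel("Validation loss (BCE)")
ax.set_title("Validation loss by model")
ax.legend()
```

In the placeholder data, model_a's training loss keeps falling while its validation loss turns upward after epoch 3, which is the pattern the question above asks you to interpret.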
Task 6 — Test-set metrics and ROC / precision–recall
On the held-out test set, report accuracy, ROC-AUC, and average precision (PR-AUC), and show confusion matrices. Use classification_report and comment on recall for the positive class (disease present). Why might it matter in screening?
# TODO: Implement predict_proba(model, X_numpy) using torch.no_grad(), logits, torch.sigmoid.
# TODO: For each model in `trained`, collect test probabilities, threshold at 0.5 for confusion matrix / report.
# TODO: Build a small metrics table (accuracy, roc_auc, pr_auc, parameter count).
metrics_df = None # replace with your summary DataFrame
test_probs: dict[str, np.ndarray] = {}
# TODO: Side-by-side ROC and precision–recall curves for all models (test set). Use RocCurveDisplay / PrecisionRecallDisplay.
# TODO: Plot confusion matrices (one subplot per model) at threshold 0.5.
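The probability helper and a one-row metrics table might be sketched as below. The model here is an untrained stand-in with random weights and the test data is synthetic, so the printed numbers mean nothing; only the mechanics carry over. The name demo_model and the table layout are assumptions.

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)


def predict_proba(model: nn.Module, X: np.ndarray) -> np.ndarray:
    """Return P(class 1) per row: logits passed through a sigmoid, no gradient tracking."""
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        logits = model(torch.from_numpy(X).to(device)).reshape(-1)
        return torch.sigmoid(logits).cpu().numpy()


# Synthetic stand-in for a trained model and a test set.
torch.manual_seed(0)
rng = np.random.default_rng(0)
X_test = rng.normal(size=(40, 5)).astype(np.float32)
y_test = (rng.random(40) < 0.5).astype(int)
model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1))

probs = predict_proba(model, X_test)
preds = (probs >= 0.5).astype(int)  # threshold at 0.5

row = {
    "accuracy": accuracy_score(y_test, preds),
    "roc_auc": roc_auc_score(y_test, probs),          # uses probabilities, not labels
    "pr_auc": average_precision_score(y_test, probs),
    "n_params": sum(p.numel() for p in model.parameters()),
}
metrics_df = pd.DataFrame([row], index=["demo_model"])

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, preds, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
recall_pos = tp / (tp + fn)  # share of actual positives the model catches

print(metrics_df)
print(cm, recall_pos)
```

Note that ROC-AUC and average precision are computed from the probabilities, while accuracy and the confusion matrix depend on the chosen threshold.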
Task 7 — Predicted probabilities
For your best model by ROC-AUC (or another rule you justify), plot histograms of predicted probabilities on the test set, split by true label (actual negatives vs actual positives). Do the distributions separate?
# TODO: Pick best model (e.g. metrics_df['roc_auc'].idxmax()), plot overlapping histograms of predicted probabilities by true label.
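The overlapping histograms could be drawn along these lines. The probabilities here are invented (sampled from beta distributions) purely so the snippet runs; in the lab they would come from the chosen model's test-set predictions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented probabilities standing in for the best model's test-set output.
rng = np.random.default_rng(0)
y_test = np.array([0] * 50 + [1] * 50)
probs = np.where(y_test == 1, rng.beta(5, 2, 100), rng.beta(2, 5, 100))

# Shared bin edges so the two histograms are directly comparable.
bins = np.linspace(0, 1, 21)

fig, ax = plt.subplots()
ax.hist(probs[y_test == 0], bins=bins, alpha=0.6, label="True label 0")
ax.hist(probs[y_test == 1], bins=bins, alpha=0.6, label="True label 1")
ax.axvline(0.5, linestyle="--", color="gray", label="Threshold 0.5")
ax.set_xlabel("Predicted probability of HeartDisease = 1")
ax.set_ylabel("Count")
ax.legend()
```

Mass from the label-1 histogram that falls left of the threshold line corresponds to false negatives, and label-0 mass to the right corresponds to false positives.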
Task 8 — Interpretation, limitations, and ethics
Write short paragraphs (bullet points are fine) on:
- Which features showed the clearest signal in your EDA?
- Which architecture performed best on your metrics, and might it be overfitting?
- Cost of errors: Compare false negatives vs false positives in a screening-style setting.
- Data and fairness: Who is represented in this table? What could go wrong if a model like this were deployed without rigorous validation?
- What would you do next to make this workflow more trustworthy (data, evaluation, or process)?