Lab — Heart disease risk

This notebook uses heart.csv, which the first code cell reads from the notebook's working directory.

Models: Build and train PyTorch MLPs. scikit-learn is for preprocessing, splitting, and metrics only.

Important

This lab is for learning machine learning on tabular clinical-style data. It is not a medical device, not validated for real patients, and must not be used for diagnosis or treatment decisions. Real cardiac care requires qualified professionals, regulated systems, and rigorous validation.

Goals

  • Explore the heart.csv features and the HeartDisease target.
  • Visualize relationships between variables and the outcome.
  • Preprocess mixed numeric and categorical columns for neural network training.
  • Train and compare several PyTorch MLP architectures on the same split.
  • Report classification metrics and ROC/PR curves, then discuss errors and ethics.
In [ ]:
from __future__ import annotations

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    RocCurveDisplay,
    PrecisionRecallDisplay,
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    average_precision_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from torch.utils.data import DataLoader, TensorDataset

plt.rcParams['figure.figsize'] = (8, 4)
torch.manual_seed(42)
np.random.seed(42)

DATA_PATH = Path('heart.csv')
df = pd.read_csv(DATA_PATH)

print('Rows, columns:', df.shape)
print('Target distribution:')
print(df['HeartDisease'].value_counts())
df.head()

Task 1 — Explore the dataset

Answer these questions (use code in the next cell):

  1. What does one row represent?
  2. What is the target HeartDisease, and what do its values mean?
  3. Which columns are numeric vs categorical?
  4. Are there missing values?
  5. How balanced are the classes? Why might accuracy alone be misleading?
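The checks behind questions 3 to 5 can be sketched on a toy table. The frame below is hypothetical stand-in data, not heart.csv, and the name `toy` is illustrative only; the same calls should apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in rows; the lab itself uses df = pd.read_csv('heart.csv').
toy = pd.DataFrame({
    "Age": [40, 49, 37, 58, 61, 54, 45, 63],
    "Sex": ["M", "F", "M", "M", "F", "M", "F", "M"],
    "Cholesterol": [289.0, 180.0, None, 214.0, 195.0, 339.0, 237.0, 208.0],
    "HeartDisease": [0, 1, 0, 1, 1, 1, 0, 1],
})

# Numeric vs categorical columns, inferred from dtypes (the target counts as numeric).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Missing values per column.
missing = toy.isna().sum()

# Class balance as proportions, and the accuracy of always predicting the majority class.
class_share = toy["HeartDisease"].value_counts(normalize=True)
majority_baseline = class_share.max()

# Per-class means of the numeric features.
by_label = toy.groupby("HeartDisease")[["Age", "Cholesterol"]].mean()

print(numeric_cols, categorical_cols)
print(missing)
print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(by_label)
```

The majority-class baseline is the number to beat: a model whose accuracy is close to it has learned little, which is one reason accuracy alone can mislead on imbalanced data.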
In [ ]:
# TODO: Use `info`, `describe`, groupby summaries, and missing-value checks to answer Task 1.

Task 2 — Visual exploration

Create at least four plots that help you understand risk factors. Ideas (you may choose others):

  • Age distribution by HeartDisease label (histogram or KDE).
  • Boxplots of Oldpeak, MaxHR, or Cholesterol by label.
  • A categorical feature (for example ChestPainType) vs HeartDisease counts or proportions.
  • Correlation heatmap for numeric features plus the target.

After plotting, write a short note for each figure: what pattern you see and how strong you think it is.
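As a starting point, here is a minimal sketch of two of the plot ideas above: a numeric feature split by label, and a categorical feature shown as within-category proportions. The data is randomly generated (not heart.csv), and `toy` and `rates` are illustrative names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; positives are shifted about 5 years older on purpose.
rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Age": rng.normal(50, 9, size=n) + 5 * label,
    "ChestPainType": rng.choice(["ASY", "ATA", "NAP", "TA"], size=n),
    "HeartDisease": label,
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid age histograms, one per class, drawn on shared bins so they are comparable.
bins = np.linspace(toy["Age"].min(), toy["Age"].max(), 21)
for value, group in toy.groupby("HeartDisease"):
    axes[0].hist(group["Age"], bins=bins, alpha=0.6, label=f"HeartDisease={value}")
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Count")
axes[0].legend()

# Share of each label within every category (each row sums to 1).
rates = pd.crosstab(toy["ChestPainType"], toy["HeartDisease"], normalize="index")
rates.plot(kind="bar", stacked=True, ax=axes[1])
axes[1].set_ylabel("Proportion within category")

fig.tight_layout()
```

Proportions are used for the categorical plot because raw counts mix two effects: how common a category is and how risky it is.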

In [ ]:
# TODO: Build at least four informative visualizations for Task 2.

Task 3 — Preprocess for neural networks

  • Split first into train / validation / test with stratification on HeartDisease.
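  • Fit the preprocessor only on training rows: StandardScaler on numeric columns, OneHotEncoder on categoricals (handle_unknown='ignore' and sparse_output=False, so the output is a dense NumPy array). Fitting on training rows only keeps validation and test statistics from leaking into the scaler and encoder.
  • Convert features to float32 NumPy arrays for PyTorch. Keep the binary target as a float32 0/1 array so it works with BCEWithLogitsLoss: the model ends in a single output neuron that emits one logit, and the loss applies the sigmoid internally.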

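Columns (verify these against the data). Numeric: Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak. Categorical: Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope. FastingBS is typically a 0/1 flag in this dataset; if so, scaling it is harmless and it can stay in the numeric list.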

In [ ]:
TARGET = 'HeartDisease'

# TODO: Build X (features dataframe), y (float32 0/1 array), lists numeric_features / categorical_features.
# TODO: ColumnTransformer with StandardScaler + OneHotEncoder(sparse_output=False, handle_unknown='ignore').
# TODO: Stratified splits → X_train_df, X_val_df, X_test_df and matching y vectors (~64% / 16% / 20%).
# TODO: preprocessor.fit(train only); transform train/val/test → X_train, X_val, X_test as float32.
# TODO: input_dim = X_train.shape[1]; print train/val/test sizes and input_dim.

Task 4 — PyTorch MLPs: three architectures

Train three classifiers and compare them:

Model      Description
MLP_Small  One hidden layer (32 units). Strong baseline for small tabular data.
MLP_Deep   Three hidden layers (64→48→24) with dropout between hidden layers.
MLP_Wide   Two wide layers (128→128) with dropout.

Use ReLU hidden activations and nn.BCEWithLogitsLoss. Train with the Adam optimizer, batch size 32, and early stopping on validation loss: stop once validation loss has not improved for a fixed number of consecutive epochs (the patience; try about 25).

Implement:

  1. A small helper to build DataLoaders from NumPy arrays (TensorDataset).
  2. An MLP module: stack Linear → ReLU (and Dropout where you use it), then a final Linear to one output.
  3. A training function that records train/validation loss per epoch and restores the best validation weights.
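One possible shape for steps 2 and 3 is sketched below on synthetic data. It is a minimal example under stated assumptions, not the required solution: the helper `run_epoch` and the parameters `hidden_dims`, `max_epochs`, `patience`, and `lr` are illustrative names, dropout is applied after every hidden layer for simplicity, and everything runs on the CPU rather than the lab's `device`.

```python
import copy

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class MLP(nn.Module):
    """Linear -> ReLU (-> Dropout) blocks followed by a single-logit output layer."""

    def __init__(self, input_dim, hidden_dims, dropout=0.0):
        super().__init__()
        layers, prev = [], input_dim
        for width in hidden_dims:
            layers += [nn.Linear(prev, width), nn.ReLU()]
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            prev = width
        layers.append(nn.Linear(prev, 1))  # one logit; BCEWithLogitsLoss applies the sigmoid
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)  # shape (batch,), matching a 1-D float target


def run_epoch(model, loader, loss_fn, optimizer=None):
    """Mean loss over a loader; weights are updated only when an optimizer is given."""
    training = optimizer is not None
    model.train(training)  # train(False) is eval mode, which disables dropout
    total, count = 0.0, 0
    with torch.set_grad_enabled(training):
        for xb, yb in loader:
            loss = loss_fn(model(xb), yb)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
            count += len(xb)
    return total / count


def train_mlp(model, train_loader, val_loader, max_epochs=200, patience=25, lr=1e-3):
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = {"train_loss": [], "val_loss": []}
    best_val, stale = float("inf"), 0
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(max_epochs):
        history["train_loss"].append(run_epoch(model, train_loader, loss_fn, optimizer))
        val_loss = run_epoch(model, val_loader, loss_fn)
        history["val_loss"].append(val_loss)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())  # snapshot the best weights
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping
    model.load_state_dict(best_state)  # restore the best validation weights
    return model, history


# Synthetic data whose label depends on the first two features.
torch.manual_seed(0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)).astype(np.float32)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(np.float32)
train_loader = DataLoader(
    TensorDataset(torch.from_numpy(X[:160]), torch.from_numpy(y[:160])),
    batch_size=32, shuffle=True,
)
val_loader = DataLoader(
    TensorDataset(torch.from_numpy(X[160:]), torch.from_numpy(y[160:])), batch_size=32
)

model, history = train_mlp(MLP(5, [32]), train_loader, val_loader, max_epochs=30, patience=5)
print(len(history["val_loss"]), min(history["val_loss"]))
```

With this constructor, the three architectures in the table would correspond to hidden_dims of [32], [64, 48, 24], and [128, 128]; the dropout rate is left for you to choose. The deep copy of the state dict matters: without it, the saved "best" weights would keep changing as training continues.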
In [ ]:
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')


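def make_loaders(X_tr, y_tr, X_va, y_va, batch_size: int = 32):
    """Wrap float32 NumPy arrays in DataLoaders; only the training loader is shuffled."""
    # torch.from_numpy keeps the array dtype, so X and y must already be float32,
    # and the target shape must match the model output for BCEWithLogitsLoss.
    tr_ds = TensorDataset(torch.from_numpy(X_tr), torch.from_numpy(y_tr))
    va_ds = TensorDataset(torch.from_numpy(X_va), torch.from_numpy(y_va))
    return (
        DataLoader(tr_ds, batch_size=batch_size, shuffle=True),
        DataLoader(va_ds, batch_size=batch_size, shuffle=False),
    )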


# Result containers, defined first so the code you add below can fill them
# without being reset at the end of the cell.
trained: dict[str, nn.Module] = {}
histories: dict[str, dict] = {}

# TODO: class MLP(nn.Module): ...
# TODO: def train_mlp(...) -> trained model + history dict with 'train_loss' and 'val_loss' lists
# TODO: Instantiate MLP_Small, MLP_Deep, MLP_Wide; train each; store in `trained` and `histories`

Task 5 — Compare validation learning curves

Plot validation loss vs epoch for each model. If training loss keeps falling but validation loss rises, what does that suggest?
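A minimal plotting sketch for this comparison follows. The loss curves are made up to show the two typical shapes and are not results from any model; in the lab, `histories` comes from your training function. The keys "generalizes" and "overfits" are illustrative labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up curves for illustration only.
epochs = np.arange(1, 41)
train_loss = 0.30 + 0.40 * np.exp(-epochs / 10)
histories = {
    "generalizes": {"train_loss": train_loss, "val_loss": train_loss + 0.03},
    "overfits": {
        "train_loss": train_loss,
        # Validation loss falls at first, then drifts upward after epoch 15.
        "val_loss": 0.35 + 0.35 * np.exp(-epochs / 10) + 0.004 * np.maximum(epochs - 15, 0),
    },
}

fig, ax = plt.subplots(figsize=(8, 4))
best_epoch = {}
for name, hist in histories.items():
    val = np.asarray(hist["val_loss"])
    best_epoch[name] = int(np.argmin(val)) + 1  # epochs are 1-indexed in this sketch
    (line,) = ax.plot(epochs, val, label=f"{name} (val)")
    # Training loss as a faint dashed line in the same colour, for reference.
    ax.plot(epochs, hist["train_loss"], linestyle="--", color=line.get_color(), alpha=0.5)
    # Dotted vertical line at the epoch early stopping would restore.
    ax.axvline(best_epoch[name], color=line.get_color(), linestyle=":", linewidth=1)
ax.set_xlabel("Epoch")
ax.set_ylabel("BCE loss")
ax.legend()
fig.tight_layout()
print(best_epoch)
```

In the "overfits" curve the validation minimum arrives well before the last epoch while training loss keeps falling; that widening gap is the pattern the question above asks you to interpret.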

In [ ]:
# TODO: One figure: validation loss curves for all models in `histories`.

Task 6 — Test-set metrics and ROC / precision–recall

On the held-out test set, report accuracy, ROC-AUC, average precision (PR-AUC), and show confusion matrices. Use classification_report and comment on recall for the positive class (disease present) — why might it matter in screening?
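The metric calls can be sketched as below. This is a minimal example under stated assumptions: the "model" is a fixed linear scorer standing in for a trained MLP, the test data is synthetic, and `summary` and `recall_at_03` are illustrative names.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    recall_score,
    roc_auc_score,
)


def predict_proba(model, X_numpy, device="cpu"):
    """Return P(class 1) as a 1-D NumPy array from a single-logit model."""
    model.eval()  # disables dropout
    with torch.no_grad():
        logits = model(torch.from_numpy(X_numpy).to(device))
        return torch.sigmoid(logits).reshape(-1).cpu().numpy()


# Synthetic test set: the label depends on feature 0 plus noise.
rng = np.random.default_rng(0)
X_test = rng.normal(size=(300, 4)).astype(np.float32)
y_test = (X_test[:, 0] + rng.normal(scale=0.8, size=300) > 0).astype(int)

# Stand-in for a trained model: a linear layer with hand-set weights.
model = nn.Linear(4, 1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[2.0, 0.0, 0.0, 0.0]]))
    model.bias.zero_()

probs = predict_proba(model, X_test)
preds = (probs >= 0.5).astype(int)

# For labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
summary = {
    "accuracy": accuracy_score(y_test, preds),
    "roc_auc": roc_auc_score(y_test, probs),            # uses probabilities, not labels
    "pr_auc": average_precision_score(y_test, probs),   # uses probabilities, not labels
    "recall_pos": tp / (tp + fn),
    "n_params": sum(p.numel() for p in model.parameters()),
}

# Lowering the threshold can only keep or raise recall for the positive class.
recall_at_03 = recall_score(y_test, (probs >= 0.3).astype(int))
print(summary, recall_at_03)
```

Note that ROC-AUC and average precision are threshold-free and take probabilities, while accuracy, recall, and the confusion matrix depend on the 0.5 cut-off.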

In [ ]:
test_probs: dict[str, np.ndarray] = {}  # model name -> test-set probabilities; defined first so the code below can fill it

# TODO: Implement predict_proba(model, X_numpy) using torch.no_grad(), logits, torch.sigmoid.
# TODO: For each model in `trained`, collect test probabilities, threshold at 0.5 for confusion matrix / report.
# TODO: Build a small metrics table indexed by model name (accuracy, roc_auc, pr_auc, parameter count).

metrics_df = None  # replace with your summary DataFrame
In [ ]:
# TODO: Side-by-side ROC and precision–recall curves for all models (test set). Use RocCurveDisplay / PrecisionRecallDisplay.
In [ ]:
# TODO: Plot confusion matrices (one subplot per model) at threshold 0.5.

Task 7 — Predicted probabilities

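For your best model by ROC-AUC (or another rule you justify), plot histograms of the predicted probabilities on the test set, grouped by true label: one for cases whose actual label is 0 and one for cases whose actual label is 1. Note that this grouping is by actual label, not by the TN and TP cells of the confusion matrix. Do the distributions separate?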
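Alongside the plot, the degree of separation can be given a number. The sketch below computes a histogram overlap coefficient on shared bins; the probabilities are drawn from beta distributions purely for illustration and are not model output, and the overlap coefficient is one possible summary rather than something the lab prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predicted probabilities, grouped by TRUE label.
probs_neg = rng.beta(2, 5, size=200)  # true label 0: mass toward 0
probs_pos = rng.beta(5, 2, size=200)  # true label 1: mass toward 1

# Shared bins over [0, 1] so the two histograms are directly comparable.
bins = np.linspace(0.0, 1.0, 21)
hist_neg, _ = np.histogram(probs_neg, bins=bins)
hist_pos, _ = np.histogram(probs_pos, bins=bins)

# Convert counts to per-class fractions so unequal class sizes stay comparable.
frac_neg = hist_neg / hist_neg.sum()
frac_pos = hist_pos / hist_pos.sum()

# Overlap coefficient: 0 means the histograms are disjoint, 1 means they coincide.
overlap = np.minimum(frac_neg, frac_pos).sum()
print(f"Histogram overlap: {overlap:.3f}")
```

The same `bins` array can be passed to plt.hist for both groups (with alpha below 1) to draw the overlapping histograms the task asks for.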

In [ ]:
# TODO: Pick best model (e.g. metrics_df['roc_auc'].idxmax()), plot overlapping histograms of predicted probabilities by true label.

Task 8 — Interpretation, limitations, and ethics

Write short paragraphs (bullet points are fine) on:

  1. Which features showed the clearest signal in your EDA?
  2. Which architecture performed best on your metrics, and might it be overfitting?
  3. Cost of errors: Compare false negatives vs false positives in a screening-style setting.
  4. Data and fairness: Who is represented in this table? What could go wrong if a model like this were deployed without rigorous validation?
  5. What would you do next to make this workflow more trustworthy (data, evaluation, or process)?
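For item 3, a small worked example can make the trade-off concrete. The sketch below sweeps the decision threshold and totals a weighted error cost. The 5:1 cost ratio is an assumption chosen only for illustration, the scores are synthetic, and in practice a threshold would be selected on validation data rather than the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Hypothetical scores: positives centre on 0.65, negatives on 0.35.
probs = np.clip(0.35 + 0.3 * y_true + rng.normal(scale=0.2, size=500), 0.0, 1.0)

COST_FN = 5.0  # assumed cost of missing a true case (illustrative value)
COST_FP = 1.0  # assumed cost of a false alarm (illustrative value)


def total_cost(threshold):
    """Weighted count of false negatives and false positives at a threshold."""
    preds = probs >= threshold
    fn = np.sum((y_true == 1) & ~preds)
    fp = np.sum((y_true == 0) & preds)
    return COST_FN * fn + COST_FP * fp


thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]
print(f"Lowest-cost threshold on this grid: {best_threshold:.2f}")
```

When a missed case is treated as costlier than a false alarm, the lowest-cost threshold moves below 0.5, trading more false positives for fewer false negatives. How large that ratio should be is a clinical and ethical judgement, not something the data can supply.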