Open-Ended Lab - Teen Mental Health and Social Media

This lab uses Teen_Mental_Health_Dataset.csv.

Important note

This lab is for learning data exploration and machine learning workflows. It is not a medical or clinical diagnostic tool. Mental health data should be handled carefully, respectfully, and with attention to privacy and bias.

Goals

  • Explore social media, sleep, activity, stress, anxiety, and depression-label variables.
  • Visualize relationships among behavioral and mental health indicators.
  • Engineer features that may support modeling.
  • Train a simple model to predict depression_label.
  • Reflect on ethical limits and risks.
In [ ]:
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Relative path: the CSV is expected in the notebook's working directory.
DATA_PATH = Path('Teen_Mental_Health_Dataset.csv')
df = pd.read_csv(DATA_PATH)

print('Rows:', len(df))
df.head()

Task 1 - Explore the dataset

Answer these questions:

  1. What does one row represent?
  2. Which variables describe social media behavior?
  3. Which variables describe health, wellbeing, or lifestyle?
  4. What values appear in depression_label?
  5. Are the classes balanced or imbalanced?
  6. Are there missing values?
In [ ]:
# TODO: inspect info, summary statistics, missing values, and target class counts.
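
One possible way to approach these checks is sketched below on a tiny made-up table rather than the lab data. The column names (daily_social_media_hours, sleep_hours) and the 0/1 coding of depression_label are assumptions for illustration only; check df.columns and the actual label values before adapting it.

```python
import pandas as pd

# Toy stand-in for the lab data. The column names and the 0/1 label
# coding are assumptions, not taken from the real CSV.
toy = pd.DataFrame({
    "daily_social_media_hours": [2.5, 6.0, None, 4.0, 7.5, 1.0],
    "sleep_hours": [8.0, 5.5, 7.0, 6.0, 5.0, 8.5],
    "depression_label": [0, 1, 0, 0, 1, 0],
})

# Column dtypes and non-null counts.
toy.info()

# Summary statistics for the numeric columns.
summary = toy.describe()
print(summary)

# Number of missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Class counts and class shares for the target.
class_counts = toy["depression_label"].value_counts()
class_share = toy["depression_label"].value_counts(normalize=True)
print(class_counts)
print(class_share.round(2))
```

Comparing class_share across labels gives a quick read on whether the classes are balanced.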

Task 2 - Visual exploration

Create at least four visualizations. Suggested ideas:

  • Distribution of daily social media hours.
  • Sleep hours by depression label.
  • Stress level by depression label.
  • Anxiety level by depression label.
  • Platform usage counts.
  • Scatter plot of social media hours vs sleep hours.

After each chart, write what you notice and whether the pattern seems strong, weak, or uncertain.

In [ ]:
# TODO: create at least four visualizations.
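
As a starting point, two of the suggested charts could look roughly like the sketch below. It runs on made-up values, and the column names and the 0/1 label coding are assumptions; substitute the names found in the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the lab data. Column names are assumptions.
toy = pd.DataFrame({
    "daily_social_media_hours": [1.0, 2.5, 3.0, 4.0, 5.5, 6.0, 7.5, 8.0],
    "sleep_hours": [8.5, 8.0, 7.5, 7.0, 6.0, 5.5, 5.0, 4.5],
    "depression_label": [0, 0, 0, 0, 1, 0, 1, 1],
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Distribution of daily social media hours (missing values dropped first).
ax_hist.hist(toy["daily_social_media_hours"].dropna(), bins=5)
ax_hist.set_xlabel("Daily social media hours")
ax_hist.set_ylabel("Number of teens")
ax_hist.set_title("Distribution of social media use")

# Sleep hours split by depression label, one box per label value.
labels = []
groups = []
for label, group in toy.groupby("depression_label"):
    labels.append(str(label))
    groups.append(group["sleep_hours"].dropna().to_numpy())

ax_box.boxplot(groups)
ax_box.set_xticklabels(labels)
ax_box.set_xlabel("depression_label")
ax_box.set_ylabel("Sleep hours")
ax_box.set_title("Sleep hours by depression label")

fig.tight_layout()
```

With so few rows the toy charts mean nothing by themselves; the point is only the plotting pattern.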

Task 3 - Feature engineering

Create features_df and add at least three engineered features. Suggested ideas:

  • high_social_media_use: daily social media hours at or above a threshold you choose.
  • low_sleep: sleep hours below a threshold you choose.
  • total_distress_score: the sum of the stress, anxiety, and addiction scores.
  • screen_sleep_ratio: screen time before sleep divided by sleep hours (decide how to handle rows where sleep hours is zero or missing).
  • active_social_balance: physical activity minus daily social media hours.

Explain why each feature might be useful.

In [ ]:
features_df = df.copy()

# TODO: create engineered features.

# TODO: display your engineered features.
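
The suggested features could be built along the lines of the sketch below. It uses a small made-up table; the column names, the 5-hour and 6-hour thresholds, and the choice to treat zero sleep hours as missing are all assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab data. Column names are assumptions.
toy = pd.DataFrame({
    "daily_social_media_hours": [2.0, 6.0, 4.5, 7.0],
    "sleep_hours": [8.0, 5.0, 0.0, 6.0],
    "screen_time_before_sleep": [0.5, 2.0, 1.0, 3.0],
    "stress_level": [3, 8, 5, 9],
    "anxiety_level": [2, 7, 4, 8],
    "addiction_level": [1, 6, 3, 9],
})

# Illustrative thresholds only; choose and justify your own.
SOCIAL_HOURS_THRESHOLD = 5.0
SLEEP_HOURS_THRESHOLD = 6.0

feats = toy.copy()

# 1 when daily use is at or above the threshold, else 0.
feats["high_social_media_use"] = (
    feats["daily_social_media_hours"] >= SOCIAL_HOURS_THRESHOLD
).astype(int)

# 1 when sleep is strictly below the threshold, else 0.
feats["low_sleep"] = (feats["sleep_hours"] < SLEEP_HOURS_THRESHOLD).astype(int)

# Row-wise sum of the three self-reported scores.
feats["total_distress_score"] = feats[
    ["stress_level", "anxiety_level", "addiction_level"]
].sum(axis=1)

# Zero sleep hours become NaN so the ratio is missing rather than infinite.
feats["screen_sleep_ratio"] = (
    feats["screen_time_before_sleep"] / feats["sleep_hours"].replace(0, np.nan)
)

print(feats[["high_social_media_use", "low_sleep",
             "total_distress_score", "screen_sleep_ratio"]])
```

If the three scores are measured on different scales, a plain sum lets the widest scale dominate, which is worth mentioning when you explain the feature.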

Task 4 - Build a simple model

Train a simple model to predict depression_label.

Guidelines:

  • Use a train/test split; if the classes are imbalanced, consider stratifying on depression_label.
  • Include both numeric and categorical features.
  • Print a classification report and confusion matrix.
  • Pay attention to recall for the positive class (the share of actual positive cases the model identifies).

Reminder: this model is for learning only and should not be used for diagnosis.

In [ ]:
# TODO: choose features, train a model, and evaluate it.
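
One way the guidelines above might fit together is sketched below, using the same scikit-learn pieces imported at the top of the notebook. The data is synthetic, and the column names, the platform categories, the 0/1 label coding with 1 as the positive class, and the choice of class_weight="balanced" are all assumptions for illustration. Treat it as a template for the workflow, not as a result about teen mental health.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the lab data. Column names are assumptions.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "daily_social_media_hours": rng.uniform(0, 10, n),
    "sleep_hours": rng.uniform(4, 10, n),
    "platform": rng.choice(["A", "B", "C"], n),
})
# Artificial label loosely tied to the features so the model has a signal.
score = toy["daily_social_media_hours"] - toy["sleep_hours"] + rng.normal(0, 1.5, n)
toy["depression_label"] = (score > 0).astype(int)

numeric_cols = ["daily_social_media_hours", "sleep_hours"]
categorical_cols = ["platform"]

# Impute and scale numeric columns; impute and one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numeric_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     categorical_cols),
])
model = make_pipeline(
    preprocess,
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

X = toy[numeric_cols + categorical_cols]
y = toy["depression_label"]

# Stratify so both splits keep roughly the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)  # rows: true label, columns: predicted
print(cm)

# Recall for the class assumed to be positive (label 1).
positive_recall = recall_score(y_test, y_pred, pos_label=1)
print("Positive-class recall:", round(positive_recall, 3))
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are fitted on the training rows only, which avoids leaking test-set information into the model.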

Task 5 - Open-ended interpretation and ethics

Answer in complete sentences:

  1. Which variables appear most related to depression_label?
  2. Which engineered features seem useful? Why?
  3. How well did the model perform, especially on the positive class?
  4. What mistakes would be most concerning in this use case?
  5. What privacy, fairness, or ethical concerns should be considered?
  6. What additional data or safeguards would be needed before using a model like this in the real world?

Your answers:

  1. TODO
  2. TODO
  3. TODO
  4. TODO
  5. TODO
  6. TODO