Open-Ended Lab - Teen Mental Health and Social Media
This lab uses Teen_Mental_Health_Dataset.csv.
Important note
This lab is for learning data exploration and machine learning workflows. It is not a medical or clinical diagnostic tool. Mental health data should be handled carefully, respectfully, and with attention to privacy and bias.
Goals
- Explore social media, sleep, activity, stress, anxiety, and depression-label variables.
- Visualize relationships among behavioral and mental health indicators.
- Engineer features that may support modeling.
- Train a simple model to predict depression_label.
- Reflect on ethical limits and risks.
In [ ]:
# Standard library
from pathlib import Path

# Plotting and data handling
import matplotlib.pyplot as plt
import pandas as pd

# Modeling utilities used in Task 4
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the dataset from the same folder as this notebook
DATA_PATH = Path('Teen_Mental_Health_Dataset.csv')
df = pd.read_csv(DATA_PATH)
print('Rows:', len(df))
df.head()
Task 1 - Explore the dataset
Answer these questions:
- What does one row represent?
- Which variables describe social media behavior?
- Which variables describe health, wellbeing, or lifestyle?
- What values appear in depression_label?
- Are the classes balanced or imbalanced?
- Are there missing values?
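As a starting point, the questions above can be approached with a few pandas calls. The sketch below runs on a small made-up table rather than the lab CSV, and every column name except depression_label is a hypothetical placeholder, so swap in the real column names after looking at df.head().

```python
import pandas as pd

# Toy stand-in for the lab data. Values are invented for illustration, and
# all column names except depression_label are hypothetical.
toy = pd.DataFrame({
    "daily_social_media_hours": [2.0, 5.5, None, 7.0, 1.5, 6.0],
    "sleep_hours": [8.0, 6.0, 7.5, 5.0, None, 6.5],
    "platform_usage": ["A", "B", "A", "C", "B", "A"],
    "depression_label": [0, 0, 0, 1, 0, 1],
})

toy.info()                      # column dtypes and non-null counts
summary = toy.describe()        # summary statistics for numeric columns
missing = toy.isna().sum()      # number of missing values per column

# Target class counts and proportions, to judge class balance
class_counts = toy["depression_label"].value_counts()
class_share = toy["depression_label"].value_counts(normalize=True)

print(summary)
print(missing)
print(class_counts)
print(class_share.round(2))
```

The same calls should work on df once the column names are adjusted. If one class is much rarer than the other, keep that in mind when reading accuracy in Task 4.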
In [ ]:
# TODO: inspect info, summary statistics, missing values, and target class counts.
Task 2 - Visual exploration
Create at least four visualizations. Suggested ideas:
- Distribution of daily social media hours.
- Sleep hours by depression label.
- Stress level by depression label.
- Anxiety level by depression label.
- Platform usage counts.
- Scatter plot of social media hours vs sleep hours.
After each chart, write what you notice and whether the pattern seems strong, weak, or uncertain.
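Three of the suggested charts could be drawn roughly as follows. This is a minimal sketch on invented values, and the column names other than depression_label are assumptions, so adapt them to the real dataset before drawing conclusions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values for illustration only; column names other than
# depression_label are hypothetical.
toy = pd.DataFrame({
    "daily_social_media_hours": [2.0, 5.5, 3.0, 7.0, 1.5, 6.0, 4.0, 8.0],
    "sleep_hours": [8.0, 6.0, 7.5, 5.0, 8.5, 6.5, 7.0, 4.5],
    "depression_label": [0, 0, 0, 1, 0, 1, 0, 1],
})

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# 1. Distribution of daily social media hours
axes[0].hist(toy["daily_social_media_hours"], bins=5)
axes[0].set_xlabel("Daily social media hours")
axes[0].set_ylabel("Count")

# 2. Sleep hours by depression label, one box per label value
grouped = {
    str(label): group["sleep_hours"].to_numpy()
    for label, group in toy.groupby("depression_label")
}
axes[1].boxplot(list(grouped.values()))
axes[1].set_xticklabels(list(grouped.keys()))
axes[1].set_xlabel("depression_label")
axes[1].set_ylabel("Sleep hours")

# 3. Social media hours vs sleep hours, colored by label
axes[2].scatter(
    toy["daily_social_media_hours"],
    toy["sleep_hours"],
    c=toy["depression_label"],
)
axes[2].set_xlabel("Daily social media hours")
axes[2].set_ylabel("Sleep hours")

fig.tight_layout()
```

A pattern in a chart like these shows association in this sample only. It does not show that one variable causes another.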
In [ ]:
# TODO: create at least four visualizations.
Task 3 - Feature engineering
Create features_df and add at least three engineered features. Suggested ideas:
- high_social_media_use: daily social media hours at or above a threshold you choose.
- low_sleep: sleep hours below a threshold you choose.
- total_distress_score: stress + anxiety + addiction.
- screen_sleep_ratio: screen time before sleep divided by sleep hours.
- active_social_balance: physical activity minus daily social media hours.
Explain why each feature might be useful.
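One possible way to build the five suggested features is sketched below on a toy table. The source column names (stress_level, anxiety_level, addiction_level, screen_time_before_sleep, physical_activity, and so on) and the thresholds of 5 and 6 hours are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; all source column names are hypothetical.
toy = pd.DataFrame({
    "daily_social_media_hours": [2.0, 5.5, 7.0, 1.5],
    "sleep_hours": [8.0, 6.0, 5.0, 0.0],
    "stress_level": [3, 6, 9, 2],
    "anxiety_level": [2, 5, 8, 1],
    "addiction_level": [1, 6, 9, 2],
    "screen_time_before_sleep": [0.5, 1.5, 3.0, 1.0],
    "physical_activity": [1.5, 0.5, 0.0, 2.0],
})

toy_features = toy.copy()

# Binary flags; the thresholds (5 and 6 hours) are arbitrary example choices.
toy_features["high_social_media_use"] = (
    toy_features["daily_social_media_hours"] >= 5
).astype(int)
toy_features["low_sleep"] = (toy_features["sleep_hours"] < 6).astype(int)

# Simple additive score across the three distress-related columns
toy_features["total_distress_score"] = (
    toy_features["stress_level"]
    + toy_features["anxiety_level"]
    + toy_features["addiction_level"]
)

# Ratio feature; zero sleep hours become NaN to avoid dividing by zero.
safe_sleep = toy_features["sleep_hours"].replace(0, np.nan)
toy_features["screen_sleep_ratio"] = (
    toy_features["screen_time_before_sleep"] / safe_sleep
)

# Difference feature: positive means more activity than social media time.
toy_features["active_social_balance"] = (
    toy_features["physical_activity"] - toy_features["daily_social_media_hours"]
)

print(toy_features.iloc[:, -5:])
```

The additive score treats the three scales as equally weighted and comparable, which may not hold if they use different ranges. That is worth checking before relying on it.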
In [ ]:
features_df = df.copy()
# TODO: create engineered features.
# TODO: display your engineered features.
Task 4 - Build a simple model
Train a simple model to predict depression_label.
Guidelines:
- Use a train/test split.
- Include both numeric and categorical features.
- Print a classification report and confusion matrix.
- Pay attention to recall for the positive class.
Reminder: this model is for learning only and should not be used for diagnosis.
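The guidelines above could be wired together along these lines. The sketch uses randomly generated data with a made-up labeling rule, hypothetical feature names, and class_weight="balanced" as one optional way to favor recall on the rarer class. None of these choices come from the dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data for illustration only; feature names are hypothetical.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "daily_social_media_hours": rng.uniform(0, 10, n),
    "sleep_hours": rng.uniform(4, 10, n),
    "platform_usage": rng.choice(["A", "B", "C"], n),
})
# Made-up labeling rule plus noise; this is not a real finding.
score = toy["daily_social_media_hours"] - toy["sleep_hours"] + rng.normal(0, 1.5, n)
toy["depression_label"] = (score > 1.0).astype(int)

numeric_cols = ["daily_social_media_hours", "sleep_hours"]
categorical_cols = ["platform_usage"]
X = toy[numeric_cols + categorical_cols]
y = toy["depression_label"]

# Stratify so both classes appear in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Impute and scale numeric columns; impute and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numeric_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     categorical_cols),
])
model = make_pipeline(
    preprocess,
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
positive_recall = cm[1, 1] / cm[1].sum()
print(cm)
print("Recall for the positive class:", round(positive_recall, 2))
```

To draw the matrix instead of printing it, ConfusionMatrixDisplay.from_predictions(y_test, y_pred) can be used with the import from the first cell. Keeping the preprocessing inside the pipeline means imputation and scaling are fit on the training split only.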
In [ ]:
# TODO: choose features, train a model, and evaluate it.
Task 5 - Open-ended interpretation and ethics
Answer in complete sentences:
- Which variables appear most related to depression_label?
- Which engineered features seem useful? Why?
- How well did the model perform, especially on the positive class?
- What mistakes would be most concerning in this use case?
- What privacy, fairness, or ethical concerns should be considered?
- What additional data or safeguards would be needed before using a model like this in the real world?
Your answers:
- TODO
- TODO
- TODO
- TODO
- TODO
- TODO