Open-Ended Lab - Netflix Titles Catalog
This lab uses netflix_titles.csv.
Goals
- Explore a real catalog-style dataset.
- Create useful visualizations.
- Engineer features from dates, text, categories, and durations.
- Train and evaluate a simple model.
- Answer open-ended questions using evidence from the data.
Setup
Run this cell first. You should not need to edit it.
In [ ]:
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
DATA_PATH = Path('netflix_titles.csv')
df = pd.read_csv(DATA_PATH)
print('Rows:', len(df))
print('Columns:', len(df.columns))
df.head()
Task 1 - Understand the dataset
Explore the dataset structure. Answer these questions in notes or markdown:
- What does one row represent?
- Which columns are categorical?
- Which columns are numeric or date-like?
- Which columns have missing values?
- Which columns might be useful for predicting whether a title is a Movie or TV Show?
In [ ]:
# TODO: inspect columns, data types, missing values, and summary statistics.
# Suggested tools: df.info(), df.describe(), df.isna().sum(), df['type'].value_counts()
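One possible way to combine the suggested tools is sketched below. It runs on a small made-up table named sample (a hypothetical stand-in, not rows from netflix_titles.csv) so that it works on its own; in the lab, the same calls would be applied to df.

```python
import pandas as pd

# Hypothetical stand-in rows; the lab itself uses df loaded from netflix_titles.csv.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "country": ["United States", None, "India", None],
    "release_year": [2019, 2020, 2015, 2021],
    "duration": ["90 min", "2 Seasons", "120 min", "95 min"],
})

# Missing values per column.
missing_counts = sample.isna().sum()

# Class balance of the column the lab later tries to predict.
type_counts = sample["type"].value_counts()

# Columns that pandas already stores as numbers.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(sample.dtypes)
print(missing_counts)
print(type_counts)
print("Numeric columns:", numeric_cols)
```

Columns that look numeric or date-like to a reader may still be stored as text, so the dtypes output is worth comparing against a few raw values from df.head().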
Task 2 - Visual exploration
Create at least three visualizations. Suggested ideas:
- Count of Movies vs TV Shows.
- Titles by release year.
- Most common ratings.
- Most common countries.
- Most common genres in listed_in.
After each chart, write 1-2 sentences explaining what you notice.
In [ ]:
# TODO: create at least three charts.
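As a starting point, here is a minimal sketch of two of the suggested charts. It uses a made-up table named sample (hypothetical data) and assumes that listed_in holds several genres separated by commas; check that assumption against the real column before reusing the split.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in rows; replace sample with df in the lab.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "listed_in": ["Dramas, Comedies", "Crime TV Shows, Dramas", "Comedies", "Dramas"],
})

# Count of Movies vs TV Shows.
type_counts = sample["type"].value_counts()

# Assumption: genres are comma-separated, so split them, put one genre
# per row, trim stray spaces, then count.
genre_counts = (
    sample["listed_in"].str.split(",").explode().str.strip().value_counts()
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
type_counts.plot(kind="bar", ax=axes[0], title="Titles by type")
genre_counts.head(10).plot(kind="barh", ax=axes[1], title="Most common genres")
axes[0].set_ylabel("Number of titles")
axes[1].set_xlabel("Number of titles")
fig.tight_layout()
```

In a notebook the figure is drawn when the cell finishes. The same value_counts pattern could be reused for ratings or countries.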
Task 3 - Feature engineering
Create a new table named features_df. Add useful features such as:
- year_added from date_added.
- duration_number from duration.
- genre_count from listed_in.
- description_length from description.
- country_missing as a flag for missing country.
You may add other features if you think they are useful.
In [ ]:
features_df = df.copy()
# TODO: create engineered features.
# TODO: display your engineered columns.
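The sketch below shows one way the five suggested features might be derived. It works on a made-up table named sample (hypothetical data) and rests on assumptions that should be verified against the real file: that date_added looks like "September 25, 2021", that duration starts with a number, and that listed_in is comma-separated.

```python
import pandas as pd

# Hypothetical stand-in rows; the lab applies the same steps to features_df.
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 14, 2017", None],
    "duration": ["90 min", "2 Seasons", "1 Season"],
    "listed_in": ["Dramas, Comedies", "Crime TV Shows", "Kids' TV, Comedies, Dramas"],
    "description": ["A short plot.", None, "Another plot summary."],
    "country": ["United States", None, "India"],
})

features = sample.copy()

# Assumed date format "Month day, year"; unparseable or missing values become NaT.
added = pd.to_datetime(
    features["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
features["year_added"] = added.dt.year

# First run of digits in the duration text, as a number.
features["duration_number"] = (
    features["duration"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Number of comma-separated genres.
features["genre_count"] = features["listed_in"].str.split(",").str.len()

# Character count of the description; a missing description counts as 0.
features["description_length"] = features["description"].fillna("").str.len()

# 1 when country is missing, otherwise 0.
features["country_missing"] = features["country"].isna().astype(int)

print(features[["year_added", "duration_number", "genre_count",
                "description_length", "country_missing"]])
```

In this toy sample, duration_number mixes minutes and seasons in one column, so it is worth thinking about what that number means for each kind of title before relying on it.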
Task 4 - Build a simple model
Train a simple model to predict type (Movie or TV Show).
Guidelines:
- Do not use show_id, title, or type as input features.
- Use a train/test split.
- Include both numeric and categorical features if possible.
- Print a classification report and confusion matrix.
In [ ]:
# TODO: choose feature columns, split the data, build a preprocessing pipeline, train a model, and evaluate it.
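The guidelines above can be sketched as follows, using the classes imported in the setup cell. The table toy, its values, and the chosen feature columns are all hypothetical; the sketch illustrates the pipeline shape only and says nothing about how a model will score on the real catalog.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; in the lab, use columns from features_df instead.
toy = pd.DataFrame({
    "release_year": [2015, 2018, 2020, 2012, 2019, 2021,
                     2016, 2017, 2014, 2020, 2013, 2019],
    "genre_count": [1, 3, 2, 1, 3, 3, 2, 1, 1, 3, 2, 3],
    "rating": ["PG-13", "TV-MA", "TV-14", "R", "TV-MA", np.nan,
               "PG", "R", "PG-13", "TV-14", "R", "TV-MA"],
    "type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show", "TV Show",
             "Movie", "Movie", "Movie", "TV Show", "Movie", "TV Show"],
})

numeric_cols = ["release_year", "genre_count"]
categorical_cols = ["rating"]

X = toy[numeric_cols + categorical_cols]  # show_id, title, and type are excluded
y = toy["type"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Numeric columns: fill gaps with the median, then scale.
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numeric_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     categorical_cols),
])

model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
preds = model.predict(X_test)

labels = ["Movie", "TV Show"]
cm = confusion_matrix(y_test, preds, labels=labels)
print(classification_report(y_test, preds, zero_division=0))
print(cm)
```

Keeping the imputers and encoders inside the pipeline means they are fitted on the training rows only. ConfusionMatrixDisplay, already imported in the setup cell, can draw the same matrix as a chart, for example with ConfusionMatrixDisplay.from_predictions(y_test, preds).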
Task 5 - Open-ended interpretation
Answer in complete sentences:
- What patterns did you find during exploration?
- Which engineered features seemed most useful or least useful? Why?
- How well did the model perform?
- What mistakes did the model make?
- What additional data would make the model more useful?
Your answers:
- TODO
- TODO
- TODO
- TODO
- TODO