Open-Ended Lab - Netflix Titles Catalog

This lab uses netflix_titles.csv. The setup cell loads it by relative path, so keep the file in the same directory as this notebook.

Goals

  • Explore a real catalog-style dataset.
  • Create useful visualizations.
  • Engineer features from dates, text, categories, and durations.
  • Train and evaluate a simple model.
  • Answer open-ended questions using evidence from the data.

Setup

Run this cell first. You should not need to edit it.

In [ ]:
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

DATA_PATH = Path('netflix_titles.csv')
df = pd.read_csv(DATA_PATH)

print('Rows:', len(df))
print('Columns:', len(df.columns))
df.head()

Task 1 - Understand the dataset

Explore the dataset structure. Answer these questions in notes or markdown:

  1. What does one row represent?
  2. Which columns are categorical?
  3. Which columns are numeric or date-like?
  4. Which columns have missing values?
  5. Which columns might be useful for predicting whether a title is a Movie or TV Show?

In [ ]:
# TODO: inspect columns, data types, missing values, and summary statistics.
# Suggested tools: df.info(), df.describe(), df.isna().sum(), df['type'].value_counts()
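
As one possible starting point, the sketch below runs the suggested inspection calls on a tiny hypothetical table (`toy_df`, invented for illustration; its rows and column subset are assumptions, not real catalog data). In the lab, apply the same calls to the real `df` from the setup cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; replace toy_df with the real df in the lab.
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "country": ["United States", np.nan, "India", np.nan],
    "release_year": [2020, 2021, 1993, 2018],
    "rating": ["PG-13", "TV-MA", np.nan, "TV-14"],
})

# Missing values per column, most-missing first.
missing_counts = toy_df.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Separate numeric columns from everything else as a first guess at
# which columns are categorical or text-like.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)

# Share of each class in the target column.
type_share = toy_df["type"].value_counts(normalize=True)
print(type_share)
```

A dtype-based split is only a first guess: a column stored as text may really be a date or a duration, so it is worth reading a few raw values as well.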

Task 2 - Visual exploration

Create at least three visualizations. Suggested ideas:

  • Count of Movies vs TV Shows.
  • Titles by release year.
  • Most common ratings.
  • Most common countries.
  • Most common genres in listed_in.

After each chart, write 1-2 sentences explaining what you notice.

In [ ]:
# TODO: create at least three charts.
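
The genre chart is the least direct of the suggested ideas, because one title can carry several genres. A minimal sketch, assuming `listed_in` holds comma-separated genre names (an assumption to verify against the real file) and using a hypothetical `toy_df` in place of the real data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in rows; replace toy_df with the real df in the lab.
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "listed_in": [
        "Dramas, International Movies",
        "TV Dramas, Crime TV Shows",
        "Dramas, Comedies",
    ],
})

# Split each genre string into a list, give every genre its own row,
# trim stray spaces, then count occurrences.
genre_counts = (
    toy_df["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

# Horizontal bars keep long genre names readable.
fig, ax = plt.subplots(figsize=(6, 3))
genre_counts.head(10).sort_values().plot.barh(ax=ax)
ax.set_xlabel("Number of titles")
ax.set_ylabel("Genre")
ax.set_title("Most common genres (toy data)")
fig.tight_layout()
```

The same split, explode, and count pattern should also work for `country` if that column turns out to list several countries per title.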

Task 3 - Feature engineering

Create a new table named features_df. Add useful features such as:

  • year_added from date_added.
  • duration_number from duration.
  • genre_count from listed_in.
  • description_length from description.
  • country_missing as a flag for missing country.

You may add other features if you think they are useful.

In [ ]:
features_df = df.copy()

# TODO: create engineered features.

# TODO: display your engineered columns.
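
The five suggested features could be derived roughly as follows. This is a sketch on a hypothetical `toy_df`, and it assumes that `date_added` looks like "September 25, 2021", that `duration` starts with a number ("90 min", "2 Seasons"), and that `listed_in` is comma-separated. Check those assumptions against the real columns before reusing the code on `features_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; replace toy_df with the real df in the lab.
toy_df = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2017", np.nan],
    "duration": ["90 min", "2 Seasons", "1 Season"],
    "listed_in": [
        "Dramas, International Movies",
        "TV Dramas",
        "Kids' TV, TV Comedies, Anime Series",
    ],
    "description": ["A short plot.", "Another plot line here.", np.nan],
    "country": ["United States", np.nan, "Japan"],
})

toy_features = toy_df.copy()

# year_added: strip stray spaces, parse the assumed "Month D, YYYY" format,
# and turn anything unparseable or missing into NaT (then NaN for the year).
dates = pd.to_datetime(
    toy_features["date_added"].str.strip(),
    format="%B %d, %Y",
    errors="coerce",
)
toy_features["year_added"] = dates.dt.year

# duration_number: the first run of digits in the duration text.
toy_features["duration_number"] = (
    toy_features["duration"].str.extract(r"(\d+)", expand=False).astype(float)
)

# genre_count: number of comma-separated entries in listed_in.
toy_features["genre_count"] = toy_features["listed_in"].str.split(",").str.len()

# description_length: character count (stays NaN when the description is missing).
toy_features["description_length"] = toy_features["description"].str.len()

# country_missing: 1 when country is missing, else 0.
toy_features["country_missing"] = toy_features["country"].isna().astype(int)

engineered = ["year_added", "duration_number", "genre_count",
              "description_length", "country_missing"]
print(toy_features[engineered])
```

Note that `duration_number` mixes units: 90 here means minutes, while 2 means seasons. Whether that is acceptable depends on how the feature is used later.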

Task 4 - Build a simple model

Train a simple model to predict type (Movie or TV Show).

Guidelines:

  • Do not use show_id, title, or type as input features: type is the target, and show_id and title are identifiers rather than predictive signals.
  • Use a train/test split.
  • Include both numeric and categorical features if possible.
  • Print a classification report and confusion matrix.

In [ ]:
# TODO: choose feature columns, split the data, build a preprocessing pipeline, train a model, and evaluate it.

Task 5 - Open-ended interpretation

Answer in complete sentences:

  1. What patterns did you find during exploration?
  2. Which engineered features seemed most useful or least useful? Why?
  3. How well did the model perform?
  4. What mistakes did the model make?
  5. What additional data would make the model more useful?
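
For question 4, it can help to look at the misclassified rows themselves rather than only the totals. A small sketch, using a hypothetical `results` table with invented titles and predictions (in the lab, build it from your own test labels and model output):

```python
import pandas as pd

# Hypothetical test-set results; titles and predictions are invented.
results = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C", "Title D", "Title E"],
    "actual": ["Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "predicted": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
})

# Rows are the true class, columns are the predicted class.
error_table = pd.crosstab(results["actual"], results["predicted"])
print(error_table)

# Keep only the rows where the model was wrong, for manual inspection.
mistakes = results[results["actual"] != results["predicted"]]
print(mistakes)
```

Reading the mistaken rows next to their feature values is often the quickest way to spot what the model is missing.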

Your answers:

  1. TODO
  2. TODO
  3. TODO
  4. TODO
  5. TODO