Chapter - SQL for Data Science

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program.

About this chapter

Most real data lives in relational databases, not CSV files. SQL (Structured Query Language) is how you ask those databases questions. The maturity model pairs SQL with Python as a Tier 2 capability that grows to fluency at Tier 4 — you will use it to pull and shape the data your models train on.

This chapter covers the core of SQL you will use daily. It complements the SQLite chapter, which focuses on driving a database from Python.

Tables, rows, and columns

A relational database stores data in tables. Each table is a grid: columns are fields (with a type), rows are records. A welds table might look like:

weld_id	line	material	strength_mpa	passed
1	A	steel	540	1
2	A	aluminum	310	0
3	B	steel	505	1

A primary key (here weld_id) uniquely identifies each row.

SELECT: the core query

Every question starts with SELECT. Read it as "select these columns from this table where this condition holds."

sql

SELECT weld_id, strength_mpa
FROM welds
WHERE material = 'steel'
ORDER BY strength_mpa DESC
LIMIT 5;

SELECT — which columns to return (* means all).
FROM — which table.
WHERE — keep only rows that match a condition.
ORDER BY — sort the results (DESC for high-to-low).
LIMIT — return at most N rows.

Common WHERE operators: =, <> (not equal), <, >, BETWEEN, IN (...), LIKE 'ste%' for pattern matches, and IS NULL for missing values.

Aggregation: GROUP BY

To summarize, combine an aggregate function with GROUP BY. Aggregates include COUNT, SUM, AVG, MIN, MAX.

sql

SELECT line, COUNT(*) AS n, AVG(strength_mpa) AS avg_strength
FROM welds
GROUP BY line;

This returns one row per production line with its weld count and average strength. To filter on an aggregate, use HAVING (not WHERE):

sql

SELECT line, AVG(strength_mpa) AS avg_strength
FROM welds
GROUP BY line
HAVING AVG(strength_mpa) < 500;

JOIN: combining tables

Relational data is split across tables to avoid repetition. A JOIN stitches them back together on a shared key. Suppose inspections references each weld by weld_id:

sql

SELECT w.weld_id, w.material, i.inspector, i.result
FROM welds AS w
INNER JOIN inspections AS i
  ON w.weld_id = i.weld_id;

An inner join returns one combined row wherever a weld_id appears in both tables. Rows with no match on the other side are dropped (a LEFT JOIN would keep all welds and leave inspection columns empty where none exists).

INNER JOIN keeps only matched rows; LEFT JOIN keeps every row from the left table even when the right has no match.

The shaded area is what each join keeps. INNER keeps only matching rows; LEFT and RIGHT keep all rows from one side plus matches; FULL keeps everything from both, filling gaps with nulls.

Why this matters for modeling

The dataset you train on is almost always the output of a SQL query: select the columns you need, filter to the right population, join in the labels, and aggregate to the right grain. Getting the query right is half of feature engineering.

Practice Questions

What are tables, rows, columns, and a primary key?
Write a query that returns all columns for welds on line B.
What is the difference between WHERE and HAVING?
Write a query for the average strength_mpa per material.
In plain words, what does a JOIN do?
What is the difference between an INNER JOIN and a LEFT JOIN?
Why does ORDER BY ... DESC plus LIMIT 5 give you the "top 5"?
Why is SQL often the first step of building a training dataset?

What you'll be able to do

Key terms in this chapter