Chapter - SQL for Data Science
Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program.
About this chapter
Most real data lives in relational databases, not CSV files. SQL (Structured Query Language) is how you ask those databases questions. The maturity model pairs SQL with Python as a Tier 2 capability that grows to fluency at Tier 4 — you will use it to pull and shape the data your models train on.
This chapter covers the core of SQL you will use daily. It complements the SQLite chapter, which focuses on driving a database from Python.
Tables, rows, and columns
A relational database stores data in tables. Each table is a grid: columns are fields (with a type), rows are records. A welds table might look like:
| weld_id | line | material | strength_mpa | passed |
|---|---|---|---|---|
| 1 | A | steel | 540 | 1 |
| 2 | A | aluminum | 310 | 0 |
| 3 | B | steel | 505 | 1 |
A primary key (here weld_id) uniquely identifies each row.
SELECT: the core query
Every question starts with SELECT. Read it as "select these columns from this table where this condition holds."
SELECT weld_id, strength_mpa
FROM welds
WHERE material = 'steel'
ORDER BY strength_mpa DESC
LIMIT 5;SELECT— which columns to return (*means all).FROM— which table.WHERE— keep only rows that match a condition.ORDER BY— sort the results (DESCfor high-to-low).LIMIT— return at most N rows.
Common WHERE operators: =, <> (not equal), <, >, BETWEEN, IN (...), LIKE 'ste%' for pattern matches, and IS NULL for missing values.
Aggregation: GROUP BY
To summarize, combine an aggregate function with GROUP BY. Aggregates include COUNT, SUM, AVG, MIN, MAX.
SELECT line, COUNT(*) AS n, AVG(strength_mpa) AS avg_strength
FROM welds
GROUP BY line;This returns one row per production line with its weld count and average strength. To filter on an aggregate, use HAVING (not WHERE):
SELECT line, AVG(strength_mpa) AS avg_strength
FROM welds
GROUP BY line
HAVING AVG(strength_mpa) < 500;JOIN: combining tables
Relational data is split across tables to avoid repetition. A JOIN stitches them back together on a shared key. Suppose inspections references each weld by weld_id:
SELECT w.weld_id, w.material, i.inspector, i.result
FROM welds AS w
INNER JOIN inspections AS i
ON w.weld_id = i.weld_id;INNER JOIN keeps only matched rows; LEFT JOIN keeps every row from the left table even when the right has no match.
Why this matters for modeling
The dataset you train on is almost always the output of a SQL query: select the columns you need, filter to the right population, join in the labels, and aggregate to the right grain. Getting the query right is half of feature engineering.
Practice Questions
Practice Questions
- What are tables, rows, columns, and a primary key?
- Write a query that returns all columns for welds on line
B. - What is the difference between
WHEREandHAVING? - Write a query for the average
strength_mpapermaterial. - In plain words, what does a JOIN do?
- What is the difference between an
INNER JOINand aLEFT JOIN? - Why does
ORDER BY ... DESCplusLIMIT 5give you the "top 5"? - Why is SQL often the first step of building a training dataset?