Engineering Toolkit Tier 2: Data Science Practitioner Module 4: Data Handling Week 5 ⏱ 4 min read

What you'll be able to do

  • Query tables with SELECT, WHERE, ORDER BY, and LIMIT
  • Summarize data with GROUP BY and aggregate functions
  • Combine tables with INNER and LEFT JOIN
Competencies you'll build
  • Write a filtered, sorted SELECT query
  • Aggregate per group with GROUP BY and HAVING
  • Join two tables on a shared key

Key terms in this chapter

Chapter - SQL for Data Science

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program.

About this chapter

Most real data lives in relational databases, not CSV files. SQL (Structured Query Language) is how you ask those databases questions. The maturity model pairs SQL with Python as a Tier 2 capability that grows to fluency at Tier 4 — you will use it to pull and shape the data your models train on.

This chapter covers the core of SQL you will use daily. It complements the SQLite chapter, which focuses on driving a database from Python.

Tables, rows, and columns

A relational database stores data in tables. Each table is a grid: columns are fields (with a type), rows are records. A welds table might look like:

weld_id line material strength_mpa passed
1 A steel 540 1
2 A aluminum 310 0
3 B steel 505 1

A primary key (here weld_id) uniquely identifies each row.

SELECT: the core query

Every question starts with SELECT. Read it as "select these columns from this table where this condition holds."

sql
SELECT weld_id, strength_mpa
FROM welds
WHERE material = 'steel'
ORDER BY strength_mpa DESC
LIMIT 5;
  • SELECT — which columns to return (* means all).
  • FROM — which table.
  • WHERE — keep only rows that match a condition.
  • ORDER BY — sort the results (DESC for high-to-low).
  • LIMIT — return at most N rows.

Common WHERE operators: =, <> (not equal), <, >, BETWEEN, IN (...), LIKE 'ste%' for pattern matches, and IS NULL for missing values.

Aggregation: GROUP BY

To summarize, combine an aggregate function with GROUP BY. Aggregates include COUNT, SUM, AVG, MIN, MAX.

sql
SELECT line, COUNT(*) AS n, AVG(strength_mpa) AS avg_strength
FROM welds
GROUP BY line;

This returns one row per production line with its weld count and average strength. To filter on an aggregate, use HAVING (not WHERE):

sql
SELECT line, AVG(strength_mpa) AS avg_strength
FROM welds
GROUP BY line
HAVING AVG(strength_mpa) < 500;

JOIN: combining tables

Relational data is split across tables to avoid repetition. A JOIN stitches them back together on a shared key. Suppose inspections references each weld by weld_id:

sql
SELECT w.weld_id, w.material, i.inspector, i.result
FROM welds AS w
INNER JOIN inspections AS i
  ON w.weld_id = i.weld_id;
welds weld_id (key) material strength_mpa inspections weld_id (key) inspector result ON w.weld_id = i.weld_id joined rows
An inner join returns one combined row wherever a weld_id appears in both tables. Rows with no match on the other side are dropped (a LEFT JOIN would keep all welds and leave inspection columns empty where none exists).

INNER JOIN keeps only matched rows; LEFT JOIN keeps every row from the left table even when the right has no match.

INNER LEFT RIGHT FULL
The shaded area is what each join keeps. INNER keeps only matching rows; LEFT and RIGHT keep all rows from one side plus matches; FULL keeps everything from both, filling gaps with nulls.

Why this matters for modeling

The dataset you train on is almost always the output of a SQL query: select the columns you need, filter to the right population, join in the labels, and aggregate to the right grain. Getting the query right is half of feature engineering.

Practice Questions

Practice Questions

  1. What are tables, rows, columns, and a primary key?
  2. Write a query that returns all columns for welds on line B.
  3. What is the difference between WHERE and HAVING?
  4. Write a query for the average strength_mpa per material.
  5. In plain words, what does a JOIN do?
  6. What is the difference between an INNER JOIN and a LEFT JOIN?
  7. Why does ORDER BY ... DESC plus LIMIT 5 give you the "top 5"?
  8. Why is SQL often the first step of building a training dataset?

Check your understanding

Tier 2 depth · Applied coding

0 / 5 correct
  1. Which clause keeps only the rows that match a condition in a SELECT query?

  2. You want the average strength per production line. Which combination does this?

  3. You need to filter on an aggregated value — only lines whose average strength is below 500. Which clause?

  4. An INNER JOIN between welds and inspections on weld_id returns:

  5. What does a primary key (like weld_id) guarantee about a table?

Go deeper

  • SQLBolt open access Interactive lessons that teach SQL from the ground up in the browser.
  • SQLZoo open access Practice SQL against real queries, in increasing difficulty.
  • pandas documentation open access The user guide and API for the library behind most data work here.
More in Additional Resources →
← Git and Version Control Testing and Unit Tests →