Chapter - Git and Version Control
Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program.
About this chapter
Version control is how teams track changes to code over time, work on the same project without overwriting each other, and recover when something breaks. The maturity model lists Git as a core Tier 2 capability that deepens through Tiers 3 and 4 — every data scientist on the team is expected to be comfortable with it.
Git is the de facto standard version-control tool in industry. This chapter covers the mental model and the everyday commands.
Why version control
Imagine a folder full of files named analysis_final.py, analysis_final_v2.py, analysis_final_REALLY_final.py. That is version control done badly. Git replaces it with one folder and a complete, labeled history you can move through. It lets you:
- See exactly what changed, when, and who changed it.
- Return to any earlier working state.
- Work on a new idea without disturbing the code that already works.
- Combine your work with a teammate's safely.
The mental model: snapshots
A Git repository (repo) is a project folder plus a hidden history. You work, then take a labeled snapshot called a commit. Each commit points to the one before it, forming a timeline. Three places matter:
- Working directory: your actual files as they are right now.
- Staging area: the changes you have marked to include in the next commit (git add).
- Repository: the committed history (git commit).
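The three places above can be sketched as a toy model in Python. This is only an illustration of the mental model, not how Git is implemented: the names ToyRepo and Commit are hypothetical, and the staging area is simplified to a plain dictionary of marked changes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    message: str
    snapshot: Dict[str, str]            # file name -> contents at commit time
    parent: Optional["Commit"] = None   # the commit before this one


@dataclass
class ToyRepo:
    working: Dict[str, str] = field(default_factory=dict)  # files as they are now
    staged: Dict[str, str] = field(default_factory=dict)   # marked for the next commit
    head: Optional[Commit] = None                          # newest commit in the history

    def add(self, name: str) -> None:
        """Like `git add <file>`: copy the file's current contents into staging."""
        self.staged[name] = self.working[name]

    def commit(self, message: str) -> Commit:
        """Like `git commit -m`: previous snapshot plus whatever was staged."""
        base = dict(self.head.snapshot) if self.head else {}
        base.update(self.staged)
        self.head = Commit(message, base, parent=self.head)
        self.staged = {}
        return self.head

    def log(self) -> List[str]:
        """Like `git log --oneline`: follow parent links, newest first."""
        messages, commit = [], self.head
        while commit is not None:
            messages.append(commit.message)
            commit = commit.parent
        return messages


repo = ToyRepo()
repo.working["analysis.py"] = "print('v1')"
repo.add("analysis.py")
repo.commit("Add first analysis script")

repo.working["analysis.py"] = "print('v2')"  # edited, but never staged
repo.working["notes.txt"] = "todo"
repo.add("notes.txt")
repo.commit("Add notes file")

print(repo.log())                          # ['Add notes file', 'Add first analysis script']
print(repo.head.snapshot["analysis.py"])   # print('v1')
```

The second commit still holds the first version of analysis.py because the edit was never staged. Only staged changes go into a commit.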
edit files    ->  git add   ->  git commit
(working dir)     (staging)     (history)
The everyday commands
git init # start tracking a new project
git clone <url> # copy an existing repo (including its history)
git status # what has changed and what is staged
git add <file> # stage a file for the next commit
git add . # stage everything changed
git commit -m "message" # save a snapshot with a description
git log --oneline # view the history
git push # send your commits to the shared remote
git pull # bring teammates' commits into your copy
A good commit message says why, briefly: "Fix off-by-one in defect crop" beats "update".
Branches: work without breaking things
A branch is a movable label on a line of work. You create one to develop a feature or try an experiment, leaving the main branch — the known-good code — untouched. When the work is ready, you merge it back.
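The "movable label" idea can be illustrated with a small Python sketch. The commit ids (c1, c2, ...) and the helper names are made up for the example, and the merge shown is only the simplest case, where main has not changed since the branch was created, so its label can simply move forward.

```python
# Toy history: commit id -> parent id (None marks the first commit).
parents = {"c1": None, "c2": "c1"}

# A branch is only a name that points at one commit.
branches = {"main": "c2"}
current = "main"


def create_branch(name):
    """Like `git branch <name>`: a new label on the current commit."""
    branches[name] = branches[current]


def commit(commit_id):
    """Record a new commit and move only the current branch's label."""
    parents[commit_id] = branches[current]
    branches[current] = commit_id


def is_ancestor(older, newer):
    """True if `older` is reachable from `newer` by following parent links."""
    while newer is not None:
        if newer == older:
            return True
        newer = parents[newer]
    return False


create_branch("feature-x")
current = "feature-x"        # like `git switch feature-x`
commit("c3")
commit("c4")
print(branches)              # {'main': 'c2', 'feature-x': 'c4'}

current = "main"             # like `git switch main`
if is_ancestor(branches["main"], branches["feature-x"]):
    # main has no new commits of its own, so merging just moves its label.
    branches["main"] = branches["feature-x"]
print(branches)              # {'main': 'c4', 'feature-x': 'c4'}
```

While the work was in progress, main stayed on c2. That is what keeps the known-good code untouched.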
git branch feature-x # create a branch
git checkout feature-x # switch to it (or: git switch feature-x)
git merge feature-x # from main, fold feature-x back in
Remotes, pull requests, and review
A remote is a shared copy of the repo (on GitHub, GitLab, or an internal server). You push your branch to it and open a pull request (PR) — a request to merge your branch, where teammates review the changes before they land. Code review is itself a maturity-model capability; the PR is where it happens.
.gitignore
Some files should never be committed: large datasets, model weights, secrets, virtual environments, __pycache__. List them in a .gitignore file so Git skips them. This keeps the repo small and keeps credentials out of history.
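As a starting point, the categories above might translate into a .gitignore like the one embedded in this Python sketch. The specific entries (data/, *.csv, *.pt, .env, .venv/) are assumptions about a typical Python data-science project, and the helper write_starter_gitignore is hypothetical. Adjust the list to match where your project keeps its data and secrets.

```python
import tempfile
from pathlib import Path

# Illustrative contents; every pattern here is an assumption about the project layout.
STARTER_GITIGNORE = """\
# Large datasets and model weights
data/
*.csv
*.pt

# Secrets
.env

# Virtual environments
.venv/
venv/

# Python caches
__pycache__/
*.pyc
"""


def write_starter_gitignore(project_dir):
    """Create .gitignore in project_dir unless one already exists."""
    path = Path(project_dir) / ".gitignore"
    if not path.exists():
        path.write_text(STARTER_GITIGNORE, encoding="utf-8")
    return path


# Demonstrate in a throwaway directory so no real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = write_starter_gitignore(tmp)
    lines = created.read_text(encoding="utf-8").splitlines()

print(created.name)   # .gitignore
```

A .gitignore only affects files Git is not already tracking. A file that was committed earlier stays in the history even after its pattern is added.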
Merge conflicts
When two people change the same lines, Git cannot decide automatically and reports a merge conflict. It marks the spot:
<<<<<<< HEAD
your version
=======
their version
>>>>>>> feature-x
Edit the file so it contains the correct combined result, remove the markers, then run git add and git commit. Handling conflicts calmly is a Tier 3–4 skill.
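Before committing a resolved file, it can help to confirm that no markers were left behind. The following is a minimal sketch of such a check. The function name find_conflict_markers and the sample file contents are hypothetical, and the check is deliberately simple: a legitimate line consisting only of ======= (for example, a heading underline in some text formats) would be flagged too.

```python
def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that look like conflict markers."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(("<<<<<<< ", ">>>>>>> ")) or line == "=======":
            hits.append((number, line))
    return hits


conflicted = """\
crop = image[y0:y1, x0:x1]
<<<<<<< HEAD
pad = 1
=======
pad = 2
>>>>>>> feature-x
"""

resolved = """\
crop = image[y0:y1, x0:x1]
pad = 2
"""

print(find_conflict_markers(conflicted))
# [(2, '<<<<<<< HEAD'), (4, '======='), (6, '>>>>>>> feature-x')]
print(find_conflict_markers(resolved))
# []
```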
Practice Questions
- In your own words, what problem does version control solve?
- What is the difference between the working directory, the staging area, and the repository?
- What do git add and git commit each do?
- Why work on a branch instead of committing directly to main?
- What is a remote, and what does git push do?
- What belongs in a .gitignore file, and why?
- What causes a merge conflict, and how do you resolve one?
- Write a clear commit message for fixing a bug that cropped weld images one pixel too small.