Chapter - Transformers, RAG Models, and LLMs

Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program and is not derived from Automate the Boring Stuff with Python; it is written in a similar tone for continuity with the other chapters.

About this chapter

So far, you have learned how to prepare data, choose algorithm families, train simple models, and understand CNNs for images. This chapter introduces another major family of modern machine learning systems: Transformers.

Transformers are the foundation of many large language models, or LLMs. They are also used for code, images, audio, time series, search, translation, summarization, and question answering.

This chapter focuses on four core Transformer ideas:

Tokenization and embeddings.
Positional encoding.
Self-attention.
Encoder-decoder architectures.

Then the chapter explains Retrieval Augmented Generation, or RAG, which combines a language model with a search system so the model can use external documents while answering.

By the end of this chapter, you should be able to:

Explain how Transformers process sequences.
Describe tokenization, embeddings, positional encoding, and self-attention.
Compare encoder-only, decoder-only, and encoder-decoder architectures.
Explain why modern LLMs are built using Transformers.
Explain what gap RAG fills.
Describe how to build, train, evaluate, and run inference with a RAG system.

Required concept checklist

This chapter is organized around the following required ideas:

Required concept	Where it appears in this chapter	What you should be able to explain
Tokenization and embeddings	`Tokenization`, `Token IDs`, `Embeddings`	How raw text becomes token IDs, then learned numeric vectors
Positional encoding	`Positional encoding`, `Types of positional information`	Why attention needs position information and how order is represented
Self-attention mechanism	`Self-attention`, `Query, key, and value`, `Scaled dot-product attention`, `Multi-head attention`	How tokens compare to each other and combine information from context
Encoder-decoder architectures	`Encoder`, `Decoder`, `Encoder-decoder models`	How encoder-only, decoder-only, and encoder-decoder Transformers differ
LLM construction	`How LLMs are built using Transformers`, `Typical LLM development stages`	Why modern LLMs are large Transformer stacks trained with next-token prediction and later tuned
RAG gap	`The gap RAG fills`, `RAG vs. fine-tuning`	Why retrieval is useful when knowledge is private, current, long, or source-grounded
RAG training and inference	`Training a RAG system`, `RAG training workflow`, `RAG inference workflow`	How to build/tune the retrieval pipeline and how a user question becomes a grounded answer

Why Transformers matter

Before Transformers, many sequence models processed text one step at a time.

For example, a model might read:

text

The weld inspection report found porosity near the edge.

from left to right, word by word.

That can work, but it creates challenges:

Long documents are hard to remember.
Training can be slow because steps depend on previous steps.
Important words far apart may be hard to connect.

Transformers changed this by using attention. Attention lets the model compare tokens in a sequence directly and decide which tokens matter to each other.

This makes Transformers useful for:

Summarizing reports.
Translating text.
Answering questions.
Writing code.
Searching documents.
Extracting information from procedures or inspection records.
Building modern LLMs.

What is a sequence?

A sequence is an ordered list.

In language, a sequence might be words or pieces of words:

text

["weld", "inspection", "found", "porosity"]

In code, a sequence might be tokens such as:

text

["for", "item", "in", "items", ":"]

In time series, a sequence might be sensor readings:

text

[22.1, 22.4, 22.6, 23.0]

Transformers are sequence models. They are especially famous for language, but the same ideas can apply to other ordered data.

Tokenization

Computers do not understand raw text directly. The text must first be converted into smaller pieces called tokens.

Tokenization is the process of splitting text into tokens.

For example:

text

"The weld has porosity."

might become:

text

["The", "weld", "has", "porosity", "."]

Many modern LLMs use subword tokenization. That means a rare word may be split into pieces.

For example:

text

"microcracking"

might become:

text

["micro", "crack", "ing"]

The exact tokens depend on the tokenizer.

Why subword tokenization is useful

Subword tokenization helps with unfamiliar words.

Suppose a model has never seen the exact word:

text

"ultrasonicity"

If it can split the word into smaller pieces, it may still recognize useful parts.

Subword tokenization helps models handle:

Rare words.
Technical vocabulary.
Misspellings.
Names.
Different word endings.
New combinations of familiar pieces.

Tokenization also affects model cost and limits. A model's context window is measured in tokens, not words.

Token IDs

After text is split into tokens, each token is mapped to a number called a token ID.

For example:

Token	Token ID
`The`	`104`
`weld`	`8921`
`has`	`713`
`porosity`	`22984`
`.`	`13`

The model does not use the token strings directly. It uses token IDs to look up learned numeric vectors called embeddings.

Embeddings

An embedding is a list of numbers that represents a token.

For example:

text

token: "weld"
embedding: [0.12, -0.44, 0.08, 0.91, ...]

The full embedding may have hundreds or thousands of numbers.

Embeddings are learned during training. Tokens used in similar contexts tend to get embeddings that are closer together.

For example, a language model may learn that these words often appear in related contexts:

weld
joint
bead
porosity
crack

Embeddings turn discrete tokens into continuous numeric values that neural networks can process.

Embedding intuition

An embedding is not a dictionary definition. It is a mathematical representation learned from data.

Think of an embedding as a location in a high-dimensional map.

Words with related use may be near each other:

text

"porosity" near "void"
"crack" near "fracture"
"inspection" near "review"

Pictured as a 2D map (real embeddings have hundreds of dimensions), related terms cluster together while unrelated ones sit apart:

A 2D scatter of word tokens where weld/joint/bead cluster together, porosity/void cluster together, crack/fracture cluster together, and inspection/review cluster together. — A simplified 2D embedding space: tokens used in similar contexts (like "porosity" and "void") land near each other.

Show the code that generated this plot

python

import matplotlib.pyplot as plt

# Hand-placed 2D positions to illustrate the idea; real embeddings are learned
points = {
    'weld': (0.5, 0.6), 'joint': (0.7, 0.7), 'bead': (0.6, 0.45),
    'porosity': (2.4, 2.3), 'void': (2.6, 2.1),
    'crack': (2.3, 0.5), 'fracture': (2.5, 0.35),
    'inspection': (0.6, 2.5), 'review': (0.8, 2.6),
}
groups = {
    'weld terms': ['weld', 'joint', 'bead'],
    'porosity': ['porosity', 'void'],
    'crack': ['crack', 'fracture'],
    'process': ['inspection', 'review'],
}
for label, words in groups.items():
    plt.scatter([points[w][0] for w in words], [points[w][1] for w in words],
                s=90, label=label, edgecolor='white')
    for w in words:
        plt.annotate(w, points[w], textcoords='offset points', xytext=(6, 6))
plt.xlabel('dimension 1')
plt.ylabel('dimension 2')
plt.legend(fontsize=8)
plt.show()

But embeddings can also reflect biases or patterns in the training data. They should not be treated as objective truth.

Positional encoding

Self-attention compares tokens to each other, but by itself it does not know their order.

The sentences below use the same words but mean different things:

text

The inspector approved the procedure.
The procedure approved the inspector.

Order matters.

Positional encoding gives the model information about where each token appears in the sequence.

For example:

Position	Token
0	`The`
1	`inspector`
2	`approved`
3	`the`
4	`procedure`

The model combines token embeddings with position information so it can understand both meaning and order.

The original Transformer builds that position information from sine and cosine waves of different frequencies — one row per position, one column per dimension:

A heatmap with position in sequence on the vertical axis and embedding dimension on the horizontal axis, showing smooth sinusoidal stripes that vary quickly in the low dimensions and slowly in the high dimensions. — Sinusoidal positional encoding. Each position gets a unique pattern across dimensions, so the model can tell positions apart.

Show the code that generated this plot

python

import numpy as np
import matplotlib.pyplot as plt

positions, d_model = 40, 64
pe = np.zeros((positions, d_model))
pos = np.arange(positions)[:, None]
div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
pe[:, 0::2] = np.sin(pos * div)
pe[:, 1::2] = np.cos(pos * div)

plt.imshow(pe, cmap='RdBu', aspect='auto')
plt.colorbar()
plt.xlabel('embedding dimension')
plt.ylabel('position in sequence')
plt.title('Sinusoidal positional encoding')
plt.show()

Types of positional information

There are several ways to represent position.

Common approaches include:

Approach	Basic idea
Fixed sinusoidal encodings	Use mathematical wave patterns based on position
Learned positional embeddings	Learn a vector for each position during training
Relative position methods	Represent distance between tokens
Rotary position embeddings	Rotate embedding dimensions based on position

You do not need to memorize every method. The important idea is this:

Transformers need a way to know token order.

Without position information, a Transformer would struggle to distinguish sentences that use the same words in different orders.

Self-attention

Self-attention is the core idea behind Transformers.

Self-attention lets each token look at other tokens in the same sequence and decide which ones are relevant.

For example:

text

The weld failed because it had a crack.

The word it should pay attention to weld.

The word crack may pay attention to failed, weld, and had.

Self-attention helps the model connect related words even when they are far apart.

Attention as weighted focus

Imagine reading this sentence:

text

The procedure was rejected because it missed required inspection steps.

To understand it, you focus on procedure, not inspection.

Self-attention does something similar with numbers. For each token, the model assigns attention weights to other tokens.

Example:

Current token	Other token	Attention weight
`it`	`procedure`	`0.70`
`it`	`inspection`	`0.10`
`it`	`steps`	`0.05`
`it`	other tokens	`0.15`

The weights add up to about 1. Higher weight means the token is more important for that part of the calculation.

Query, key, and value

Transformer attention uses three learned representations:

Query.
Key.
Value.

Self-attention compares the query against every key to get match scores, softmax turns those into weights that sum to 1, and the output is the weighted sum of the values — so each token pulls most from the tokens it most relates to.

A simple way to think about them:

Term	Question it answers
Query	What am I looking for?
Key	What information do I contain?
Value	What information should I pass along?

For each token:

The model creates a query vector.
The model compares that query to key vectors from other tokens.
Similar query-key pairs receive higher attention weights.
The model combines value vectors using those weights.

This allows each token to build a richer representation using context from the sequence.

Scaled dot-product attention

The common attention calculation is called scaled dot-product attention.

In simplified form:

text

attention_scores = query dot keys
scaled_scores = attention_scores / sqrt(key_dimension)
attention_weights = softmax(scaled_scores)
output = attention_weights * values

The dot product measures similarity. Softmax converts scores into weights that add to 1.

The scaling step helps keep numbers stable when vectors have many dimensions.

Multi-head attention

A Transformer does not use only one attention calculation.

It uses multi-head attention.

Each attention head can learn a different kind of relationship.

For example, one head might focus on:

Subject-verb relationships.

Another might focus on:

Technical terms and definitions.

Another might focus on:

Nearby words.

Another might focus on:

Long-range references.

The outputs from multiple heads are combined, giving the model several views of the same sequence.

Transformer blocks

A Transformer is built from repeated Transformer blocks.

A simplified block includes:

text

input
-> self-attention
-> feed-forward neural network
-> output

Modern blocks also include details such as:

Layer normalization.
Residual connections.
Dropout.
Multi-head attention.

A stack of many Transformer blocks lets the model build increasingly rich representations.

Early layers may learn simple patterns. Later layers may learn more abstract relationships.

Encoder-decoder architectures

The original Transformer architecture used an encoder and a decoder.

The encoder reads the input sequence and builds representations.

The decoder generates the output sequence.

For translation:

text

input:  "The weld passed inspection."
output: "La soldadura pasó la inspección."

The encoder processes the English sentence. The decoder generates the Spanish sentence.

Encoder

An encoder reads the full input sequence and produces contextual embeddings.

Encoder-style models are useful when the model needs to understand an input.

Common uses include:

Text classification.
Search embeddings.
Document similarity.
Named entity recognition.
Extractive question answering.

Example model family:

text

BERT-style models

BERT-style models are often called encoder-only Transformers.

Decoder

A decoder generates output tokens, usually one token at a time.

Decoder-style models are useful when the model needs to produce text.

Common uses include:

Chat.
Writing assistance.
Code generation.
Summarization.
Open-ended question answering.

Example model family:

text

GPT-style models

GPT-style models are often called decoder-only Transformers.

When generating text, a decoder is usually prevented from looking at future tokens. This is called causal masking.

For example, when predicting the next token after:

text

The inspection report

the model should not be allowed to look at the answer token in advance.

Encoder-decoder models

An encoder-decoder model uses both parts.

The encoder understands the input. The decoder generates the output.

Common uses include:

Translation.
Summarization.
Structured transformation of text.
Some question answering systems.

Example model families:

text

T5-style models
BART-style models
Original Transformer translation models

Encoder-decoder models are useful when the task naturally has a separate input and output sequence.

Comparing Transformer architecture types

Architecture type	Main purpose	Common examples	Typical use
Encoder-only	Understand input	BERT-style models	Classification, embeddings, search
Decoder-only	Generate output	GPT-style models	Chat, generation, code completion
Encoder-decoder	Transform input into output	T5, BART	Translation, summarization

Modern LLM chat systems are often based on decoder-only Transformers, though full systems may include many supporting components.

How LLMs are built using Transformers

Modern large language models are usually very large Transformer models trained on huge amounts of text and code.

The basic training objective for many decoder-only LLMs is next-token prediction.

Given:

text

The weld inspection found

the model learns to predict likely next tokens:

text

porosity
cracking
no
...

During training, the model sees many examples and adjusts its weights to make better next-token predictions.

Why next-token prediction works

At first, next-token prediction may sound simple.

But to predict the next token well, the model must learn many patterns:

Grammar.
Facts.
Style.
Code syntax.
Reasoning patterns.
Domain language.
Relationships between ideas.

The model learns these patterns statistically from training data. It does not store knowledge in the same way a database does.

Typical LLM development stages

A simplified LLM development process may include:

Pretraining. Train a large Transformer on broad text and code data using next-token prediction.
Instruction tuning. Train the model to follow instructions using prompt-response examples.
Preference tuning or alignment. Improve helpfulness and reduce harmful behavior using human or model feedback.
Evaluation. Test the model on benchmarks and realistic tasks.
Deployment. Serve the model through an application, API, or internal tool.
Monitoring. Track quality, safety, latency, and cost.

Many organizations do not train LLMs from scratch. They use existing foundation models and adapt them through prompting, fine-tuning, tools, retrieval, or workflow design.

What LLMs are good at

LLMs can be useful for:

Drafting text.
Summarizing documents.
Explaining code.
Generating code.
Extracting information.
Rewriting content for different audiences.
Conversational interfaces.
Question answering over provided context.

LLMs also have limitations:

They can hallucinate.
They may be outdated.
They may not know private company documents.
They may produce plausible but wrong answers.
They can be sensitive to prompt wording.
They can reflect biases in training data.

These limitations are one reason RAG systems are useful.

The gap RAG fills

Retrieval Augmented Generation, or RAG, combines search with generation.

The gap RAG fills is this:

A language model may not know the specific, current, or private information needed to answer a question.

For example, a general LLM may not know:

A company's internal welding procedure.
The latest inspection policy.
A new training manual.
A specific dataset dictionary.
A maintenance log from last week.

RAG lets the system retrieve relevant documents and provide them to the model as context.

RAG in one sentence

RAG works like this:

text

user question -> retrieve relevant documents -> give documents to LLM -> generate answer with citations or context

Instead of asking the model to rely only on memory, the system gives it relevant information at inference time.

RAG vs. fine-tuning

RAG and fine-tuning solve different problems.

Need	Better starting point
Answer using current documents	RAG
Answer using private documents	RAG
Change model writing style	Fine-tuning or prompting
Teach a repeated response format	Fine-tuning or prompting
Add frequently changing knowledge	RAG
Improve behavior on a narrow task	Fine-tuning

Fine-tuning changes model weights. RAG changes what context the model receives.

If information changes often, RAG is usually easier to update. You update the document index instead of retraining the model.

RAG system components

A basic RAG system has several parts:

Document collection. The source documents, manuals, reports, tables, or records.
Chunking. Split documents into smaller pieces.
Embedding model. Convert chunks into numeric vectors.
Vector store or search index. Store and search chunk embeddings.
Retriever. Find relevant chunks for a user question.
Prompt builder. Put the question and retrieved context into a prompt.
Generator LLM. Produce the final answer.
Evaluator. Check answer quality, retrieval quality, and safety.

Each part affects the final answer.

Chunking documents

Chunking means splitting documents into pieces small enough to retrieve and pass to the model.

For example, a 50-page manual might be split into chunks:

text

chunk 1: title, section 1.1
chunk 2: section 1.2
chunk 3: section 1.3
...

Chunks should usually be:

Large enough to contain useful context.
Small enough to fit in the model prompt.
Split at natural boundaries when possible.
Stored with useful metadata, such as title, section, date, and source.

Bad chunking can make retrieval worse. If a chunk is too small, it may lack context. If it is too large, it may include unrelated information.

Embedding documents for retrieval

RAG often uses embeddings for search.

The system converts each chunk into an embedding:

text

chunk text -> embedding model -> vector

It also converts the user's question into an embedding:

text

question -> embedding model -> query vector

Then it searches for chunks with vectors close to the query vector.

This is called semantic search because it can find related meaning, not just exact keyword matches.

Retrieval

The retriever selects chunks likely to help answer the question.

For example:

text

Question: What inspection steps are required before final weld acceptance?
Retrieved chunks:
1. Procedure section on visual inspection.
2. Procedure section on nondestructive testing.
3. Acceptance criteria table.

The generator LLM then receives these chunks as context.

Retrieval quality is crucial. If the retriever finds the wrong chunks, the LLM may generate a poor answer even if the LLM is strong.

Generation with retrieved context

After retrieval, the system builds a prompt.

Example:

text

Use the context below to answer the question.
If the answer is not in the context, say you do not know.

Context:
[retrieved chunk 1]
[retrieved chunk 2]
[retrieved chunk 3]

Question:
What inspection steps are required before final weld acceptance?

The LLM uses the provided context to generate an answer.

A good RAG prompt often tells the model:

Use only the provided context when needed.
Cite sources or section names.
Say when the answer is not available.
Avoid guessing.
Be concise and clear.

Training a RAG system

People sometimes say "train a RAG model," but many RAG projects do not train the LLM itself.

Instead, "training a RAG system" often means building and tuning the retrieval and generation pipeline.

This may include:

Preparing documents.
Chunking documents.
Choosing an embedding model.
Building a vector index.
Creating evaluation questions.
Testing retrieval quality.
Tuning chunk size and overlap.
Tuning number of retrieved chunks.
Improving prompts.
Optionally fine-tuning the embedding model, reranker, or generator.

The key idea:

A RAG system can improve without changing the LLM weights.

RAG training workflow

A practical RAG development workflow:

Collect documents. Gather manuals, procedures, FAQs, reports, or other trusted sources.
Clean documents. Remove duplicates, fix encoding issues, and preserve headings or tables.
Chunk documents. Split into useful passages.
Add metadata. Store source, section, date, version, and access permissions.
Embed chunks. Use an embedding model to convert chunks into vectors.
Build the index. Store vectors in a vector database or search engine.
Create test questions. Include easy, hard, ambiguous, and unanswerable questions.
Evaluate retrieval. Check whether the right chunks are found.
Evaluate generation. Check whether answers are correct, grounded, and clear.
Tune the system. Adjust chunking, retrieval, reranking, prompts, or model choice.
Monitor after deployment. Track failures, stale documents, latency, and user feedback.

This is similar to training in spirit because the system is improved using evidence, but not always by updating neural network weights.

Optional fine-tuning in RAG

Some advanced RAG systems fine-tune one or more components.

Possible fine-tuning targets:

Component	Why fine-tune it
Embedding model	Improve retrieval for domain-specific language
Reranker	Better order retrieved chunks
Generator LLM	Improve answer style or task behavior

Fine-tuning may help when:

The domain vocabulary is very specialized.
Off-the-shelf retrieval misses important documents.
The answer format must be very consistent.
There are enough high-quality training examples.

Fine-tuning has costs:

More engineering effort.
More evaluation requirements.
Risk of overfitting.
Need to retrain or update when data changes.

For many projects, start with ordinary RAG before fine-tuning.

RAG inference workflow

Inference means using the system to answer a new question.

A RAG inference workflow:

Receive the user question.
Convert the question to an embedding.
Retrieve relevant chunks from the index.
Optionally rerank chunks.
Build a prompt containing the question and context.
Send the prompt to the LLM.
Generate an answer.
Return the answer with citations or source references when possible.

In simplified pseudocode:

text

question = user_input
query_vector = embed(question)
chunks = vector_search(query_vector, top_k=5)
prompt = build_prompt(question, chunks)
answer = llm.generate(prompt)
return answer

RAG inference example

Suppose the user asks:

text

What should I check before accepting a weld inspection image dataset?

The retriever might find chunks about:

Class counts.
Missing labels.
Image resolution.
Pixel value distributions.
Example inspection.

The prompt gives these chunks to the LLM. The model can then answer using the course materials instead of relying only on general training data.

Evaluating RAG systems

RAG systems should be evaluated at two levels:

Retrieval quality. Did the system find the right context?
Answer quality. Did the model answer correctly using that context?

Useful retrieval checks:

Are relevant chunks in the top results?
Are chunks too broad or too narrow?
Are important documents missing from the index?
Are stale or unauthorized documents retrieved?

Useful answer checks:

Is the answer grounded in the retrieved context?
Does the answer cite the right source?
Does the model say "I don't know" when context is missing?
Is the answer clear and useful?
Does it avoid unsupported claims?

Common RAG failure modes

Here are a few traps to avoid:

Poor chunking. The answer is split across chunks or buried in large chunks.
Weak retrieval. The retriever finds related but not actually useful passages.
Too much context. The prompt includes many chunks, making the answer noisy.
Missing documents. The correct source was never indexed.
Stale documents. The system retrieves old procedures.
No source control. Users cannot tell where the answer came from.
Hallucination. The LLM answers beyond the retrieved evidence.
Permission problems. Users retrieve documents they should not see.

RAG is powerful, but it does not remove the need for careful data management and evaluation.

RAG vs. ordinary search

Ordinary search returns documents or passages.

RAG returns an answer generated from retrieved documents.

System	Output
Keyword search	List of matching documents
Semantic search	List of meaning-related chunks
RAG	Generated answer using retrieved chunks

RAG can save users time, but it also creates risk if the generated answer is not faithful to the sources.

When to use RAG

RAG is a good fit when:

The answer depends on private documents.
The information changes over time.
Users need answers grounded in source material.
The model needs access to long documents.
Citations or source references matter.
You want to avoid retraining for every document update.

RAG may not be the best fit when:

The task does not require external knowledge.
The source documents are poor quality.
The correct answer requires complex calculations better handled by tools.
The system cannot safely manage document permissions.

Putting it together

A modern LLM application may combine several pieces:

text

user interface
-> prompt builder
-> retriever
-> vector database
-> LLM
-> response evaluator or guardrails
-> final answer

The LLM is only one part of the system.

For reliable applications, the surrounding system matters:

Data quality.
Document permissions.
Retrieval quality.
Prompt design.
Evaluation.
Monitoring.
User feedback.

Common mistakes

Here are a few traps to avoid:

Thinking tokens are the same as words. Tokens may be words, word pieces, punctuation, or other units.
Ignoring positional information. Transformers need a way to represent order.
Treating attention as a perfect explanation. Attention weights can be informative but are not always a complete explanation of model behavior.
Using RAG as a substitute for clean documents. Bad source material leads to bad answers.
Assuming RAG eliminates hallucination. RAG reduces hallucination risk when retrieval and prompts are good, but it does not guarantee correctness.
Fine-tuning when retrieval would solve the problem. If knowledge changes often, RAG may be easier and safer.
Skipping evaluation. RAG systems need retrieval tests and answer-quality tests.
Ignoring access control. Retrieval must respect document permissions.

Summary

Transformers are neural network architectures designed for sequence data. They use tokenization to split text into tokens, embeddings to represent tokens as vectors, positional encoding to preserve order, and self-attention to connect related tokens across a sequence.

Encoder-only Transformers are often used for understanding tasks such as classification and embeddings. Decoder-only Transformers are often used for generation and are the foundation of many modern LLMs. Encoder-decoder Transformers are useful when an input sequence is transformed into an output sequence, such as translation or summarization.

Modern LLMs are built from large stacks of Transformer blocks and are often trained with next-token prediction, followed by instruction tuning and alignment. They are powerful but can hallucinate, be outdated, or lack access to private documents.

RAG fills the gap between a general language model and specific source material. It retrieves relevant documents at inference time and gives them to the LLM as context. Building a RAG system usually means preparing documents, chunking, embedding, indexing, retrieving, prompting, evaluating, and tuning the pipeline.

Topic	Key ideas
Tokenization	Splits text into tokens or subword pieces
Embeddings	Numeric vectors representing tokens or chunks
Positional encoding	Gives the model information about token order
Self-attention	Lets tokens focus on relevant tokens in the same sequence
Multi-head attention	Learns several relationship patterns at once
Encoder	Builds contextual representations of input
Decoder	Generates output tokens
Encoder-decoder	Reads input and generates output
LLM	Large Transformer-based model trained on text and code
RAG	Retrieves documents and uses them as context for generation
RAG training	Builds and tunes retrieval, prompts, evaluation, and optional fine-tuning
RAG inference	Retrieves relevant chunks and generates an answer from context

Practice Questions

In your own words, what is a Transformer?
What is tokenization?
Why do modern LLMs often use subword tokenization?
What is a token ID?
What is an embedding?
Why are embeddings useful?
Why do Transformers need positional encoding?
What problem does self-attention solve?
In attention, what are queries, keys, and values?
What does softmax do in an attention calculation?
What is multi-head attention?
What is a Transformer block?
What is the difference between an encoder and a decoder?
What is an encoder-only model commonly used for?
What is a decoder-only model commonly used for?
What is an encoder-decoder model commonly used for?
Why are many modern LLMs based on decoder-only Transformers?
What is next-token prediction?
Why can next-token prediction teach useful language patterns?
Name two limitations of LLMs.
What gap does RAG fill?
How is RAG different from fine-tuning?
What is chunking in a RAG system?
Why is chunk size important?
What does an embedding model do in RAG?
What does the retriever do?
What happens during RAG inference?
What are two ways to evaluate a RAG system?
Name three common RAG failure modes.
Create a short workflow for building a RAG system over a folder of internal procedure documents.

Query	The	weld	bead	has	porosity
The	0.20	0.20	0.20	0.20	0.20
weld	0.16	0.23	0.23	0.16	0.23
bead	0.16	0.23	0.22	0.16	0.23
has	0.19	0.20	0.20	0.20	0.20
porosity	0.15	0.21	0.21	0.15	0.28

What you'll be able to do

Key terms in this chapter

Chapter - Transformers, RAG Models, and LLMs

About this chapter

Required concept checklist

Why Transformers matter

What is a sequence?

Tokenization

Why subword tokenization is useful

Token IDs

Embeddings

Embedding intuition

Positional encoding

Types of positional information

Self-attention

Attention as weighted focus

Query, key, and value

Scaled dot-product attention

Multi-head attention

Transformer blocks

Encoder-decoder architectures

Encoder

Decoder

Encoder-decoder models

Comparing Transformer architecture types

How LLMs are built using Transformers

Why next-token prediction works

Typical LLM development stages

What LLMs are good at

The gap RAG fills

RAG in one sentence

RAG vs. fine-tuning

RAG system components

Chunking documents

Embedding documents for retrieval

Retrieval

Generation with retrieved context

Training a RAG system

RAG training workflow

Optional fine-tuning in RAG

RAG inference workflow

RAG inference example

Evaluating RAG systems

Common RAG failure modes

RAG vs. ordinary search

When to use RAG

Putting it together

Common mistakes

Summary

Practice Questions

Try it: self-attention weights

Check your understanding

Go deeper