Chapter - Transformers, RAG Models, and LLMs
Supplementary chapter prepared for the BWXT Data Science Workforce Training Pilot. This material is original to the program and is not derived from Automate the Boring Stuff with Python; it is written in a similar tone for continuity with the other chapters.
About this chapter
So far, you have learned how to prepare data, choose algorithm families, train simple models, and understand CNNs for images. This chapter introduces another major family of modern machine learning systems: Transformers.
Transformers are the foundation of many large language models, or LLMs. They are also used for code, images, audio, time series, search, translation, summarization, and question answering.
This chapter focuses on four core Transformer ideas:
- Tokenization and embeddings.
- Positional encoding.
- Self-attention.
- Encoder-decoder architectures.
Then the chapter explains Retrieval Augmented Generation, or RAG, which combines a language model with a search system so the model can use external documents while answering.
By the end of this chapter, you should be able to:
- Explain how Transformers process sequences.
- Describe tokenization, embeddings, positional encoding, and self-attention.
- Compare encoder-only, decoder-only, and encoder-decoder architectures.
- Explain why modern LLMs are built using Transformers.
- Explain what gap RAG fills.
- Describe how to build, train, evaluate, and run inference with a RAG system.
Required concept checklist
This chapter is organized around the following required ideas:
| Required concept | Where it appears in this chapter | What you should be able to explain |
|---|---|---|
| Tokenization and embeddings | Tokenization, Token IDs, Embeddings |
How raw text becomes token IDs, then learned numeric vectors |
| Positional encoding | Positional encoding, Types of positional information |
Why attention needs position information and how order is represented |
| Self-attention mechanism | Self-attention, Query, key, and value, Scaled dot-product attention, Multi-head attention |
How tokens compare to each other and combine information from context |
| Encoder-decoder architectures | Encoder, Decoder, Encoder-decoder models |
How encoder-only, decoder-only, and encoder-decoder Transformers differ |
| LLM construction | How LLMs are built using Transformers, Typical LLM development stages |
Why modern LLMs are large Transformer stacks trained with next-token prediction and later tuned |
| RAG gap | The gap RAG fills, RAG vs. fine-tuning |
Why retrieval is useful when knowledge is private, current, long, or source-grounded |
| RAG training and inference | Training a RAG system, RAG training workflow, RAG inference workflow |
How to build/tune the retrieval pipeline and how a user question becomes a grounded answer |
Why Transformers matter
Before Transformers, many sequence models processed text one step at a time.
For example, a model might read:
The weld inspection report found porosity near the edge.from left to right, word by word.
That can work, but it creates challenges:
- Long documents are hard to remember.
- Training can be slow because steps depend on previous steps.
- Important words far apart may be hard to connect.
Transformers changed this by using attention. Attention lets the model compare tokens in a sequence directly and decide which tokens matter to each other.
This makes Transformers useful for:
- Summarizing reports.
- Translating text.
- Answering questions.
- Writing code.
- Searching documents.
- Extracting information from procedures or inspection records.
- Building modern LLMs.
What is a sequence?
A sequence is an ordered list.
In language, a sequence might be words or pieces of words:
["weld", "inspection", "found", "porosity"]In code, a sequence might be tokens such as:
["for", "item", "in", "items", ":"]In time series, a sequence might be sensor readings:
[22.1, 22.4, 22.6, 23.0]Transformers are sequence models. They are especially famous for language, but the same ideas can apply to other ordered data.
Tokenization
Computers do not understand raw text directly. The text must first be converted into smaller pieces called tokens.
Tokenization is the process of splitting text into tokens.
For example:
"The weld has porosity."might become:
["The", "weld", "has", "porosity", "."]Many modern LLMs use subword tokenization. That means a rare word may be split into pieces.
For example:
"microcracking"might become:
["micro", "crack", "ing"]The exact tokens depend on the tokenizer.
Why subword tokenization is useful
Subword tokenization helps with unfamiliar words.
Suppose a model has never seen the exact word:
"ultrasonicity"If it can split the word into smaller pieces, it may still recognize useful parts.
Subword tokenization helps models handle:
- Rare words.
- Technical vocabulary.
- Misspellings.
- Names.
- Different word endings.
- New combinations of familiar pieces.
Tokenization also affects model cost and limits. A model's context window is measured in tokens, not words.
Token IDs
After text is split into tokens, each token is mapped to a number called a token ID.
For example:
| Token | Token ID |
|---|---|
The |
104 |
weld |
8921 |
has |
713 |
porosity |
22984 |
. |
13 |
The model does not use the token strings directly. It uses token IDs to look up learned numeric vectors called embeddings.
Embeddings
An embedding is a list of numbers that represents a token.
For example:
token: "weld"
embedding: [0.12, -0.44, 0.08, 0.91, ...]The full embedding may have hundreds or thousands of numbers.
Embeddings are learned during training. Tokens used in similar contexts tend to get embeddings that are closer together.
For example, a language model may learn that these words often appear in related contexts:
weldjointbeadporositycrack
Embeddings turn discrete tokens into continuous numeric values that neural networks can process.
Embedding intuition
An embedding is not a dictionary definition. It is a mathematical representation learned from data.
Think of an embedding as a location in a high-dimensional map.
Words with related use may be near each other:
"porosity" near "void"
"crack" near "fracture"
"inspection" near "review"Pictured as a 2D map (real embeddings have hundreds of dimensions), related terms cluster together while unrelated ones sit apart:
Show the code that generated this plot
import matplotlib.pyplot as plt
# Hand-placed 2D positions to illustrate the idea; real embeddings are learned
points = {
'weld': (0.5, 0.6), 'joint': (0.7, 0.7), 'bead': (0.6, 0.45),
'porosity': (2.4, 2.3), 'void': (2.6, 2.1),
'crack': (2.3, 0.5), 'fracture': (2.5, 0.35),
'inspection': (0.6, 2.5), 'review': (0.8, 2.6),
}
groups = {
'weld terms': ['weld', 'joint', 'bead'],
'porosity': ['porosity', 'void'],
'crack': ['crack', 'fracture'],
'process': ['inspection', 'review'],
}
for label, words in groups.items():
plt.scatter([points[w][0] for w in words], [points[w][1] for w in words],
s=90, label=label, edgecolor='white')
for w in words:
plt.annotate(w, points[w], textcoords='offset points', xytext=(6, 6))
plt.xlabel('dimension 1')
plt.ylabel('dimension 2')
plt.legend(fontsize=8)
plt.show()But embeddings can also reflect biases or patterns in the training data. They should not be treated as objective truth.
Positional encoding
Self-attention compares tokens to each other, but by itself it does not know their order.
The sentences below use the same words but mean different things:
The inspector approved the procedure.
The procedure approved the inspector.Order matters.
Positional encoding gives the model information about where each token appears in the sequence.
For example:
| Position | Token |
|---|---|
| 0 | The |
| 1 | inspector |
| 2 | approved |
| 3 | the |
| 4 | procedure |
The model combines token embeddings with position information so it can understand both meaning and order.
The original Transformer builds that position information from sine and cosine waves of different frequencies — one row per position, one column per dimension:
Show the code that generated this plot
import numpy as np
import matplotlib.pyplot as plt
positions, d_model = 40, 64
pe = np.zeros((positions, d_model))
pos = np.arange(positions)[:, None]
div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
pe[:, 0::2] = np.sin(pos * div)
pe[:, 1::2] = np.cos(pos * div)
plt.imshow(pe, cmap='RdBu', aspect='auto')
plt.colorbar()
plt.xlabel('embedding dimension')
plt.ylabel('position in sequence')
plt.title('Sinusoidal positional encoding')
plt.show()Types of positional information
There are several ways to represent position.
Common approaches include:
| Approach | Basic idea |
|---|---|
| Fixed sinusoidal encodings | Use mathematical wave patterns based on position |
| Learned positional embeddings | Learn a vector for each position during training |
| Relative position methods | Represent distance between tokens |
| Rotary position embeddings | Rotate embedding dimensions based on position |
You do not need to memorize every method. The important idea is this:
Transformers need a way to know token order.
Without position information, a Transformer would struggle to distinguish sentences that use the same words in different orders.
Self-attention
Self-attention is the core idea behind Transformers.
Self-attention lets each token look at other tokens in the same sequence and decide which ones are relevant.
For example:
The weld failed because it had a crack.The word it should pay attention to weld.
The word crack may pay attention to failed, weld, and had.
Self-attention helps the model connect related words even when they are far apart.
Attention as weighted focus
Imagine reading this sentence:
The procedure was rejected because it missed required inspection steps.To understand it, you focus on procedure, not inspection.
Self-attention does something similar with numbers. For each token, the model assigns attention weights to other tokens.
Example:
| Current token | Other token | Attention weight |
|---|---|---|
it |
procedure |
0.70 |
it |
inspection |
0.10 |
it |
steps |
0.05 |
it |
other tokens | 0.15 |
The weights add up to about 1. Higher weight means the token is more important for that part of the calculation.
Query, key, and value
Transformer attention uses three learned representations:
- Query.
- Key.
- Value.
A simple way to think about them:
| Term | Question it answers |
|---|---|
| Query | What am I looking for? |
| Key | What information do I contain? |
| Value | What information should I pass along? |
For each token:
- The model creates a query vector.
- The model compares that query to key vectors from other tokens.
- Similar query-key pairs receive higher attention weights.
- The model combines value vectors using those weights.
This allows each token to build a richer representation using context from the sequence.
Scaled dot-product attention
The common attention calculation is called scaled dot-product attention.
In simplified form:
attention_scores = query dot keys
scaled_scores = attention_scores / sqrt(key_dimension)
attention_weights = softmax(scaled_scores)
output = attention_weights * valuesThe dot product measures similarity. Softmax converts scores into weights that add to 1.
The scaling step helps keep numbers stable when vectors have many dimensions.
Multi-head attention
A Transformer does not use only one attention calculation.
It uses multi-head attention.
Each attention head can learn a different kind of relationship.
For example, one head might focus on:
- Subject-verb relationships.
Another might focus on:
- Technical terms and definitions.
Another might focus on:
- Nearby words.
Another might focus on:
- Long-range references.
The outputs from multiple heads are combined, giving the model several views of the same sequence.
Transformer blocks
A Transformer is built from repeated Transformer blocks.
A simplified block includes:
input
-> self-attention
-> feed-forward neural network
-> outputModern blocks also include details such as:
- Layer normalization.
- Residual connections.
- Dropout.
- Multi-head attention.
A stack of many Transformer blocks lets the model build increasingly rich representations.
Early layers may learn simple patterns. Later layers may learn more abstract relationships.
Encoder-decoder architectures
The original Transformer architecture used an encoder and a decoder.
The encoder reads the input sequence and builds representations.
The decoder generates the output sequence.
For translation:
input: "The weld passed inspection."
output: "La soldadura pasó la inspección."The encoder processes the English sentence. The decoder generates the Spanish sentence.
Encoder
An encoder reads the full input sequence and produces contextual embeddings.
Encoder-style models are useful when the model needs to understand an input.
Common uses include:
- Text classification.
- Search embeddings.
- Document similarity.
- Named entity recognition.
- Extractive question answering.
Example model family:
BERT-style modelsBERT-style models are often called encoder-only Transformers.
Decoder
A decoder generates output tokens, usually one token at a time.
Decoder-style models are useful when the model needs to produce text.
Common uses include:
- Chat.
- Writing assistance.
- Code generation.
- Summarization.
- Open-ended question answering.
Example model family:
GPT-style modelsGPT-style models are often called decoder-only Transformers.
When generating text, a decoder is usually prevented from looking at future tokens. This is called causal masking.
For example, when predicting the next token after:
The inspection reportthe model should not be allowed to look at the answer token in advance.
Encoder-decoder models
An encoder-decoder model uses both parts.
The encoder understands the input. The decoder generates the output.
Common uses include:
- Translation.
- Summarization.
- Structured transformation of text.
- Some question answering systems.
Example model families:
T5-style models
BART-style models
Original Transformer translation modelsEncoder-decoder models are useful when the task naturally has a separate input and output sequence.
Comparing Transformer architecture types
| Architecture type | Main purpose | Common examples | Typical use |
|---|---|---|---|
| Encoder-only | Understand input | BERT-style models | Classification, embeddings, search |
| Decoder-only | Generate output | GPT-style models | Chat, generation, code completion |
| Encoder-decoder | Transform input into output | T5, BART | Translation, summarization |
Modern LLM chat systems are often based on decoder-only Transformers, though full systems may include many supporting components.
How LLMs are built using Transformers
Modern large language models are usually very large Transformer models trained on huge amounts of text and code.
The basic training objective for many decoder-only LLMs is next-token prediction.
Given:
The weld inspection foundthe model learns to predict likely next tokens:
porosity
cracking
no
...During training, the model sees many examples and adjusts its weights to make better next-token predictions.
Why next-token prediction works
At first, next-token prediction may sound simple.
But to predict the next token well, the model must learn many patterns:
- Grammar.
- Facts.
- Style.
- Code syntax.
- Reasoning patterns.
- Domain language.
- Relationships between ideas.
The model learns these patterns statistically from training data. It does not store knowledge in the same way a database does.
Typical LLM development stages
A simplified LLM development process may include:
- Pretraining. Train a large Transformer on broad text and code data using next-token prediction.
- Instruction tuning. Train the model to follow instructions using prompt-response examples.
- Preference tuning or alignment. Improve helpfulness and reduce harmful behavior using human or model feedback.
- Evaluation. Test the model on benchmarks and realistic tasks.
- Deployment. Serve the model through an application, API, or internal tool.
- Monitoring. Track quality, safety, latency, and cost.
Many organizations do not train LLMs from scratch. They use existing foundation models and adapt them through prompting, fine-tuning, tools, retrieval, or workflow design.
What LLMs are good at
LLMs can be useful for:
- Drafting text.
- Summarizing documents.
- Explaining code.
- Generating code.
- Extracting information.
- Rewriting content for different audiences.
- Conversational interfaces.
- Question answering over provided context.
LLMs also have limitations:
- They can hallucinate.
- They may be outdated.
- They may not know private company documents.
- They may produce plausible but wrong answers.
- They can be sensitive to prompt wording.
- They can reflect biases in training data.
These limitations are one reason RAG systems are useful.
The gap RAG fills
Retrieval Augmented Generation, or RAG, combines search with generation.
The gap RAG fills is this:
A language model may not know the specific, current, or private information needed to answer a question.
For example, a general LLM may not know:
- A company's internal welding procedure.
- The latest inspection policy.
- A new training manual.
- A specific dataset dictionary.
- A maintenance log from last week.
RAG lets the system retrieve relevant documents and provide them to the model as context.
RAG in one sentence
RAG works like this:
user question -> retrieve relevant documents -> give documents to LLM -> generate answer with citations or contextInstead of asking the model to rely only on memory, the system gives it relevant information at inference time.
RAG vs. fine-tuning
RAG and fine-tuning solve different problems.
| Need | Better starting point |
|---|---|
| Answer using current documents | RAG |
| Answer using private documents | RAG |
| Change model writing style | Fine-tuning or prompting |
| Teach a repeated response format | Fine-tuning or prompting |
| Add frequently changing knowledge | RAG |
| Improve behavior on a narrow task | Fine-tuning |
Fine-tuning changes model weights. RAG changes what context the model receives.
If information changes often, RAG is usually easier to update. You update the document index instead of retraining the model.
RAG system components
A basic RAG system has several parts:
- Document collection. The source documents, manuals, reports, tables, or records.
- Chunking. Split documents into smaller pieces.
- Embedding model. Convert chunks into numeric vectors.
- Vector store or search index. Store and search chunk embeddings.
- Retriever. Find relevant chunks for a user question.
- Prompt builder. Put the question and retrieved context into a prompt.
- Generator LLM. Produce the final answer.
- Evaluator. Check answer quality, retrieval quality, and safety.
Each part affects the final answer.
Chunking documents
Chunking means splitting documents into pieces small enough to retrieve and pass to the model.
For example, a 50-page manual might be split into chunks:
chunk 1: title, section 1.1
chunk 2: section 1.2
chunk 3: section 1.3
...Chunks should usually be:
- Large enough to contain useful context.
- Small enough to fit in the model prompt.
- Split at natural boundaries when possible.
- Stored with useful metadata, such as title, section, date, and source.
Bad chunking can make retrieval worse. If a chunk is too small, it may lack context. If it is too large, it may include unrelated information.
Embedding documents for retrieval
RAG often uses embeddings for search.
The system converts each chunk into an embedding:
chunk text -> embedding model -> vectorIt also converts the user's question into an embedding:
question -> embedding model -> query vectorThen it searches for chunks with vectors close to the query vector.
This is called semantic search because it can find related meaning, not just exact keyword matches.
Retrieval
The retriever selects chunks likely to help answer the question.
For example:
Question: What inspection steps are required before final weld acceptance?
Retrieved chunks:
1. Procedure section on visual inspection.
2. Procedure section on nondestructive testing.
3. Acceptance criteria table.The generator LLM then receives these chunks as context.
Retrieval quality is crucial. If the retriever finds the wrong chunks, the LLM may generate a poor answer even if the LLM is strong.
Generation with retrieved context
After retrieval, the system builds a prompt.
Example:
Use the context below to answer the question.
If the answer is not in the context, say you do not know.
Context:
[retrieved chunk 1]
[retrieved chunk 2]
[retrieved chunk 3]
Question:
What inspection steps are required before final weld acceptance?The LLM uses the provided context to generate an answer.
A good RAG prompt often tells the model:
- Use only the provided context when needed.
- Cite sources or section names.
- Say when the answer is not available.
- Avoid guessing.
- Be concise and clear.
Training a RAG system
People sometimes say "train a RAG model," but many RAG projects do not train the LLM itself.
Instead, "training a RAG system" often means building and tuning the retrieval and generation pipeline.
This may include:
- Preparing documents.
- Chunking documents.
- Choosing an embedding model.
- Building a vector index.
- Creating evaluation questions.
- Testing retrieval quality.
- Tuning chunk size and overlap.
- Tuning number of retrieved chunks.
- Improving prompts.
- Optionally fine-tuning the embedding model, reranker, or generator.
The key idea:
A RAG system can improve without changing the LLM weights.
RAG training workflow
A practical RAG development workflow:
- Collect documents. Gather manuals, procedures, FAQs, reports, or other trusted sources.
- Clean documents. Remove duplicates, fix encoding issues, and preserve headings or tables.
- Chunk documents. Split into useful passages.
- Add metadata. Store source, section, date, version, and access permissions.
- Embed chunks. Use an embedding model to convert chunks into vectors.
- Build the index. Store vectors in a vector database or search engine.
- Create test questions. Include easy, hard, ambiguous, and unanswerable questions.
- Evaluate retrieval. Check whether the right chunks are found.
- Evaluate generation. Check whether answers are correct, grounded, and clear.
- Tune the system. Adjust chunking, retrieval, reranking, prompts, or model choice.
- Monitor after deployment. Track failures, stale documents, latency, and user feedback.
This is similar to training in spirit because the system is improved using evidence, but not always by updating neural network weights.
Optional fine-tuning in RAG
Some advanced RAG systems fine-tune one or more components.
Possible fine-tuning targets:
| Component | Why fine-tune it |
|---|---|
| Embedding model | Improve retrieval for domain-specific language |
| Reranker | Better order retrieved chunks |
| Generator LLM | Improve answer style or task behavior |
Fine-tuning may help when:
- The domain vocabulary is very specialized.
- Off-the-shelf retrieval misses important documents.
- The answer format must be very consistent.
- There are enough high-quality training examples.
Fine-tuning has costs:
- More engineering effort.
- More evaluation requirements.
- Risk of overfitting.
- Need to retrain or update when data changes.
For many projects, start with ordinary RAG before fine-tuning.
RAG inference workflow
Inference means using the system to answer a new question.
A RAG inference workflow:
- Receive the user question.
- Convert the question to an embedding.
- Retrieve relevant chunks from the index.
- Optionally rerank chunks.
- Build a prompt containing the question and context.
- Send the prompt to the LLM.
- Generate an answer.
- Return the answer with citations or source references when possible.
In simplified pseudocode:
question = user_input
query_vector = embed(question)
chunks = vector_search(query_vector, top_k=5)
prompt = build_prompt(question, chunks)
answer = llm.generate(prompt)
return answerRAG inference example
Suppose the user asks:
What should I check before accepting a weld inspection image dataset?The retriever might find chunks about:
- Class counts.
- Missing labels.
- Image resolution.
- Pixel value distributions.
- Example inspection.
The prompt gives these chunks to the LLM. The model can then answer using the course materials instead of relying only on general training data.
Evaluating RAG systems
RAG systems should be evaluated at two levels:
- Retrieval quality. Did the system find the right context?
- Answer quality. Did the model answer correctly using that context?
Useful retrieval checks:
- Are relevant chunks in the top results?
- Are chunks too broad or too narrow?
- Are important documents missing from the index?
- Are stale or unauthorized documents retrieved?
Useful answer checks:
- Is the answer grounded in the retrieved context?
- Does the answer cite the right source?
- Does the model say "I don't know" when context is missing?
- Is the answer clear and useful?
- Does it avoid unsupported claims?
Common RAG failure modes
Here are a few traps to avoid:
- Poor chunking. The answer is split across chunks or buried in large chunks.
- Weak retrieval. The retriever finds related but not actually useful passages.
- Too much context. The prompt includes many chunks, making the answer noisy.
- Missing documents. The correct source was never indexed.
- Stale documents. The system retrieves old procedures.
- No source control. Users cannot tell where the answer came from.
- Hallucination. The LLM answers beyond the retrieved evidence.
- Permission problems. Users retrieve documents they should not see.
RAG is powerful, but it does not remove the need for careful data management and evaluation.
RAG vs. ordinary search
Ordinary search returns documents or passages.
RAG returns an answer generated from retrieved documents.
| System | Output |
|---|---|
| Keyword search | List of matching documents |
| Semantic search | List of meaning-related chunks |
| RAG | Generated answer using retrieved chunks |
RAG can save users time, but it also creates risk if the generated answer is not faithful to the sources.
When to use RAG
RAG is a good fit when:
- The answer depends on private documents.
- The information changes over time.
- Users need answers grounded in source material.
- The model needs access to long documents.
- Citations or source references matter.
- You want to avoid retraining for every document update.
RAG may not be the best fit when:
- The task does not require external knowledge.
- The source documents are poor quality.
- The correct answer requires complex calculations better handled by tools.
- The system cannot safely manage document permissions.
Putting it together
A modern LLM application may combine several pieces:
user interface
-> prompt builder
-> retriever
-> vector database
-> LLM
-> response evaluator or guardrails
-> final answerThe LLM is only one part of the system.
For reliable applications, the surrounding system matters:
- Data quality.
- Document permissions.
- Retrieval quality.
- Prompt design.
- Evaluation.
- Monitoring.
- User feedback.
Common mistakes
Here are a few traps to avoid:
- Thinking tokens are the same as words. Tokens may be words, word pieces, punctuation, or other units.
- Ignoring positional information. Transformers need a way to represent order.
- Treating attention as a perfect explanation. Attention weights can be informative but are not always a complete explanation of model behavior.
- Using RAG as a substitute for clean documents. Bad source material leads to bad answers.
- Assuming RAG eliminates hallucination. RAG reduces hallucination risk when retrieval and prompts are good, but it does not guarantee correctness.
- Fine-tuning when retrieval would solve the problem. If knowledge changes often, RAG may be easier and safer.
- Skipping evaluation. RAG systems need retrieval tests and answer-quality tests.
- Ignoring access control. Retrieval must respect document permissions.
Summary
Transformers are neural network architectures designed for sequence data. They use tokenization to split text into tokens, embeddings to represent tokens as vectors, positional encoding to preserve order, and self-attention to connect related tokens across a sequence.
Encoder-only Transformers are often used for understanding tasks such as classification and embeddings. Decoder-only Transformers are often used for generation and are the foundation of many modern LLMs. Encoder-decoder Transformers are useful when an input sequence is transformed into an output sequence, such as translation or summarization.
Modern LLMs are built from large stacks of Transformer blocks and are often trained with next-token prediction, followed by instruction tuning and alignment. They are powerful but can hallucinate, be outdated, or lack access to private documents.
RAG fills the gap between a general language model and specific source material. It retrieves relevant documents at inference time and gives them to the LLM as context. Building a RAG system usually means preparing documents, chunking, embedding, indexing, retrieving, prompting, evaluating, and tuning the pipeline.
| Topic | Key ideas |
|---|---|
| Tokenization | Splits text into tokens or subword pieces |
| Embeddings | Numeric vectors representing tokens or chunks |
| Positional encoding | Gives the model information about token order |
| Self-attention | Lets tokens focus on relevant tokens in the same sequence |
| Multi-head attention | Learns several relationship patterns at once |
| Encoder | Builds contextual representations of input |
| Decoder | Generates output tokens |
| Encoder-decoder | Reads input and generates output |
| LLM | Large Transformer-based model trained on text and code |
| RAG | Retrieves documents and uses them as context for generation |
| RAG training | Builds and tunes retrieval, prompts, evaluation, and optional fine-tuning |
| RAG inference | Retrieves relevant chunks and generates an answer from context |
Practice Questions
Practice Questions
- In your own words, what is a Transformer?
- What is tokenization?
- Why do modern LLMs often use subword tokenization?
- What is a token ID?
- What is an embedding?
- Why are embeddings useful?
- Why do Transformers need positional encoding?
- What problem does self-attention solve?
- In attention, what are queries, keys, and values?
- What does softmax do in an attention calculation?
- What is multi-head attention?
- What is a Transformer block?
- What is the difference between an encoder and a decoder?
- What is an encoder-only model commonly used for?
- What is a decoder-only model commonly used for?
- What is an encoder-decoder model commonly used for?
- Why are many modern LLMs based on decoder-only Transformers?
- What is next-token prediction?
- Why can next-token prediction teach useful language patterns?
- Name two limitations of LLMs.
- What gap does RAG fill?
- How is RAG different from fine-tuning?
- What is chunking in a RAG system?
- Why is chunk size important?
- What does an embedding model do in RAG?
- What does the retriever do?
- What happens during RAG inference?
- What are two ways to evaluate a RAG system?
- Name three common RAG failure modes.
- Create a short workflow for building a RAG system over a folder of internal procedure documents.