LLM basics — GPT-style models
Before we begin
A Large Language Model (LLM) in the ChatGPT sense is usually a decoder-only transformer trained on huge text to predict the next token.
It does not “look up answers” in a database by default — it continues text in a plausible way.
Figure
Autoregressive generation
What you will learn
- Describe pretraining vs inference.
- Explain the token-by-token generation loop.
- Map chat roles (system / user / assistant) to prompts.
Before this lesson
Pretraining
Goal: minimize next-token loss on internet-scale text (books, code, forums — curated and filtered).
Result: weights that encode grammar, facts ( imperfectly ), reasoning patterns, and style.
Not included: your private company PDFs unless you add RAG or fine-tuning later.
Inference (what users see)
- You send a prompt (chat messages).
- Model outputs logits for the next token.
- Pick a token (greedy or sampled — Lesson 3).
- Append token to context; repeat until stop token or limit.
Context window caps total tokens (prompt + completion).
Chat API shape
Typical messages:
| Role | Purpose |
|---|---|
| system | Global instructions |
| user | Human question |
| assistant | Model reply (also in history for multi-turn) |
The API serializes roles into one token sequence the model was fine-tuned to follow (instruction tuning / RLHF — reinforcement learning from human feedback; awareness level only).
Capabilities vs limits
Strong at: drafting, summarizing, coding patterns, explaining concepts, format-following with good prompts.
Weak at: guaranteed facts, math without verification, private data not in context, real-time events unless tools/RAG added.
Embeddings in the LLM stack
Two related uses (quiz often conflates them):
- Input token embeddings — inside the LLM (Module 6).
- Retrieval embeddings — separate model maps sentences to vectors for search (Module 7 RAG).
Same word, two roles in production apps.