LLM basics — GPT-style models

Before we begin

A Large Language Model (LLM) in the ChatGPT sense is usually a decoder-only transformer trained on huge text to predict the next token.

It does not “look up answers” in a database by default — it continues text in a plausible way.

Figure

Autoregressive generation

Each new token is conditioned on everything before it.

What you will learn

Describe pretraining vs inference.
Explain the token-by-token generation loop.
Map chat roles (system / user / assistant) to prompts.

Before this lesson

Pretraining

Goal: minimize next-token loss on internet-scale text (books, code, forums — curated and filtered).

Result: weights that encode grammar, facts ( imperfectly ), reasoning patterns, and style.

Not included: your private company PDFs unless you add RAG or fine-tuning later.

Inference (what users see)

You send a prompt (chat messages).
Model outputs logits for the next token.
Pick a token (greedy or sampled — Lesson 3).
Append token to context; repeat until stop token or limit.

Context window caps total tokens (prompt + completion).

Chat API shape

Typical messages:

Role	Purpose
system	Global instructions
user	Human question
assistant	Model reply (also in history for multi-turn)

The API serializes roles into one token sequence the model was fine-tuned to follow (instruction tuning / RLHF — reinforcement learning from human feedback; awareness level only).

Capabilities vs limits

Strong at: drafting, summarizing, coding patterns, explaining concepts, format-following with good prompts.

Weak at: guaranteed facts, math without verification, private data not in context, real-time events unless tools/RAG added.

Embeddings in the LLM stack

Two related uses (quiz often conflates them):

Input token embeddings — inside the LLM (Module 6).
Retrieval embeddings — separate model maps sentences to vectors for search (Module 7 RAG).

Same word, two roles in production apps.

What's next

Lesson 2 — LLM training lifecycle