A function. Tokens in, probabilities out.
The goal is to take strings and feed them into a language model. That requires converting text into integers from a fixed vocabulary.
Those integers are the token IDs. They are lookup keys, not meanings by themselves.
The same text round-trips: encode → integer IDs → decode. The model never sees raw text.
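The round trip can be sketched with a toy character-level tokenizer. This is a minimal illustration, not a real tokenizer: the names `sampleText`, `charVocab`, `encodeChars` and `decodeChars` are made up for this example, and the vocabulary is built from a single short string.

```typescript
// Toy character-level tokenizer (hypothetical names, illustrative only).
// The vocabulary is every distinct character in one sample string, sorted.
const sampleText = "queen and king";
const charVocab: string[] = Array.from(new Set(sampleText)).sort();
const charToId = new Map<string, number>();
charVocab.forEach((ch, id) => charToId.set(ch, id));

// Text -> integer IDs. Unknown characters are an error in this sketch.
function encodeChars(text: string): number[] {
  return Array.from(text).map((ch) => {
    const id = charToId.get(ch);
    if (id === undefined) throw new Error(`Character not in vocabulary: ${ch}`);
    return id;
  });
}

// Integer IDs -> text. IDs are only lookup keys into the vocabulary.
function decodeChars(ids: number[]): string {
  return ids
    .map((id) => {
      const ch = charVocab[id];
      if (ch === undefined) throw new Error(`Token ID out of range: ${id}`);
      return ch;
    })
    .join("");
}

const queenIds = encodeChars("queen"); // five IDs, one per character
const queenText = decodeChars(queenIds); // back to "queen"
```

The model would only ever receive `queenIds`; the string exists before encoding and after decoding.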
Many behaviours that look like model intelligence failures trace back to the input representation.
Does the string <|endoftext|> make the model behave oddly? Tokenization.
Character-level tokenization is the naive baseline. BPE starts small, then repeatedly merges frequent adjacent pieces.
Unusual words get split into more sub-word pieces — more tokens, higher cost, weaker representation.
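The merge idea behind BPE can be sketched in a few lines. This is a simplified illustration under stated assumptions: it works on one string of characters rather than a byte-level corpus, and the function names (`mostFrequentPair`, `mergePair`, `trainBpe`) are hypothetical, not taken from any tokenizer library.

```typescript
// Simplified BPE training on a single string (illustrative only).
type Pair = [string, string];

// Count adjacent token pairs and return the most frequent one (first seen wins ties).
function mostFrequentPair(tokens: string[]): Pair | null {
  const counts = new Map<string, number>();
  let best: Pair | null = null;
  let bestCount = 0;
  for (let i = 0; i + 1 < tokens.length; i++) {
    const key = tokens[i] + "\u0000" + tokens[i + 1];
    const count = (counts.get(key) ?? 0) + 1;
    counts.set(key, count);
    if (count > bestCount) {
      bestCount = count;
      best = [tokens[i], tokens[i + 1]];
    }
  }
  return best;
}

// Replace every non-overlapping occurrence of the pair with one merged token.
function mergePair(tokens: string[], pair: Pair): string[] {
  const merged: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    if (i + 1 < tokens.length && tokens[i] === pair[0] && tokens[i + 1] === pair[1]) {
      merged.push(pair[0] + pair[1]);
      i += 2;
    } else {
      merged.push(tokens[i]);
      i += 1;
    }
  }
  return merged;
}

// Start from single characters and apply up to `numMerges` merges.
function trainBpe(text: string, numMerges: number): string[] {
  let tokens = Array.from(text);
  for (let m = 0; m < numMerges; m++) {
    const pair = mostFrequentPair(tokens);
    if (pair === null) break;
    tokens = mergePair(tokens, pair);
  }
  return tokens;
}
```

Each merge shortens the sequence for frequent patterns. Strings made of rare patterns benefit less, which is why unusual words end up as more pieces.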
A vocabulary is the tokenizer's fixed menu of possible token types. Tokenizers turn text into token IDs: integer numbers from that vocabulary.
Small vocabulary: ["q", "u", "e", "n", "k", "i", "g"]
queen → q + u + e + e + n = 5 token instances
Larger vocabulary: ["queen", "king", "Elizabeth", "Charles"]
queen → queen = 1 token instance
| Tokenizer type | Vocabulary size | Tokens needed |
|---|---|---|
| Character-level | Small | Many |
| Word/subword-level | Larger | Fewer |
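The queen example above can be reproduced with a small sketch. It assumes greedy longest-match lookup against a fixed vocabulary, which is a simplification chosen for illustration; real BPE tokenizers apply learned merge rules instead. `tokenizeGreedy` is a hypothetical helper.

```typescript
// Greedy longest-match tokenization against a fixed vocabulary (illustrative only).
function tokenizeGreedy(text: string, vocab: string[]): string[] {
  const pieces: string[] = [];
  let pos = 0;
  while (pos < text.length) {
    let match = "";
    for (const entry of vocab) {
      // Keep the longest vocabulary entry that matches at the current position.
      if (entry.length > match.length && text.startsWith(entry, pos)) match = entry;
    }
    if (match === "") throw new Error(`No vocabulary entry matches at position ${pos}`);
    pieces.push(match);
    pos += match.length;
  }
  return pieces;
}

const smallVocab = ["q", "u", "e", "n", "k", "i", "g"];
const largeVocab = ["queen", "king", "Elizabeth", "Charles"];

const smallPieces = tokenizeGreedy("queen", smallVocab); // one piece per character
const largePieces = tokenizeGreedy("queen", largeVocab); // a single piece
```

The same text costs five token instances under the small vocabulary and one under the larger one.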
Tokenizers are design decisions. They are optimized around the model's training data, target languages, code support, context length, and deployment cost.
Because models tokenize differently, they also count tokens differently: the same text can be split into different chunks before the neural network runs.
GPT-2 struggled more than necessary with Python partly because its tokenizer handled whitespace poorly, spending many tokens on indentation. Newer tokenizers encode code and multilingual text more efficiently.
Most providers price input and output tokens differently, and output tokens usually cost more, so tokenization directly affects budgets.
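A rough budget calculation follows directly from token counts. The prices below are invented placeholders, not any provider's actual rates, and `estimateCostUsd` is a hypothetical helper.

```typescript
// Hypothetical per-million-token prices in USD (placeholders, not real rates).
interface TokenPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of one call, given how many tokens went in and how many came out.
function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: TokenPricing): number {
  return (inputTokens * pricing.inputPerMillion + outputTokens * pricing.outputPerMillion) / 1000000;
}

const placeholderPricing: TokenPricing = { inputPerMillion: 2, outputPerMillion: 8 };
const exampleCost = estimateCostUsd(1000000, 500000, placeholderPricing); // 2 + 4 = 6 USD
```

A tokenizer that needs twice as many tokens for the same text doubles both terms of this sum.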
After tokenisation, each token ID is used to look up a vector in the model's embedding table.
This vector is the token's starting representation before the transformer layers process it.
The embedding table starts randomly and is learned during pre-training. Tokens that appear in similar contexts tend to acquire related vectors.
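The lookup itself is just row selection. In this minimal sketch the table has three rows of three made-up numbers; in a real model it has one row per vocabulary entry and the values are learned. `embeddingTable` and `embed` are hypothetical names.

```typescript
// Toy embedding table: row i is the vector for token ID i (values are made up).
const embeddingTable: number[][] = [
  [0.1, -0.3, 0.5],
  [0.7, 0.2, -0.1],
  [-0.4, 0.9, 0.0],
];

// Map a sequence of token IDs to their starting vectors.
function embed(tokenIds: number[]): number[][] {
  return tokenIds.map((id) => {
    const row = embeddingTable[id];
    if (row === undefined) throw new Error(`Token ID out of range: ${id}`);
    return row;
  });
}

const embedded = embed([2, 0]); // two vectors, in input order
```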
A transformer is the building block underneath GPT-style language models. It takes a sequence of token embeddings and repeatedly updates each token's representation using the surrounding context.
The key idea is attention: each token can weight other tokens in the context when deciding what information matters.
Stack many transformer layers, train them to predict the next token, and you get the core architecture behind modern LLMs.
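The weighting step in attention can be sketched for a single query. This is a stripped-down illustration of scaled dot-product attention: it assumes the query, key and value vectors have already been computed, and it omits the learned projections, multiple heads and causal masking that a real transformer layer has. All function names are hypothetical.

```typescript
// Turn raw scores into weights that are positive and sum to 1.
function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// One query attends over all keys; the output is a weighted average of the values.
function attend(query: number[], keys: number[][], values: number[][]): number[] {
  const scale = Math.sqrt(query.length);
  const weights = softmax(keys.map((k) => dot(query, k) / scale));
  const dim = values[0].length;
  const out = new Array<number>(dim).fill(0);
  weights.forEach((w, t) => {
    for (let d = 0; d < dim; d++) out[d] += w * values[t][d];
  });
  return out;
}
```

When two keys match the query equally well, the output is the plain average of their values; when one key matches much better, its value dominates.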
Here is an example from FineWeb. The dataset has been tokenized with a vocabulary of 100,277 token types.
The text is now a long sequence of integer token IDs. Those IDs are what the model trains on.
The neural network takes windows of token IDs and learns to predict the next token.
In this example, the first four tokens serve as the context for predicting the fifth token.
A larger window means a larger context size: the model can pay attention to more previous tokens when making the next prediction.
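Slicing a token stream into (context, target) training pairs can be sketched as follows. The token IDs are made up and `makeExamples` is a hypothetical helper; real training pipelines batch these windows and usually predict a target at every position in parallel.

```typescript
interface TrainingExample {
  context: number[]; // the tokens the model sees
  target: number; // the token it should predict next
}

// Slide a window of `contextSize` tokens along the stream, one position at a time.
function makeExamples(tokenIds: number[], contextSize: number): TrainingExample[] {
  const examples: TrainingExample[] = [];
  for (let start = 0; start + contextSize < tokenIds.length; start++) {
    examples.push({
      context: tokenIds.slice(start, start + contextSize),
      target: tokenIds[start + contextSize],
    });
  }
  return examples;
}

// Made-up token IDs; with contextSize 4, the first four predict the fifth.
const windowExamples = makeExamples([11, 22, 33, 44, 55, 66], 4);
```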
During pre-training, backpropagation updates the network weights to increase the probability of the correct next token.
At each step, the model outputs one probability for every possible token in the vocabulary.
Mathematically, the network is one large function: it combines the input token IDs with the learned weights to produce those probabilities.
The transformer takes the entire sequence of token embeddings and, at each step, predicts a probability distribution over the next token.
It generates one token at a time. At each step it looks at the prompt plus everything it has already generated, picks a token, appends it, and repeats.
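The generation loop can be sketched with a stand-in model. Everything here is an assumption for illustration: `NextTokenScores` is a hypothetical type for any function that returns one score per vocabulary entry, `toyModel` is a trivial rule rather than a neural network, and the loop always picks the highest-scoring token (greedy decoding).

```typescript
// A stand-in "model": token sequence in, one score per vocabulary entry out.
type NextTokenScores = (tokens: number[]) => number[];

function generateGreedy(
  model: NextTokenScores,
  promptIds: number[],
  maxNewTokens: number,
  eosId: number,
): number[] {
  const tokens = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    const scores = model(tokens); // sees the prompt plus everything generated so far
    let best = 0;
    for (let id = 1; id < scores.length; id++) {
      if (scores[id] > scores[best]) best = id;
    }
    tokens.push(best); // append the chosen token and repeat
    if (best === eosId) break; // stop at the end-of-sequence token
  }
  return tokens;
}

// Toy model over a 4-token vocabulary: always prefers (last token + 1) mod 4.
const toyModel: NextTokenScores = (tokens) => {
  const next = (tokens[tokens.length - 1] + 1) % 4;
  return [0, 1, 2, 3].map((id) => (id === next ? 1 : 0));
};

const generated = generateGreedy(toyModel, [0], 10, 3);
```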
Example: DeepSeek V3.1 at temperature 0. "garage" wins with 85% probability, but "boardroom" (4%) and "corporate" (3%) were also candidates. Tiny logit differences can flip the choice.
The model produces logits (raw scores) for every token in the vocabulary. Temperature controls how those logits are converted to probabilities:
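A common formulation divides each logit by the temperature T before applying softmax, so p_i = exp(z_i / T) / Σ_j exp(z_j / T). The sketch below assumes that formulation; `softmaxWithTemperature` is a hypothetical helper, and it treats T = 0 as a separate case (greedy argmax) rather than dividing by zero.

```typescript
// Convert logits to probabilities at a given temperature.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new Error("temperature must be > 0; handle 0 separately as greedy argmax");
  }
  const scaled = logits.map((z) => z / temperature);
  const maxScaled = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - maxScaled));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const exampleLogits = [2, 1, 0];
const probsDefault = softmaxWithTemperature(exampleLogits, 1);
const probsCold = softmaxWithTemperature(exampleLogits, 0.5); // sharper: top token gains mass
const probsHot = softmaxWithTemperature(exampleLogits, 2); // flatter: mass spreads out
```

Lower temperatures concentrate probability on the highest logit; higher temperatures move the distribution towards uniform.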
Other sampling parameters:
Every API call to an LLM is stateless: no memory, no continuity, no knowledge of what happened before unless it is explicitly provided again.
ChatGPT and similar systems can feel as if the model remembers past conversations. That memory is an application illusion. The model only sees what is inside the current request; any "memory" is previous messages, saved notes, or summaries being resent as context.
Because the model itself is stateless, intelligence shifts upward into the application. The application decides what to include, what to exclude or summarise, what to retrieve and prioritise, and what to forget.
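The application-side "memory" can be sketched as a transcript that is trimmed and resent on every turn. This is an illustration only: `fakeModelCall` stands in for a real API call, and keeping the system prompt plus the most recent messages is one possible policy among many (summarising or retrieving older messages are others).

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Stand-in for a stateless model call: it only knows what is in `messages`.
function fakeModelCall(messages: ChatMessage[]): string {
  return `I can see ${messages.length} message(s).`;
}

// The application decides what to include: system prompt plus the most recent messages.
function buildRequest(transcript: ChatMessage[], maxRecent: number): ChatMessage[] {
  const system = transcript.filter((m) => m.role === "system");
  const rest = transcript.filter((m) => m.role !== "system");
  const recent = maxRecent > 0 ? rest.slice(-maxRecent) : [];
  return [...system, ...recent];
}

// One turn: append the user message, resend the trimmed transcript, store the reply.
function sendTurn(transcript: ChatMessage[], userText: string, maxRecent: number): ChatMessage[] {
  const withUser: ChatMessage[] = [...transcript, { role: "user", content: userText }];
  const reply = fakeModelCall(buildRequest(withUser, maxRecent));
  const withReply: ChatMessage[] = [...withUser, { role: "assistant", content: reply }];
  return withReply;
}
```

Anything dropped by `buildRequest` is invisible to the model on that turn, even though the application still stores it.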
Every training decision leaks into your research outputs.
Base models are trained via self-supervised learning: predict the next token across an enormous text corpus.
This is how knowledge gets compressed into the model: terabytes of text are distilled into a fixed set of parameters that can then generate text. Frontier-scale training runs are reported to cost hundreds of millions of dollars or more in compute.
Base models learn language patterns and broad world knowledge, but are not tuned to follow instructions. A base model will complete your text, not answer your question.
What's in the corpus matters:
Supervised fine-tuning (SFT) is the first post-training step: fine-tune the base model on curated instruction → response pairs written by humans.
Quality over quantity: relatively small, high-quality datasets yield strong instruction following. This is imitation learning — the model learns to mimic the demonstrations.
After SFT:
RLHF as the central alignment loop
Reinforcement learning from human feedback (RLHF) is used for domains where the desired output is partly normative or judgement-based. In many tasks, we do not just want a factually correct answer; we want an answer that is useful, safe, calibrated, non-toxic, non-deceptive, and appropriate to the user's request.
The key object in reinforcement learning is the policy: the system being trained to choose actions. In an LLM, the policy is the model itself. It takes the current context — prompt plus generated tokens — and produces a probability distribution over the next token.
Each generated answer is a trajectory through this policy. The model chooses one token, appends it to the context, chooses the next token, and repeats. RLHF updates the model weights so trajectories judged to be better become more probable in future.
The reward signal usually comes indirectly. Humans compare candidate answers; a reward model learns to predict those preferences; reinforcement learning then optimises the LLM against that reward model. The result is a model whose probability distribution has shifted towards preferred forms of behaviour.
Related post-training methods: DPO, RLAIF, RLVR/GRPO.
Step 1: Sample prompts and ask humans to write good responses. These demonstrations fine-tune the pre-trained base model in a supervised way.
Step 2: Generate several answers per prompt from the SFT model. Humans rank them; those rankings train a reward model that predicts preference scores.
Step 3: Use the reward model to update the SFT model, often with PPO, so higher-scoring responses become more likely while the model stays close to its starting behaviour.
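Training a reward model from human comparisons is often framed as a pairwise objective. The sketch below assumes one common choice, a Bradley-Terry style loss of the form -log σ(r_chosen - r_rejected); the source does not say which loss is used, and `pairwisePreferenceLoss` is a hypothetical name. The reward scores are plain numbers here, whereas in practice they come from a neural network.

```typescript
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// Loss is small when the human-preferred answer already has the higher reward score.
function pairwisePreferenceLoss(rewardChosen: number, rewardRejected: number): number {
  return -Math.log(sigmoid(rewardChosen - rewardRejected));
}

const lossAgrees = pairwisePreferenceLoss(2, 0); // reward model agrees with the human ranking
const lossUnsure = pairwisePreferenceLoss(0, 0); // no preference: equals ln 2
const lossDisagrees = pairwisePreferenceLoss(0, 2); // reward model disagrees: largest loss
```

Minimising this loss over many ranked pairs pushes the reward model to score preferred answers higher.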
RLVR · GRPO
Reinforcement learning with verifiable rewards (RLVR) uses tasks where correctness can be objectively verified: maths, code, logic. A deterministic verifier returns a binary reward: 1 if correct, 0 otherwise.
No human or AI labellers, so no subjective labelling bias. Some research suggests RLVR implicitly incentivises correct chain-of-thought reasoning, because the most reliable way to get right answers consistently is to produce valid intermediate steps.
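A deterministic verifier can be very small. This sketch assumes a made-up convention in which the model must end its completion with a line of the form `Answer: <number>`; both that format and the name `verifiableReward` are hypothetical, and real verifiers for code or proofs run tests or checkers instead.

```typescript
// Binary reward: 1 if the final "Answer: <number>" matches the expected value, else 0.
function verifiableReward(completion: string, expected: number): 0 | 1 {
  const match = /Answer:\s*(-?\d+(?:\.\d+)?)\s*$/.exec(completion.trim());
  if (match === null) return 0; // wrong format counts as incorrect
  return Number(match[1]) === expected ? 1 : 0;
}

const rewardCorrect = verifiableReward("17 * 3 = 51\nAnswer: 51", 51);
const rewardWrong = verifiableReward("17 * 3 = 50\nAnswer: 50", 51);
const rewardNoFormat = verifiableReward("I think it is 51", 51);
```

Only the final answer is checked; the intermediate reasoning is rewarded indirectly, through whether it leads to a correct result.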
Post-training is how we get LLMs to behave in particular ways. It refines behaviour after pre-training, and has been more productive than many expected.
GRPO (Group Relative Policy Optimisation) supports RLVR at scale. It estimates advantages without training a separate critic: multiple rollouts per prompt are grouped, and each rollout's reward is normalised by the group's mean and standard deviation to give a relative advantage.
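The group-relative advantage can be sketched as a z-score within each group of rollouts. This is a minimal illustration under stated assumptions: it uses the population standard deviation and a small `epsilon` to avoid division by zero, both of which are choices made here rather than details given in the source, and `groupRelativeAdvantages` is a hypothetical name.

```typescript
// Advantage of each rollout relative to the other rollouts for the same prompt.
function groupRelativeAdvantages(rewards: number[], epsilon = 1e-8): number[] {
  const mean = rewards.reduce((a, b) => a + b, 0) / rewards.length;
  const variance = rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / rewards.length;
  const std = Math.sqrt(variance);
  return rewards.map((r) => (r - mean) / (std + epsilon));
}

// Four rollouts for one prompt, scored by a binary verifier.
const groupAdvantages = groupRelativeAdvantages([1, 0, 0, 1]);
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage. If every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.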
Post-training is usually cheaper than pre-training, but not always: high-compute RL systems can still be extremely expensive.
Figure 2 shows that LLM personas place far more variance on the first principal component than human GSS respondents, meaning their answers collapse too strongly onto a single ideological axis.
This is the kind of unidimensionality we should expect from alignment-style post-training: models are rewarded for coherent, normatively legible responses, but real people hold messier, cross-cutting, and only weakly constrained belief systems.
Source: "Synthetic personas distort the structure of human belief systems" by Chris Barrie.
The model's "knowledge" is the training data. If a population, language, or domain is underrepresented, outputs will be weaker — or confidently wrong.
Sycophancy, refusal, verbosity are not bugs — they're trained behaviours. They will systematically affect any study that treats model outputs as data.
Training data is undisclosed. Alignment methods are partially documented. You cannot fully audit the instrument you are using. Report what you can; acknowledge what you can't.
Where computation happens, and how models reason at answer time.
To run a model locally, you need access to the model's weights: the trained numerical parameters that encode what the model learned. Open-weight models make those parameters downloadable under a licence; closed models keep them inside the provider's infrastructure.
Run the model on your own hardware. Ollama, llama.cpp, vLLM. Your data never leaves your machine.
Usually requires a capable GPU, although small models can run on CPU. Limited to open-weight models.
OpenAI, Anthropic, Google host the model. You send data to their servers. Subject to their data retention and usage policies.
Easiest option. Best models available here.
Azure OpenAI, AWS Bedrock, GCP Vertex. Same models, hosted in your cloud region with enterprise data controls.
Zero-retention options. GDPR-compliant deployments possible.
| Dimension | Open-weight models | Closed/proprietary models |
|---|---|---|
| Examples | DeepSeek V3/V4/R1, Qwen3/Qwen3.5, Kimi K2/K2.6, GLM-4.5/GLM-5, Llama 4, Mistral | GPT-5.5, Claude Opus 4.8, Gemini 3.5 |
| Where they run | Local machine, institutional server, private cloud, or third-party inference provider | Provider-hosted service/API |
| Access to weights | Model weights are downloadable, subject to licence | Weights are not available |
| Data control | High if self-hosted; lower if using hosted inference | Depends on provider terms, contract, logging and retention settings |
| Reproducibility | Stronger if model, tokenizer, quantisation, runtime and prompts are pinned | Weaker: provider may update, retire, or alter models/API behaviour |
| Auditability | Better: model artefacts and deployment can be inspected/versioned | Limited: model internals and system changes are mostly opaque |
| Performance | Often competitive; usually behind frontier models on hardest tasks | Best frontier capability usually here |
| Cost/effort | More setup, hardware and maintenance | Easier to use; pay per use/subscription |
| Best use case | Sensitive data, reproducible pipelines, institutional control | Highest performance, rapid prototyping, general-purpose assistance |
Learning during inference — in-context learning — does not change model weights. It changes the context the model conditions on: instructions, prompts, examples, retrieved documents, and prior messages.
Parametric learning updates weights. In-context learning does not — and it absolutely works. Transformers are conditional next-token predictors over a sequence; give them the right sequence and behaviour can change radically without touching the parameters.
This is the foundation of prompt engineering:
The intelligence lives in the static parameters, but the apparent capability depends heavily on what is fed into the context window. Context management, prompt engineering, instruction tuning, and few-shot examples exploit that fact.
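Few-shot prompting is in-context learning in its simplest form: the examples go into the context, and the weights stay untouched. The sketch below assembles such a prompt as plain text; the `Input:`/`Output:` layout and the name `buildFewShotPrompt` are assumptions made for illustration.

```typescript
interface FewShotExample {
  input: string;
  output: string;
}

// Instruction, then worked examples, then the new input with an empty output slot.
function buildFewShotPrompt(instruction: string, examples: FewShotExample[], query: string): string {
  const shots = examples.map((ex) => `Input: ${ex.input}\nOutput: ${ex.output}`);
  return [instruction, ...shots, `Input: ${query}\nOutput:`].join("\n\n");
}

const fewShotPrompt = buildFewShotPrompt(
  "Classify the sentiment.",
  [{ input: "great", output: "positive" }],
  "awful",
);
```

Changing the examples changes what the model conditions on, and therefore its behaviour, without any training step.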
Early chatbots mostly tried to answer directly: prompt in, answer out.
Reasoning models change the allocation of compute. They spend more inference-time computation before producing the final answer: decomposing the problem, checking constraints, exploring alternatives, or using tools.
This does not require a separate symbolic reasoner. Mechanically, the model is still generating tokens. What changes is that some tokens are used to structure the path to the answer, not just to state the answer.
Chain-of-thought was the first visible version of this idea. Modern reasoning models often perform some of this intermediate work in hidden reasoning tokens.
User prompt → hidden reasoning tokens → visible answer tokens. The reasoning is trained, not emergent.
Reasoning strategies are different ways of using extra inference-time computation.
The common pattern is the same: the system spends more computation before committing to an answer.
Reasoning becomes more powerful when the model can call tools.
ReAct means reasoning + acting: the model alternates between internal reasoning and external actions, such as searching the web, querying a database, running code, or retrieving documents.
PAL means Program-Aided Language Models: instead of solving everything in natural language, the model writes code and lets a reliable executor do the calculation.
This is the bridge from chatbots to agents: inference is no longer just text generation; it becomes a loop of thinking, acting, observing, and deciding when to stop.
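The think, act, observe loop can be sketched with a scripted policy in place of a model. Everything here is hypothetical scaffolding: `Policy` stands in for an LLM deciding the next step, `toyTools` holds a single calculator-like tool, and the step budget is one simple stopping rule.

```typescript
type AgentStep =
  | { kind: "act"; tool: string; input: string }
  | { kind: "finish"; answer: string };
type Policy = (observations: string[]) => AgentStep;
type Tool = (input: string) => string;

function runAgent(policy: Policy, tools: Record<string, Tool>, maxSteps: number): string {
  const observations: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const decision = policy(observations); // think: choose the next step
    if (decision.kind === "finish") return decision.answer; // decide when to stop
    const tool = tools[decision.tool];
    const observation =
      tool === undefined ? `Unknown tool: ${decision.tool}` : tool(decision.input); // act
    observations.push(observation); // observe: the result feeds the next decision
  }
  return "Stopped: step budget exhausted";
}

// A reliable executor does the arithmetic instead of the language model.
const toyTools: Record<string, Tool> = {
  multiply: (input) => {
    const [a, b] = input.split("*").map(Number);
    return String(a * b);
  },
};

// Scripted stand-in for a model: call the tool once, then answer from the observation.
const scriptedPolicy: Policy = (observations) =>
  observations.length === 0
    ? { kind: "act", tool: "multiply", input: "17*3" }
    : { kind: "finish", answer: `The product is ${observations[0]}` };

const agentAnswer = runAgent(scriptedPolicy, toyTools, 5);
```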
Decoding stops when:
Inference is not a neutral button press. The same model can behave differently depending on the prompt, context, tools, reasoning budget, sampling settings, and provider implementation.
Reasoning tokens, tool calls, and long contexts make inference more expensive than a simple prompt-response exchange.
"We used GPT-4" is not enough. Results can depend on model version, prompt, settings, tool access, context window, retrieval corpus, and date of access.
Local inference can keep data inside your infrastructure; cloud inference sends data to a provider or hosted service.
What you control — and what you don't — at each layer.
ChatGPT, Claude.ai, Gemini: the interface most people use. But the interaction is not just "you + model".
On every turn, the app may add or change things you do not see:
You often cannot fully observe or fix:
You can study chatbot interactions, but you cannot assume they are exactly reproducible unless the platform exposes and fixes the relevant configuration.
Direct access to the model. You send a JSON request, you get a JSON response. Far less is hidden than in a chat app.
You control:
You get back:
```shell
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1",
    "input": [
      {
        "role": "developer",
        "content": "You are a survey methodologist. Review survey questions for clarity, bias, construct validity, recall burden, and response option problems."
      },
      {
        "role": "user",
        "content": "Review this survey question: In the last 12 months, how often have you used AI tools such as ChatGPT, Claude, Gemini, or Copilot for work or study?"
      }
    ],
    "reasoning": {
      "effort": "medium"
    },
    "text": {
      "format": {
        "type": "json_schema",
        "name": "survey_question_review",
        "schema": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "construct": { "type": "string" },
            "problems": {
              "type": "array",
              "items": { "type": "string" }
            },
            "revised_question": { "type": "string" },
            "reporting_note": { "type": "string" }
          },
          "required": [
            "construct",
            "problems",
            "revised_question",
            "reporting_note"
          ]
        },
        "strict": true
      }
    },
    "metadata": {
      "project": "AI workshop demo",
      "task": "survey_question_review"
    }
  }'
```
A thin wrapper around the API in Python, TypeScript, etc. Same control, less boilerplate.
SDKs make it easier to add batching, retries, streaming, schema validation, and structured outputs. But structured outputs are not a separate interaction mode: they are an option available through APIs and SDKs.
```typescript
import OpenAI from "openai";

// The API key is read from the environment, never hard-coded.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const response = await client.responses.create({
  model: "gpt-5.1",
  input: [
    {
      // Developer message: sets the role and the review criteria.
      role: "developer",
      content:
        "You are a survey methodologist. Review survey questions for clarity, bias, construct validity, recall burden, and response option problems.",
    },
    {
      // User message: the survey question to review.
      role: "user",
      content:
        "Review this survey question: In the last 12 months, how often have you used AI tools such as ChatGPT, Claude, Gemini, or Copilot for work or study?",
    },
  ],
  reasoning: {
    effort: "medium",
  },
  // Structured output: the reply must match this JSON schema.
  text: {
    format: {
      type: "json_schema",
      name: "survey_question_review",
      schema: {
        type: "object",
        additionalProperties: false,
        properties: {
          construct: { type: "string" },
          problems: {
            type: "array",
            items: { type: "string" },
          },
          revised_question: { type: "string" },
          reporting_note: { type: "string" },
        },
        required: [
          "construct",
          "problems",
          "revised_question",
          "reporting_note",
        ],
      },
      strict: true,
    },
  },
  // Free-form labels for your own bookkeeping.
  metadata: {
    project: "AI workshop demo",
    task: "survey_question_review",
  },
});

// The generated text, here a JSON string that follows the schema.
console.log(response.output_text);
```
Claude Code, Cursor, Windsurf and similar systems are not just chat interfaces or SDK calls. They combine a model with tools, files, terminals, version control, and an iterative loop.
The model can inspect state, call tools, edit files, run checks, observe failures, and decide what to do next. This is the bridge from "using a model" to delegating a workflow.
Day 3 topic.
What can go wrong, what to log, and what to accept.
Greedy decoding removes sampling randomness, but not numerical variation in inference.
Temperature 0 + greedy decoding does not guarantee the same answer every time. The problem is not the transformer architecture itself, and it is not random sampling once temperature is 0.
The issue is floating-point arithmetic on GPUs. Floating-point addition is not associative: (a+b)+c can differ from a+(b+c), so the order in which numbers are summed matters.
Same prompt → different batch context → different floating-point reduction order → slightly different logits → different chosen token → different completion
(0.1 + 1e20) - 1e20 = 0
0.1 + (1e20 - 1e20) = 0.1
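The two lines above can be checked directly, since TypeScript numbers are 64-bit floats. The sketch also sums the same three values in two different orders, which is a small stand-in for a GPU kernel choosing a different reduction order; `sumInOrder` is a hypothetical helper.

```typescript
const big = 1e20;
const small = 0.1;

const leftFirst = (small + big) - big; // 0.1 is absorbed by 1e20, then cancelled: 0
const rightFirst = small + (big - big); // the large terms cancel first: 0.1

// Sum left to right, exactly in the order given.
function sumInOrder(values: number[]): number {
  let total = 0;
  for (const v of values) total += v;
  return total;
}

const terms = [0.1, 1e20, -1e20];
const forwardSum = sumInOrder(terms); // 0
const reversedSum = sumInOrder([...terms].reverse()); // 0.1
```

Same numbers, different order, different result. Inside a model, discrepancies of this kind are usually tiny, but they can be enough to change which logit is largest.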
Many inference kernels are deterministic run-to-run, but not batch-invariant.
A kernel is batch-invariant if your request gets the same numerical result regardless of what other requests are processed alongside it.
In practice, output can change if the server processes your request alone, in a larger batch, or with a different batch shape.
LLM decoding is path dependent: a tiny numerical difference can flip one token, one token changes the next context, and the whole completion diverges.
Standard serving: dynamic batching → non-batch-invariant kernels → tiny logit differences → output may vary
Deterministic serving: batch-invariant kernels → same numerical path → same logits → same output
Thinking Machines Lab showed deterministic inference is possible: same prompt, temperature 0, batch-invariant deterministic kernels, same completion 1,000/1,000 times.
| Configuration | Time |
|---|---|
| vLLM default | 26 s |
| Deterministic vLLM (unoptimised) | 55 s |
| Deterministic vLLM (improved attention kernel) | 42 s |
Is the answer true?
Mitigation strategies:
Models have snapshots (claude-sonnet-4-20250514) and aliases (claude-sonnet-4-0). An alias points to the latest snapshot and can change under your feet, so pin the snapshot ID.
The context window is finite (8K–2M tokens depending on model). As you fill it, behaviour changes:
If you publish anything that used an LLM, log and report:
| Field | What to record |
|---|---|
| Model | Full name + snapshot ID |
| Temperature | And top-p, if set |
| Max tokens | Both input limit and output limit |
| System prompt | Verbatim, in full |
| User prompt | Template + any variable substitution |
| Seed | If the API supports it |
| Date of call | Models get updated; date anchors the version |
| Full response | Including tool calls, stop reason, usage metadata |
| Interface | Chat / API / SDK / agent — and version |
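One way to make this checklist hard to skip is to log a structured record for every call. The sketch below is a suggestion, not a standard: the interface `LlmRunRecord`, its field names, and the one-JSON-object-per-line format are all assumptions, and the example values are placeholders.

```typescript
// One record per model call, mirroring the reporting checklist above.
interface LlmRunRecord {
  model: string; // full name + snapshot ID, not an alias
  temperature: number;
  topP?: number; // only if set
  maxInputTokens: number;
  maxOutputTokens: number;
  systemPrompt: string; // verbatim, in full
  userPromptTemplate: string;
  userPromptVariables: Record<string, string>;
  seed?: number; // only if the API supports it
  calledAt: string; // ISO 8601 timestamp of the call
  fullResponse: unknown; // including tool calls, stop reason, usage metadata
  interfaceUsed: string; // chat / API / SDK / agent, and version
}

// Serialise as a single line so records can be appended to a log file.
function toLogLine(record: LlmRunRecord): string {
  return JSON.stringify(record);
}

// Placeholder values for illustration only.
const exampleRecord: LlmRunRecord = {
  model: "claude-sonnet-4-20250514",
  temperature: 0,
  maxInputTokens: 8000,
  maxOutputTokens: 1024,
  systemPrompt: "You are a survey methodologist.",
  userPromptTemplate: "Review this survey question: {question}",
  userPromptVariables: { question: "How often have you used AI tools?" },
  calledAt: "2025-01-01T00:00:00.000Z",
  fullResponse: { text: "placeholder", stopReason: "end_turn" },
  interfaceUsed: "SDK (version pinned in the lockfile)",
};

const exampleLogLine = toLogLine(exampleRecord);
```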
Day 2: how to control it — prompts, knowledge, memory, tools.