AI for Survey Researchers
(Part I)

When you type a prompt, what's happening?

NCRM · Paulo Serôdio · ISER, University of Essex

Section 1

What an LLM actually is

A function. Tokens in, probabilities out.

§1 What an LLM is

...a generative neural network

§1 What an LLM is

Tokenization

The goal is to take strings and feed them into a language model. That requires converting text into integers from a fixed vocabulary.

Those integers are the token IDs. They are lookup keys, not meanings by themselves.

string

→

token IDs

→

embedding vectors

→

transformer

The model never sees raw text. It sees vectors retrieved from an embedding table.

The same text round-trips: encode → integer IDs → decode. The model never sees raw text.

Tiktokenizer app

§1 What an LLM is

Why tokenization matters

Many behaviours that look like model intelligence failures trace back to the input representation.

Why can't LLMs spell or reverse strings reliably? Tokenization.
Why are simple arithmetic and punctuation-heavy formats awkward? Tokenization.
Why can non-English text cost more and perform worse? Tokenization.
Why can special strings such as <|endoftext|> behave oddly? Tokenization.

Tokenization lies at the heart of many unexpected LLM behaviours. The neural network may be fine; the representation it receives may be awkward.

Practical payoff: inspect tokenization before blaming the model, especially for survey items with jargon, code, tables, non-English text, or unusual punctuation.

§1 What an LLM is

How tokenizers are built

Character-level tokenization is the naive baseline. BPE starts small, then repeatedly merges frequent adjacent pieces.

Unusual words get split into more sub-word pieces — more tokens, higher cost, weaker representation.

§1 What an LLM is

How does vocabulary work?

A vocabulary is the tokenizer's fixed menu of possible token types. Tokenizers turn text into token IDs: integer numbers from that vocabulary.

Small vocabulary: ["q", "u", "e", "n", "k", "i", "g"]

queen → q + u + e + e + n = 5 token instances

Larger vocabulary: ["queen", "king", "Elizabeth", "Charles"]

queen → queen = 1 token instance

Tokenizer type	Vocabulary size	Tokens needed
Character-level	Small	Many
Word/subword-level	Larger	Fewer

Larger vocabulary = more possible tokens, but shorter token sequences. Smaller vocabulary = fewer possible tokens, but longer token sequences.

§1 What an LLM is

Why do models do/count tokens differently?

Tokenizers are design decisions. They are optimized around the model's training data, target languages, code support, context length, and deployment cost.

Because models do tokenization differently, they also count tokens differently: the same text can become different chunks before the neural network runs.

Vocabulary size and training corpus
Language, code, whitespace, and punctuation priorities
Special tokens and provider chat-message wrappers

GPT-2 struggled more than necessary with Python partly because its tokenizer handled whitespace poorly. Newer tokenizers encode code and multilingual text more efficiently.

Input and output tokens are priced differently. Output tokens cost more, so tokenization directly affects budgets.

Rule of thumb: if the task depends on exact characters, spelling, formatting, or multilingual nuance, inspect the tokens.

§1 What an LLM is

Embeddings

After tokenisation, each token ID is used to look up a vector in the model's embedding table.

This vector is the token's starting representation before the transformer layers process it.

The embedding table starts randomly and is learned during pre-training. Tokens that appear in similar contexts tend to acquire related vectors.

Build tokenizer vocabulary: e.g. 5,000 token types
Assign each token type an integer ID: IDs 0 to 4,999
Create embedding table: 5,000 rows × 768 columns
Initialise the table with small random values
Train the model
Backpropagation updates the table values
After training, each token ID has a learned 768-dimensional vector

§1 What an LLM is

Transformers

A transformer is the building block underneath GPT-style language models. It takes a sequence of token embeddings and repeatedly updates each token's representation using the surrounding context.

The key idea is attention: each token can weight other tokens in the context when deciding what information matters.

Stack many transformer layers, train them to predict the next token, and you get the core architecture behind modern LLMs.

We only need the mental model here: tokens become vectors, vectors pass through transformer layers, and the output becomes next-token probabilities.

Transformer Explainer Interactive attention and next-token visualisation. LLM architecture Visual gallery of modern LLM architectures.

§1 What an LLM is

Pre-training starts with token IDs

Here is an example from FineWeb1. The dataset has been tokenized with a vocabulary of 100,277 token types.

The text is now a long sequence of integer token IDs. Those IDs are what the model trains on.

§1 What an LLM is

Training uses windows of tokens

The neural network takes windows of token IDs and learns to predict the next token.

In this example, the first four tokens are used as context to predict the 5th token.

A larger window means a larger context size: the model can pay attention to more previous tokens when making the next prediction.

§1 What an LLM is

Weights turn context into probabilities

During pre-training, backpropagation updates the network weights to increase the probability of the correct next token.

At each step, the model outputs one probability for every possible token in the vocabulary.

The mathematical expression combines the input token IDs with the learned weights to produce those probabilities.

Network weights produce next-token probabilities

§1 What an LLM is

Next-token prediction

The transformer takes the entire sequence of token embeddings and, at each step, predicts a probability distribution over the next token.

It generates one token at a time. At each step it looks at the prompt plus everything it has already generated, picks a token, appends it, and repeats.

Autoregressive: looking backwards only. No lookahead, no planning. Each token is conditioned on all previous tokens.

DeepSeek V3.1 at temperature 0. "garage" wins at 85%, but "boardroom" (4%), "corporate" (3%) were candidates. Tiny logit differences can flip the choice.

§1 What an LLM is

Temperature and sampling

The model produces logits (raw scores) for every token in the vocabulary. Temperature controls how those logits are converted to probabilities:

Temperature 0 — always pick the highest-probability token (greedy). Deterministic in theory.
Temperature 0.5–0.7 — moderate variation. Good for most tasks.
Temperature 1.0+ — flatter distribution, more creative, less predictable.

Other sampling parameters:

top-p (nucleus) — only consider tokens whose cumulative probability reaches p
top-k — only consider the k most probable tokens
seed — attempt reproducibility (provider-dependent)

Temperature 0 ≠ deterministic in practice. We'll see why in §5.

§1 What an LLM is

Stateless

Every API call to an LLM is stateless: no memory, no continuity, no knowledge of what happened before unless it is explicitly provided again.

ChatGPT and similar systems can feel as if the model remembers past conversations. That memory is an application illusion. The model only sees what is inside the current request; any "memory" is previous messages, saved notes, or summaries being resent as context.

Because the model itself is stateless, intelligence shifts upward into the application. The application decides what to include, what to exclude or summarise, what to retrieve and prioritise, and what to forget.

For research:

Conversation history = part of the prompt
Earlier turns influence later outputs
Compaction (summarising old turns) changes behaviour invisibly
Cost scales with context length

Section 2

How models are trained

Every training decision leaks into your research outputs.

§2 How models are trained

Pre-training

Base models are trained via self-supervised learning: predict the next token across an enormous text corpus.

This is how knowledge gets compressed into the model. Training can cost hundreds of millions or even billions of dollars in compute: terabytes of text are compressed into a particular parameter shape that can generate text programmatically.

They learn language patterns and broad world knowledge, but are not tuned to follow instructions. A base model will complete your text, not answer your question.

Pre-trained models reproduce patterns in their training data — including harmful or biased content. They have no alignment, no safety layer, no instruction-following.

What's in the corpus matters:

Predominantly English, predominantly Western internet
Non-English text is underrepresented → weaker multilingual performance
Survey data, interview transcripts, qualitative research are almost entirely absent
The corpus is the primary source of demographic and ideological skew

Experiment with a base model Llama 3.2 1B base model: pre-trained, not instruction fine-tuned.

§2 How models are trained

Supervised instruction tuning (SFT)

The first post-training step. Fine-tune the base model on curated instruction → response pairs written by humans.

Quality over quantity: relatively small, high-quality datasets yield strong instruction following. This is imitation learning — the model learns to mimic the demonstrations.

After SFT:

More coherent, helpful responses
But: inherits biases of the demonstration data
May still hallucinate, refuse inappropriately, or ignore safety

For research: SFT data is curated by humans with their own cultural priors. If your task relies on the model's "opinions" or "judgments", those judgments originate partly here.

Example instruction-tuning dataset Finance-Instruct-500k on Hugging Face.

§2 How models are trained

Reinforcement learning from human feedback

RLHF as the central alignment loop

RLHF is used for domains where the desired output is partly normative or judgement-based. In many tasks, we do not just want a factually correct answer; we want an answer that is useful, safe, calibrated, non-toxic, non-deceptive, and appropriate to the user's request.

The key object in reinforcement learning is the policy: the system being trained to choose actions. In an LLM, the policy is the model itself. It takes the current context — prompt plus generated tokens — and produces a probability distribution over the next token.

Each generated answer is a trajectory through this policy. The model chooses one token, appends it to the context, chooses the next token, and repeats. RLHF updates the model weights so trajectories judged to be better become more probable in future.

The reward signal usually comes indirectly. Humans compare candidate answers; a reward model learns to predict those preferences; reinforcement learning then optimises the LLM against that reward model. The result is a model whose probability distribution has shifted towards preferred forms of behaviour.

What is being reinforced? Not truth directly, but behaviours that score well under a reward model trained from human comparisons.

Related post-training methods: DPO, RLAIF, RLVR/GRPO.

§2 How models are trained

The three steps of RLHF

1. Supervised warm start

Sample prompts and ask humans to write good responses. These demonstrations fine-tune the pre-trained base model in a supervised way.

2. Reward model

Generate several answers per prompt from the SFT model. Humans rank them; those rankings train a reward model that predicts preference scores.

3. Policy optimisation

Use the reward model to update the SFT model, often with PPO, so higher-scoring responses become more likely while the model stays close to its starting behaviour.

§2 How models are trained

Verifiable rewards

RLVR · GRPO

RLVR uses tasks where correctness can be objectively verified: maths, code, logic. A deterministic verifier returns binary reward — 1 if correct, 0 otherwise.

No human or AI labellers. No subjective bias. Research shows RLVR implicitly incentivises correct chain-of-thought reasoning — because the only way to consistently get right answers is to produce valid intermediate steps.

Post-training is how we get LLMs to behave in particular ways. It refines behaviour after pre-training, and has been more productive than many expected.

GRPO (Group Relative Policy Optimisation) supports RLVR at scale. It estimates advantages without training a separate critic: multiple rollouts per prompt are grouped and whitened to compute relative advantage.

Why this matters: RLHF and RLVR can train models into better long tool-call runs, agent runs, coding, and reasoning. Cursor's Composer 2/2.5, for example, is reported by Cursor as a post-trained system built from Moonshot's Kimi K2.5 checkpoint, not a wholly new model trained from scratch.

Post-training is usually cheaper than pre-training, but not always: high-compute RL systems can still be extremely expensive.

§2 How models are trained

What does training mean for research?

Figure 2 shows that LLM personas place far more variance on the first principal component than human GSS respondents, meaning their answers collapse too strongly onto a single ideological axis.

This is the kind of unidimensionality we should expect from alignment-style post-training: models are rewarded for coherent, normatively legible responses, but real people hold messier, cross-cutting, and only weakly constrained belief systems.

Source: "Synthetic personas distort the structure of human belief systems" by Chris Barrie.

LLM personas show more unidimensional belief structure than human respondents

§2 How models are trained

Training risks for research

Corpus bias

The model's "knowledge" is the training data. If a population, language, or domain is underrepresented, outputs will be weaker — or confidently wrong.

Alignment artefacts

Sycophancy, refusal, verbosity are not bugs — they're trained behaviours. They will systematically affect any study that treats model outputs as data.

Opacity

Training data is undisclosed. Alignment methods are partially documented. You cannot fully audit the instrument you are using. Report what you can; acknowledge what you can't.

Section 3

Inference

Where computation happens, and how models reason at answer time.

§3 Inference

Where inference happens

To run a model locally, you need access to the model's weights: the trained numerical parameters that encode what the model learned. Open-weight models make those parameters downloadable under a licence; closed models keep them inside the provider's infrastructure.

Local

Run the model on your own hardware. Ollama, llama.cpp, vLLM. Your data never leaves your machine.

Requires a capable GPU. Limited to open-weights models.

Provider cloud

OpenAI, Anthropic, Google host the model. You send data to their servers. Subject to their data retention and usage policies.

Easiest option. Best models available here.

Re-hosted

Azure OpenAI, AWS Bedrock, GCP Vertex. Same models, hosted in your cloud region with enterprise data controls.

Zero-retention options. GDPR-compliant deployments possible.

§3 Inference

Open weights vs closed weights

Dimension	Open-weight models	Closed/proprietary models
Examples	DeepSeek V3/V4/R1, Qwen3/Qwen3.5, Kimi K2/K2.6, GLM-4.5/GLM-5, Llama 4, Mistral	GPT-5.5, Claude Opus 4.8, Gemini 3.5
Where they run	Local machine, institutional server, private cloud, or third-party inference provider	Provider-hosted service/API
Access to weights	Model weights are downloadable, subject to licence	Weights are not available
Data control	High if self-hosted; lower if using hosted inference	Depends on provider terms, contract, logging and retention settings
Reproducibility	Stronger if model, tokenizer, quantisation, runtime and prompts are pinned	Weaker: provider may update, retire, or alter models/API behaviour
Auditability	Better: model artefacts and deployment can be inspected/versioned	Limited: model internals and system changes are mostly opaque
Performance	Often competitive; usually behind frontier models on hardest tasks	Best frontier capability usually here
Cost/effort	More setup, hardware and maintenance	Easier to use; pay per use/subscription
Best use case	Sensitive data, reproducible pipelines, institutional control	Highest performance, rapid prototyping, general-purpose assistance

§3 Inference

Models are NOT trained at inference time

Learning during inference — in-context learning — does not change model weights. It changes the context the model conditions on: instructions, prompts, examples, retrieved documents, and prior messages.

Parametric learning updates weights. In-context learning does not — and it absolutely works. Transformers are conditional next-token predictors over a sequence; give them the right sequence and behaviour can change radically without touching the parameters.

Zero-shot: "Classify this review as positive or negative."
Few-shot: "Here are 5 examples. Now classify this one."
Many-shot: 50+ examples in context (only feasible with large context windows).

This is the foundation of prompt engineering:

The intelligence lives in the static parameters, but the apparent capability depends heavily on what is fed into the context window. Context management, prompt engineering, instruction tuning, and few-shot examples exploit that fact.

§3 Inference

Reasoning and inference-time compute

Early chatbots mostly tried to answer directly: prompt in, answer out.

Reasoning models change the allocation of compute. They spend more inference-time computation before producing the final answer: decomposing the problem, checking constraints, exploring alternatives, or using tools.

This does not require a separate symbolic reasoner. Mechanically, the model is still generating tokens. What changes is that some tokens are used to structure the path to the answer, not just to state the answer.

Chain-of-thought was the first visible version of this idea. Modern reasoning models often perform some of this intermediate work in hidden reasoning tokens.

User prompt → hidden reasoning tokens → visible answer tokens. The reasoning is trained, not emergent.

§3 Inference

Reasoning strategies and thinking modes

Strategies

Reasoning strategies are different ways of using extra inference-time computation.

Step-by-step reasoning — generate intermediate steps before answering
Self-consistency — sample several reasoning paths and compare the answers
Tree-of-thoughts — branch across possible solution paths and evaluate them
Verifier or reward model — generate candidate answers, then score or select among them

The common pattern is the same: the system spends more computation before committing to an answer.

Thinking intensity

Thinking intensity controls the reasoning budget: how much intermediate computation the system can spend before answering. More is not always better. On simple tasks, extra search can introduce delay, cost, or unnecessary complications.

Adaptive thinking is the next step: the system decides how much reasoning a request deserves, instead of the user picking a fixed budget.

§3 Inference

From reasoning to acting

Reasoning becomes more powerful when the model can call tools.

ReAct means reasoning + acting: the model alternates between internal reasoning and external actions, such as searching the web, querying a database, running code, or retrieving documents.

PAL means Program-Aided Language Models: instead of solving everything in natural language, the model writes code and lets a reliable executor do the calculation.

This is the bridge from chatbots to agents: inference is no longer just text generation; it becomes a loop of thinking, acting, observing, and deciding when to stop.

What breaks the reasoning loop?

Decoding stops when:

The model emits an end marker
The backend detects the final-answer channel has begun
A reasoning-token limit is reached
The output-token limit is reached
The model hits the maximum context window

If the model runs out of tokens mid-reasoning, the answer may be truncated or degraded. This is silent — you won't necessarily know it happened unless you check the usage metadata.

§3 Inference

Why is inference relevant for research?

Inference is not a neutral button press. The same model can behave differently depending on the prompt, context, tools, reasoning budget, sampling settings, and provider implementation.

Cost

Reasoning tokens, tool calls, and long contexts make inference more expensive than a simple prompt-response exchange.

Reproducibility

"We used GPT-4" is not enough. Results can depend on model version, prompt, settings, tool access, context window, retrieval corpus, and date of access.

Data governance

Local inference can keep data inside your infrastructure; cloud inference sends data to a provider or hosted service.

Auditability: if reasoning, retrieval, or tool use is hidden, researchers need logs, metadata, prompts, model versions, and validation checks to understand what happened.

Section 4

Modes of interacting with LLMs

What you control — and what you don't — at each layer.

§4 Modes of interaction

Consumer chatbot

ChatGPT, Claude.ai, Gemini: the interface most people use. But the interaction is not just "you + model".

On every turn, the app may add or change things you do not see:

System instructions — hidden rules and priorities shape the answer
Model routing / model version — the same product name may point to different model snapshots over time
Conversation management — long chats may be summarised, truncated, or compacted
Tool orchestration — search, file reading, code execution, image analysis, or retrieval may be invoked by the product layer
Memory and personalisation — prior preferences or stored context may be injected into the current chat
Safety and policy layers — outputs may be filtered, rewritten, or refused outside the visible prompt

What this means for research use

You often cannot fully observe or fix:

the exact system prompt
the model snapshot
hidden retrieval or tool calls
how much history was retained
whether memory/personalisation was used
the decoding and serving configuration
product-side updates between runs

A consumer chatbot is a product interface, not a controlled research instrument.

You can study chatbot interactions, but you cannot assume they are exactly reproducible unless the platform exposes and fixes the relevant configuration.

§4 Modes of interaction

The API

Direct access to the model. You send a JSON request, you get a JSON response. Nothing hidden.

You control:

Model name + snapshot version
System prompt (verbatim)
Temperature, top-p, max tokens
Whether tools are available
Structured output schemas

You get back:

The full response
Token usage (input, output, reasoning)
Stop reason
Model fingerprint / ID

curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1",
    "input": [
      {
        "role": "developer",
        "content": "You are a survey methodologist. Review survey questions for clarity, bias, construct validity, recall burden, and response option problems."
      },
      {
        "role": "user",
        "content": "Review this survey question: In the last 12 months, how often have you used AI tools such as ChatGPT, Claude, Gemini, or Copilot for work or study?"
      }
    ],
    "reasoning": {
      "effort": "medium"
    },
    "text": {
      "format": {
        "type": "json_schema",
        "name": "survey_question_review",
        "schema": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "construct": { "type": "string" },
            "problems": {
              "type": "array",
              "items": { "type": "string" }
            },
            "revised_question": { "type": "string" },
            "reporting_note": { "type": "string" }
          },
          "required": [
            "construct",
            "problems",
            "revised_question",
            "reporting_note"
          ]
        },
        "strict": true
      }
    },
    "metadata": {
      "project": "AI workshop demo",
      "task": "survey_question_review"
    }
  }'

§4 Modes of interaction

The SDK

SDK

A thin wrapper around the API in Python, TypeScript, etc. Same control, less boilerplate.

SDKs make it easier to add batching, retries, streaming, schema validation, and structured outputs. But structured outputs are not a separate interaction mode: they are an option available through APIs and SDKs.

Structured outputs: force the model to return data in a specific schema — JSON, typed objects, validated against a Zod/Pydantic schema. Essential for pipelines that need to parse model outputs programmatically.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const response = await client.responses.create({
  model: "gpt-5.1",
  input: [
    {
      role: "developer",
      content:
        "You are a survey methodologist. Review survey questions for clarity, bias, construct validity, recall burden, and response option problems.",
    },
    {
      role: "user",
      content:
        "Review this survey question: In the last 12 months, how often have you used AI tools such as ChatGPT, Claude, Gemini, or Copilot for work or study?",
    },
  ],
  reasoning: {
    effort: "medium",
  },
  text: {
    format: {
      type: "json_schema",
      name: "survey_question_review",
      schema: {
        type: "object",
        additionalProperties: false,
        properties: {
          construct: { type: "string" },
          problems: {
            type: "array",
            items: { type: "string" },
          },
          revised_question: { type: "string" },
          reporting_note: { type: "string" },
        },
        required: [
          "construct",
          "problems",
          "revised_question",
          "reporting_note",
        ],
      },
      strict: true,
    },
  },
  metadata: {
    project: "AI workshop demo",
    task: "survey_question_review",
  },
});

console.log(response.output_text);

§4 Modes of interaction

Third mode: agentic IDEs

Claude Code, Cursor, Windsurf and similar systems are not just chat interfaces or SDK calls. They combine a model with tools, files, terminals, version control, and an iterative loop.

The model can inspect state, call tools, edit files, run checks, observe failures, and decide what to do next. This is the bridge from "using a model" to delegating a workflow.

Research implication: agentic IDEs are powerful but harder to reproduce. The agent makes decisions about context management, tool use, retries, and stopping criteria that are difficult to log fully.

Day 3 topic.

§4 Modes of interaction

Live: calling the API

Anthropic key loaded from .env

Request

Response

// Click Run to see the response.

Section 5

Reproducibility and replicability

What can go wrong, what to log, and what to accept.

§5 Reproducibility

Why LLMs can still be non-deterministic at temperature 0

Greedy decoding removes sampling randomness, but not numerical variation in inference.

Temperature 0 + greedy decoding does not guarantee the same answer every time. The problem is not the transformer architecture itself, and it is not random sampling once temperature is 0.

The issue is floating-point arithmetic on GPUs. Floating-point addition is not associative: (a+b)+c ≠ a+(b+c).

Key takeaway: Temperature 0 removes stochastic decoding, but not numerical non-determinism.

Same prompt

↓

Different batch context

↓

Different floating-point reduction order

↓

Slightly different logits

↓

Different chosen token

↓

Different completion

(0.1 + 1e20) - 1e20 = 0

0.1 + (1e20 - 1e20) = 0.1

§5 Reproducibility

The real issue: batch invariance — and yes, it can be fixed

Many inference kernels are deterministic run-to-run, but not batch-invariant.

A kernel is batch-invariant if your request gets the same numerical result regardless of what other requests are processed alongside it.

In practice, output can change if the server processes your request alone, in a larger batch, or with a different batch shape.

LLM decoding is path dependent: a tiny numerical difference can flip one token, one token changes the next context, and the whole completion diverges.

Standard serving: dynamic batching → non-batch-invariant kernels → tiny logit differences → output may vary

Deterministic serving: batch-invariant kernels → same numerical path → same logits → same output

Thinking Machines Lab showed deterministic inference is possible: same prompt, temperature 0, batch-invariant deterministic kernels, same completion 1,000/1,000 times.

Configuration	Time
vLLM default	26 s
Deterministic vLLM (unoptimised)	55 s
Deterministic vLLM (improved attention kernel)	42 s

Key takeaway: Non-determinism is mainly an engineering property of the inference stack, not an unavoidable property of transformers. Deterministic LLM inference is possible, but slower.

§5 Reproducibility

Hallucinations

Is the answer true?

Mitigation strategies:

RAG — Retrieval-Augmented Generation: ground the model in source documents
Decoding interventions — temperature, top-p, top-k: constrain the probability distribution at inference time
Chain-of-thought — reasoning tokens help the model check its own work

Training methods — RLHF reduces hallucination; abstention calibration teaches the model to say "I don't know"
Factuality scoring — external models or tools verify claims
Evaluation — systematic benchmarks for factual accuracy

No mitigation eliminates hallucination entirely. If your research depends on factual accuracy, you must verify outputs independently.

§5 Reproducibility

Model versioning and context effects

Versioning

Models have snapshots (claude-sonnet-4-20250514) and aliases (claude-sonnet-4). The alias points to the latest snapshot and can change under your feet.

New models ≠ better-tuned releases. Architecture, training data, and capabilities can change completely.
Model cards document known capabilities and limitations — read them.
Prompts that worked on one snapshot may fail on the next.

Context-window effects

The context window is finite (8K–2M tokens depending on model). As you fill it, behaviour changes:

Information at the beginning and end of the context is recalled better than the middle ("lost in the middle")
Longer contexts cost more and run slower
Models may silently degrade on very long inputs

Limits of reproducibility: Does your task rely on parametric memory (what the model learned in training) or on the context window (what you provide)? The former is harder to reproduce; the latter is fully in your control.

§5 Reproducibility

What to report

If you publish anything that used an LLM, log and report:

Model	Full name + snapshot ID
Temperature	And top-p, if set
Max tokens	Both input limit and output limit
System prompt	Verbatim, in full
User prompt	Template + any variable substitution
Seed	If the API supports it
Date of call	Models get updated; date anchors the version
Full response	Including tool calls, stop reason, usage metadata
Interface	Chat / API / SDK / agent — and version

AAPOR-style transparency: treat the LLM call like a survey instrument. Document the instrument (prompts), the sampling method (temperature), the fieldwork period (dates), and the response data (outputs). If you wouldn't publish a survey without a methods section, don't publish LLM results without one.

Today: what's underneath the chat.

Day 2: how to control it — prompts, knowledge, memory, tools.

AI for Survey Researchers(Part I)

What an LLM actually is

...a generative neural network

Tokenization

Why tokenization matters

How tokenizers are built

How does vocabulary work?

Why do models do/count tokens differently?

Embeddings

Transformers

Pre-training starts with token IDs

Training uses windows of tokens

Weights turn context into probabilities

Next-token prediction

Temperature and sampling

Stateless

How models are trained

Pre-training

Supervised instruction tuning (SFT)

Reinforcement learning from human feedback

The three steps of RLHF

1. Supervised warm start

2. Reward model

3. Policy optimisation

Verifiable rewards

What does training mean for research?

Training risks for research

Corpus bias

Alignment artefacts

Opacity

Inference

Where inference happens

Local

Provider cloud

Re-hosted

Open weights vs closed weights

Models are NOT trained at inference time

Reasoning and inference-time compute

Reasoning strategies and thinking modes

Strategies

Thinking intensity

From reasoning to acting

What breaks the reasoning loop?

Why is inference relevant for research?

Cost

Reproducibility

Data governance

Modes of interacting with LLMs

Consumer chatbot

The API

The SDK

SDK

Third mode: agentic IDEs

Live: calling the API

Reproducibility and replicability

Why LLMs can still be non-deterministic at temperature 0

The real issue: batch invariance — and yes, it can be fixed

Hallucinations

Model versioning and context effects

Versioning

Context-window effects

What to report

Today: what's underneath the chat.

AI for Survey Researchers
(Part I)