AI for Survey Researchers
(Part I)

When you type a prompt, what's happening?
NCRM · Paulo Serôdio · ISER, University of Essex

What an LLM actually is

A function. Tokens in, probabilities out.

...a generative neural network

LLM as a function
Tokens in, probabilities out
Mental model of an LLM

Tokenization

The goal is to take strings and feed them into a language model. That requires converting text into integers from a fixed vocabulary.

Those integers are the token IDs. They are lookup keys, not meanings by themselves.

string
token IDs
embedding vectors
transformer
The model never sees raw text. It sees vectors retrieved from an embedding table.
Tokenization encode/decode cycle

The same text round-trips: encode → integer IDs → decode. The model never sees raw text.

Tiktokenizer app

Why tokenization matters

Many behaviours that look like model intelligence failures trace back to the input representation.

  • Why can't LLMs spell or reverse strings reliably? Tokenization.
  • Why are simple arithmetic and punctuation-heavy formats awkward? Tokenization.
  • Why can non-English text cost more and perform worse? Tokenization.
  • Why can special strings such as <|endoftext|> behave oddly? Tokenization.
Tokenization lies at the heart of many unexpected LLM behaviours. The neural network may be fine; the representation it receives may be awkward.
Practical payoff: inspect tokenization before blaming the model, especially for survey items with jargon, code, tables, non-English text, or unusual punctuation.

How tokenizers are built

Tokenizer training: BPE process

Character-level tokenization is the naive baseline. BPE starts small, then repeatedly merges frequent adjacent pieces.

Unusual words tokenization

Unusual words get split into more sub-word pieces — more tokens, higher cost, weaker representation.

How does vocabulary work?

A vocabulary is the tokenizer's fixed menu of possible token types. Tokenizers turn text into token IDs: integer numbers from that vocabulary.

Small vocabulary: ["q", "u", "e", "n", "k", "i", "g"]

queenq + u + e + e + n = 5 token instances

Larger vocabulary: ["queen", "king", "Elizabeth", "Charles"]

queenqueen = 1 token instance

Vocabulary size comparison
Tokenizer typeVocabulary sizeTokens needed
Character-levelSmallMany
Word/subword-levelLargerFewer
Larger vocabulary = more possible tokens, but shorter token sequences. Smaller vocabulary = fewer possible tokens, but longer token sequences.

Why do models do/count tokens differently?

Tokenizers are design decisions. They are optimized around the model's training data, target languages, code support, context length, and deployment cost.

Because models do tokenization differently, they also count tokens differently: the same text can become different chunks before the neural network runs.

  • Vocabulary size and training corpus
  • Language, code, whitespace, and punctuation priorities
  • Special tokens and provider chat-message wrappers

GPT-2 struggled more than necessary with Python partly because its tokenizer handled whitespace poorly. Newer tokenizers encode code and multilingual text more efficiently.

Charged by token

Input and output tokens are priced differently. Output tokens cost more, so tokenization directly affects budgets.

Rule of thumb: if the task depends on exact characters, spelling, formatting, or multilingual nuance, inspect the tokens.

Embeddings

After tokenisation, each token ID is used to look up a vector in the model's embedding table.

This vector is the token's starting representation before the transformer layers process it.

The embedding table starts randomly and is learned during pre-training. Tokens that appear in similar contexts tend to acquire related vectors.

  1. Build tokenizer vocabulary: e.g. 5,000 token types
  2. Assign each token type an integer ID: IDs 0 to 4,999
  3. Create embedding table: 5,000 rows × 768 columns
  4. Initialise the table with small random values
  5. Train the model
  6. Backpropagation updates the table values
  7. After training, each token ID has a learned 768-dimensional vector

Transformers

A transformer is the building block underneath GPT-style language models. It takes a sequence of token embeddings and repeatedly updates each token's representation using the surrounding context.

The key idea is attention: each token can weight other tokens in the context when deciding what information matters.

Stack many transformer layers, train them to predict the next token, and you get the core architecture behind modern LLMs.

We only need the mental model here: tokens become vectors, vectors pass through transformer layers, and the output becomes next-token probabilities.

Pre-training starts with token IDs

Here is an example from FineWeb1. The dataset has been tokenized with a vocabulary of 100,277 token types.

The text is now a long sequence of integer token IDs. Those IDs are what the model trains on.

FineWeb1 text tokenized into token IDs

Training uses windows of tokens

The neural network takes windows of token IDs and learns to predict the next token.

In this example, the first four tokens are used as context to predict the 5th token.

A larger window means a larger context size: the model can pay attention to more previous tokens when making the next prediction.

Training window predicts the next token

Weights turn context into probabilities

During pre-training, backpropagation updates the network weights to increase the probability of the correct next token.

At each step, the model outputs one probability for every possible token in the vocabulary.

The mathematical expression combines the input token IDs with the learned weights to produce those probabilities.

Network weights produce next-token probabilities

Next-token prediction

The transformer takes the entire sequence of token embeddings and, at each step, predicts a probability distribution over the next token.

Autoregressive generation loop

It generates one token at a time. At each step it looks at the prompt plus everything it has already generated, picks a token, appends it, and repeats.

Autoregressive: looking backwards only. No lookahead, no planning. Each token is conditioned on all previous tokens.
DeepSeek next-token probabilities

DeepSeek V3.1 at temperature 0. "garage" wins at 85%, but "boardroom" (4%), "corporate" (3%) were candidates. Tiny logit differences can flip the choice.

Temperature and sampling

The model produces logits (raw scores) for every token in the vocabulary. Temperature controls how those logits are converted to probabilities:

  • Temperature 0 — always pick the highest-probability token (greedy). Deterministic in theory.
  • Temperature 0.5–0.7 — moderate variation. Good for most tasks.
  • Temperature 1.0+ — flatter distribution, more creative, less predictable.

Other sampling parameters:

  • top-p (nucleus) — only consider tokens whose cumulative probability reaches p
  • top-k — only consider the k most probable tokens
  • seed — attempt reproducibility (provider-dependent)
Temperature 0 ≠ deterministic in practice. We'll see why in §5.

Stateless

Every API call to an LLM is stateless: no memory, no continuity, no knowledge of what happened before unless it is explicitly provided again.

ChatGPT and similar systems can feel as if the model remembers past conversations. That memory is an application illusion. The model only sees what is inside the current request; any "memory" is previous messages, saved notes, or summaries being resent as context.

Because the model itself is stateless, intelligence shifts upward into the application. The application decides what to include, what to exclude or summarise, what to retrieve and prioritise, and what to forget.

For research:
  • Conversation history = part of the prompt
  • Earlier turns influence later outputs
  • Compaction (summarising old turns) changes behaviour invisibly
  • Cost scales with context length

How models are trained

Every training decision leaks into your research outputs.

Pre-training

Base models are trained via self-supervised learning: predict the next token across an enormous text corpus.

This is how knowledge gets compressed into the model. Training can cost hundreds of millions or even billions of dollars in compute: terabytes of text are compressed into a particular parameter shape that can generate text programmatically.

They learn language patterns and broad world knowledge, but are not tuned to follow instructions. A base model will complete your text, not answer your question.

Pre-trained models reproduce patterns in their training data — including harmful or biased content. They have no alignment, no safety layer, no instruction-following.

What's in the corpus matters:

  • Predominantly English, predominantly Western internet
  • Non-English text is underrepresented → weaker multilingual performance
  • Survey data, interview transcripts, qualitative research are almost entirely absent
  • The corpus is the primary source of demographic and ideological skew
Pre-training process

Supervised instruction tuning (SFT)

The first post-training step. Fine-tune the base model on curated instruction → response pairs written by humans.

Quality over quantity: relatively small, high-quality datasets yield strong instruction following. This is imitation learning — the model learns to mimic the demonstrations.

After SFT:

  • More coherent, helpful responses
  • But: inherits biases of the demonstration data
  • May still hallucinate, refuse inappropriately, or ignore safety
Fine-tuning process
For research: SFT data is curated by humans with their own cultural priors. If your task relies on the model's "opinions" or "judgments", those judgments originate partly here.

Reinforcement learning from human feedback

RLHF as the central alignment loop

RLHF is used for domains where the desired output is partly normative or judgement-based. In many tasks, we do not just want a factually correct answer; we want an answer that is useful, safe, calibrated, non-toxic, non-deceptive, and appropriate to the user's request.

The key object in reinforcement learning is the policy: the system being trained to choose actions. In an LLM, the policy is the model itself. It takes the current context — prompt plus generated tokens — and produces a probability distribution over the next token.

Each generated answer is a trajectory through this policy. The model chooses one token, appends it to the context, chooses the next token, and repeats. RLHF updates the model weights so trajectories judged to be better become more probable in future.

The reward signal usually comes indirectly. Humans compare candidate answers; a reward model learns to predict those preferences; reinforcement learning then optimises the LLM against that reward model. The result is a model whose probability distribution has shifted towards preferred forms of behaviour.

Alignment diagram
What is being reinforced? Not truth directly, but behaviours that score well under a reward model trained from human comparisons.

Related post-training methods: DPO, RLAIF, RLVR/GRPO.

The three steps of RLHF

RLHF step 1

1. Supervised warm start

Sample prompts and ask humans to write good responses. These demonstrations fine-tune the pre-trained base model in a supervised way.

RLHF step 2

2. Reward model

Generate several answers per prompt from the SFT model. Humans rank them; those rankings train a reward model that predicts preference scores.

RLHF step 3

3. Policy optimisation

Use the reward model to update the SFT model, often with PPO, so higher-scoring responses become more likely while the model stays close to its starting behaviour.

Verifiable rewards

RLVR · GRPO

RLVR uses tasks where correctness can be objectively verified: maths, code, logic. A deterministic verifier returns binary reward — 1 if correct, 0 otherwise.

No human or AI labellers. No subjective bias. Research shows RLVR implicitly incentivises correct chain-of-thought reasoning — because the only way to consistently get right answers is to produce valid intermediate steps.

Post-training is how we get LLMs to behave in particular ways. It refines behaviour after pre-training, and has been more productive than many expected.

GRPO (Group Relative Policy Optimisation) supports RLVR at scale. It estimates advantages without training a separate critic: multiple rollouts per prompt are grouped and whitened to compute relative advantage.

Why this matters: RLHF and RLVR can train models into better long tool-call runs, agent runs, coding, and reasoning. Cursor's Composer 2/2.5, for example, is reported by Cursor as a post-trained system built from Moonshot's Kimi K2.5 checkpoint, not a wholly new model trained from scratch.

Post-training is usually cheaper than pre-training, but not always: high-compute RL systems can still be extremely expensive.

What does training mean for research?

Figure 2 shows that LLM personas place far more variance on the first principal component than human GSS respondents, meaning their answers collapse too strongly onto a single ideological axis.

This is the kind of unidimensionality we should expect from alignment-style post-training: models are rewarded for coherent, normatively legible responses, but real people hold messier, cross-cutting, and only weakly constrained belief systems.

Source: "Synthetic personas distort the structure of human belief systems" by Chris Barrie.

LLM personas show more unidimensional belief structure than human respondents

Training risks for research

Corpus bias

The model's "knowledge" is the training data. If a population, language, or domain is underrepresented, outputs will be weaker — or confidently wrong.

Alignment artefacts

Sycophancy, refusal, verbosity are not bugs — they're trained behaviours. They will systematically affect any study that treats model outputs as data.

Opacity

Training data is undisclosed. Alignment methods are partially documented. You cannot fully audit the instrument you are using. Report what you can; acknowledge what you can't.

Inference

Where computation happens, and how models reason at answer time.

Where inference happens

To run a model locally, you need access to the model's weights: the trained numerical parameters that encode what the model learned. Open-weight models make those parameters downloadable under a licence; closed models keep them inside the provider's infrastructure.

Local

Run the model on your own hardware. Ollama, llama.cpp, vLLM. Your data never leaves your machine.

Requires a capable GPU. Limited to open-weights models.

Provider cloud

OpenAI, Anthropic, Google host the model. You send data to their servers. Subject to their data retention and usage policies.

Easiest option. Best models available here.

Re-hosted

Azure OpenAI, AWS Bedrock, GCP Vertex. Same models, hosted in your cloud region with enterprise data controls.

Zero-retention options. GDPR-compliant deployments possible.

Open weights vs closed weights

DimensionOpen-weight modelsClosed/proprietary models
ExamplesDeepSeek V3/V4/R1, Qwen3/Qwen3.5, Kimi K2/K2.6, GLM-4.5/GLM-5, Llama 4, MistralGPT-5.5, Claude Opus 4.8, Gemini 3.5
Where they runLocal machine, institutional server, private cloud, or third-party inference providerProvider-hosted service/API
Access to weightsModel weights are downloadable, subject to licenceWeights are not available
Data controlHigh if self-hosted; lower if using hosted inferenceDepends on provider terms, contract, logging and retention settings
ReproducibilityStronger if model, tokenizer, quantisation, runtime and prompts are pinnedWeaker: provider may update, retire, or alter models/API behaviour
AuditabilityBetter: model artefacts and deployment can be inspected/versionedLimited: model internals and system changes are mostly opaque
PerformanceOften competitive; usually behind frontier models on hardest tasksBest frontier capability usually here
Cost/effortMore setup, hardware and maintenanceEasier to use; pay per use/subscription
Best use caseSensitive data, reproducible pipelines, institutional controlHighest performance, rapid prototyping, general-purpose assistance

Models are NOT trained at inference time

Learning during inference — in-context learning — does not change model weights. It changes the context the model conditions on: instructions, prompts, examples, retrieved documents, and prior messages.

Parametric learning updates weights. In-context learning does not — and it absolutely works. Transformers are conditional next-token predictors over a sequence; give them the right sequence and behaviour can change radically without touching the parameters.

Zero-shot: "Classify this review as positive or negative."
Few-shot: "Here are 5 examples. Now classify this one."
Many-shot: 50+ examples in context (only feasible with large context windows).

This is the foundation of prompt engineering:

The intelligence lives in the static parameters, but the apparent capability depends heavily on what is fed into the context window. Context management, prompt engineering, instruction tuning, and few-shot examples exploit that fact.

Reasoning and inference-time compute

Early chatbots mostly tried to answer directly: prompt in, answer out.

Reasoning models change the allocation of compute. They spend more inference-time computation before producing the final answer: decomposing the problem, checking constraints, exploring alternatives, or using tools.

This does not require a separate symbolic reasoner. Mechanically, the model is still generating tokens. What changes is that some tokens are used to structure the path to the answer, not just to state the answer.

Chain-of-thought was the first visible version of this idea. Modern reasoning models often perform some of this intermediate work in hidden reasoning tokens.

Hidden reasoning tokens flow

User prompt → hidden reasoning tokens → visible answer tokens. The reasoning is trained, not emergent.

Reasoning strategies and thinking modes

Strategies

Reasoning strategies are different ways of using extra inference-time computation.

  • Step-by-step reasoning — generate intermediate steps before answering
  • Self-consistency — sample several reasoning paths and compare the answers
  • Tree-of-thoughts — branch across possible solution paths and evaluate them
  • Verifier or reward model — generate candidate answers, then score or select among them

The common pattern is the same: the system spends more computation before committing to an answer.

Thinking intensity

Thinking intensity controls the reasoning budget: how much intermediate computation the system can spend before answering. More is not always better. On simple tasks, extra search can introduce delay, cost, or unnecessary complications.
Adaptive thinking is the next step: the system decides how much reasoning a request deserves, instead of the user picking a fixed budget.

From reasoning to acting

Reasoning becomes more powerful when the model can call tools.

ReAct means reasoning + acting: the model alternates between internal reasoning and external actions, such as searching the web, querying a database, running code, or retrieving documents.

PAL means Program-Aided Language Models: instead of solving everything in natural language, the model writes code and lets a reliable executor do the calculation.

This is the bridge from chatbots to agents: inference is no longer just text generation; it becomes a loop of thinking, acting, observing, and deciding when to stop.

What breaks the reasoning loop?

Decoding stops when:

  • The model emits an end marker
  • The backend detects the final-answer channel has begun
  • A reasoning-token limit is reached
  • The output-token limit is reached
  • The model hits the maximum context window
If the model runs out of tokens mid-reasoning, the answer may be truncated or degraded. This is silent — you won't necessarily know it happened unless you check the usage metadata.

Why is inference relevant for research?

Inference is not a neutral button press. The same model can behave differently depending on the prompt, context, tools, reasoning budget, sampling settings, and provider implementation.

Cost

Reasoning tokens, tool calls, and long contexts make inference more expensive than a simple prompt-response exchange.

Reproducibility

"We used GPT-4" is not enough. Results can depend on model version, prompt, settings, tool access, context window, retrieval corpus, and date of access.

Data governance

Local inference can keep data inside your infrastructure; cloud inference sends data to a provider or hosted service.

Auditability: if reasoning, retrieval, or tool use is hidden, researchers need logs, metadata, prompts, model versions, and validation checks to understand what happened.

Modes of interacting with LLMs

What you control — and what you don't — at each layer.

Consumer chatbot

ChatGPT, Claude.ai, Gemini: the interface most people use. But the interaction is not just "you + model".

On every turn, the app may add or change things you do not see:

  • System instructions — hidden rules and priorities shape the answer
  • Model routing / model version — the same product name may point to different model snapshots over time
  • Conversation management — long chats may be summarised, truncated, or compacted
  • Tool orchestration — search, file reading, code execution, image analysis, or retrieval may be invoked by the product layer
  • Memory and personalisation — prior preferences or stored context may be injected into the current chat
  • Safety and policy layers — outputs may be filtered, rewritten, or refused outside the visible prompt
What this means for research use

You often cannot fully observe or fix:

  • the exact system prompt
  • the model snapshot
  • hidden retrieval or tool calls
  • how much history was retained
  • whether memory/personalisation was used
  • the decoding and serving configuration
  • product-side updates between runs
A consumer chatbot is a product interface, not a controlled research instrument.

You can study chatbot interactions, but you cannot assume they are exactly reproducible unless the platform exposes and fixes the relevant configuration.

The API

Direct access to the model. You send a JSON request, you get a JSON response. Nothing hidden.

You control:

  • Model name + snapshot version
  • System prompt (verbatim)
  • Temperature, top-p, max tokens
  • Whether tools are available
  • Structured output schemas

You get back:

  • The full response
  • Token usage (input, output, reasoning)
  • Stop reason
  • Model fingerprint / ID
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1",
    "input": [
      {
        "role": "developer",
        "content": "You are a survey methodologist. Review survey questions for clarity, bias, construct validity, recall burden, and response option problems."
      },
      {
        "role": "user",
        "content": "Review this survey question: In the last 12 months, how often have you used AI tools such as ChatGPT, Claude, Gemini, or Copilot for work or study?"
      }
    ],
    "reasoning": {
      "effort": "medium"
    },
    "text": {
      "format": {
        "type": "json_schema",
        "name": "survey_question_review",
        "schema": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "construct": { "type": "string" },
            "problems": {
              "type": "array",
              "items": { "type": "string" }
            },
            "revised_question": { "type": "string" },
            "reporting_note": { "type": "string" }
          },
          "required": [
            "construct",
            "problems",
            "revised_question",
            "reporting_note"
          ]
        },
        "strict": true
      }
    },
    "metadata": {
      "project": "AI workshop demo",
      "task": "survey_question_review"
    }
  }'

The SDK

SDK

A thin wrapper around the API in Python, TypeScript, etc. Same control, less boilerplate.

SDKs make it easier to add batching, retries, streaming, schema validation, and structured outputs. But structured outputs are not a separate interaction mode: they are an option available through APIs and SDKs.

Structured outputs: force the model to return data in a specific schema — JSON, typed objects, validated against a Zod/Pydantic schema. Essential for pipelines that need to parse model outputs programmatically.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const response = await client.responses.create({
  model: "gpt-5.1",
  input: [
    {
      role: "developer",
      content:
        "You are a survey methodologist. Review survey questions for clarity, bias, construct validity, recall burden, and response option problems.",
    },
    {
      role: "user",
      content:
        "Review this survey question: In the last 12 months, how often have you used AI tools such as ChatGPT, Claude, Gemini, or Copilot for work or study?",
    },
  ],
  reasoning: {
    effort: "medium",
  },
  text: {
    format: {
      type: "json_schema",
      name: "survey_question_review",
      schema: {
        type: "object",
        additionalProperties: false,
        properties: {
          construct: { type: "string" },
          problems: {
            type: "array",
            items: { type: "string" },
          },
          revised_question: { type: "string" },
          reporting_note: { type: "string" },
        },
        required: [
          "construct",
          "problems",
          "revised_question",
          "reporting_note",
        ],
      },
      strict: true,
    },
  },
  metadata: {
    project: "AI workshop demo",
    task: "survey_question_review",
  },
});

console.log(response.output_text);

Third mode: agentic IDEs

Claude Code, Cursor, Windsurf and similar systems are not just chat interfaces or SDK calls. They combine a model with tools, files, terminals, version control, and an iterative loop.

The model can inspect state, call tools, edit files, run checks, observe failures, and decide what to do next. This is the bridge from "using a model" to delegating a workflow.

Research implication: agentic IDEs are powerful but harder to reproduce. The agent makes decisions about context management, tool use, retries, and stopping criteria that are difficult to log fully.

Day 3 topic.

Live: calling the API

Anthropic key loaded from .env
Request
Response
// Click Run to see the response.

Reproducibility and replicability

What can go wrong, what to log, and what to accept.

Why LLMs can still be non-deterministic at temperature 0

Greedy decoding removes sampling randomness, but not numerical variation in inference.

Temperature 0 + greedy decoding does not guarantee the same answer every time. The problem is not the transformer architecture itself, and it is not random sampling once temperature is 0.

The issue is floating-point arithmetic on GPUs. Floating-point addition is not associative: (a+b)+c ≠ a+(b+c).

Key takeaway: Temperature 0 removes stochastic decoding, but not numerical non-determinism.

Same prompt

Different batch context

Different floating-point reduction order

Slightly different logits

Different chosen token

Different completion

(0.1 + 1e20) - 1e20 = 0

0.1 + (1e20 - 1e20) = 0.1

The real issue: batch invariance — and yes, it can be fixed

Many inference kernels are deterministic run-to-run, but not batch-invariant.

A kernel is batch-invariant if your request gets the same numerical result regardless of what other requests are processed alongside it.

In practice, output can change if the server processes your request alone, in a larger batch, or with a different batch shape.

LLM decoding is path dependent: a tiny numerical difference can flip one token, one token changes the next context, and the whole completion diverges.

Standard serving: dynamic batching → non-batch-invariant kernels → tiny logit differences → output may vary

Deterministic serving: batch-invariant kernels → same numerical path → same logits → same output

Thinking Machines Lab showed deterministic inference is possible: same prompt, temperature 0, batch-invariant deterministic kernels, same completion 1,000/1,000 times.

ConfigurationTime
vLLM default26 s
Deterministic vLLM (unoptimised)55 s
Deterministic vLLM (improved attention kernel)42 s
Key takeaway: Non-determinism is mainly an engineering property of the inference stack, not an unavoidable property of transformers. Deterministic LLM inference is possible, but slower.

Hallucinations

Is the answer true?

Mitigation strategies:

  • RAG — Retrieval-Augmented Generation: ground the model in source documents
  • Decoding interventions — temperature, top-p, top-k: constrain the probability distribution at inference time
  • Chain-of-thought — reasoning tokens help the model check its own work
  • Training methods — RLHF reduces hallucination; abstention calibration teaches the model to say "I don't know"
  • Factuality scoring — external models or tools verify claims
  • Evaluation — systematic benchmarks for factual accuracy
No mitigation eliminates hallucination entirely. If your research depends on factual accuracy, you must verify outputs independently.

Model versioning and context effects

Versioning

Models have snapshots (claude-sonnet-4-20250514) and aliases (claude-sonnet-4). The alias points to the latest snapshot and can change under your feet.

  • New models ≠ better-tuned releases. Architecture, training data, and capabilities can change completely.
  • Model cards document known capabilities and limitations — read them.
  • Prompts that worked on one snapshot may fail on the next.

Context-window effects

The context window is finite (8K–2M tokens depending on model). As you fill it, behaviour changes:

  • Information at the beginning and end of the context is recalled better than the middle ("lost in the middle")
  • Longer contexts cost more and run slower
  • Models may silently degrade on very long inputs
Limits of reproducibility: Does your task rely on parametric memory (what the model learned in training) or on the context window (what you provide)? The former is harder to reproduce; the latter is fully in your control.

What to report

If you publish anything that used an LLM, log and report:

ModelFull name + snapshot ID
TemperatureAnd top-p, if set
Max tokensBoth input limit and output limit
System promptVerbatim, in full
User promptTemplate + any variable substitution
SeedIf the API supports it
Date of callModels get updated; date anchors the version
Full responseIncluding tool calls, stop reason, usage metadata
InterfaceChat / API / SDK / agent — and version
AAPOR-style transparency: treat the LLM call like a survey instrument. Document the instrument (prompts), the sampling method (temperature), the fieldwork period (dates), and the response data (outputs). If you wouldn't publish a survey without a methods section, don't publish LLM results without one.

Today: what's underneath the chat.

Day 2: how to control it — prompts, knowledge, memory, tools.