AI Glossary
The terms behind modern AI, defined
Plain-English definitions of the words that show up across Chain of Thought. Each term links down into the explainers, topics, and episodes that go deeper.
Concepts
The core ideas behind modern AI.
- Agent Memory Agent memory is what an AI agent keeps and reuses beyond a single turn — working memory in the context window, plus longer-term stores it can write to and retrieve from later. It's how an agent stops starting every session from zero.
- Chain-of-Thought Prompting Chain-of-thought prompting asks a model to work through its reasoning step by step before giving a final answer. Spelling out the intermediate steps often improves accuracy on multi-step problems like math, logic, and planning.
- Compound AI Systems A compound AI system solves a task with multiple components — several model calls, retrieval, tools, and control logic — rather than a single prompt to a single model. Most real production AI is compound, which is why reliability is a systems problem, not a model problem.
- Context Engineering Context engineering is the discipline of deciding what information goes into a model's context window for a task — which documents, which history, which tool output — and how fresh and trustworthy it is. As models commoditize, it's where a lot of the durable advantage now lives.
- Context Poisoning Context poisoning is when bad, stale, or excessive information in a model's context window degrades its reasoning — the agent gets buried in irrelevant tokens or misled by wrong data, and its answers get worse even though the model is fine.
- Context Window The context window is the amount of text a model can take in at once — the prompt, the conversation so far, and any documents or tool output you include. Everything the model can 'see' for a given response has to fit inside it.
- Embeddings Embeddings are numerical representations of text, images, or other data as vectors, where things with similar meaning land close together. They're what lets a system search by meaning instead of by exact keyword.
- Explainability Explainability is how well you can understand why an AI system produced a given output. It matters most where decisions need to be justified — lending, hiring, healthcare — and it's hard for large models, whose reasoning isn't transparent just because they can narrate a plausible-sounding rationale.
- Fine-Tuning Fine-tuning continues training a pretrained model on your own examples so it gets better at a specific task, tone, or format. It changes the model's weights, unlike prompting or RAG, which change what you feed it.
- FlashAttention FlashAttention is an optimized way to compute a transformer's attention that's far more memory-efficient, by avoiding writing the huge intermediate attention matrix to memory. It makes longer context windows and faster training practical without changing the model's results.
- Frontier Model A frontier model is one of the most capable AI models available at a given moment — the latest flagship releases from the major labs that set the current ceiling on what's possible. The label moves: today's frontier model is next year's baseline.
- Inference Inference is running a trained model to produce output: the part that happens every time a user sends a prompt. It's distinct from training, which teaches the model in the first place. Training is a large up-front cost; inference is the recurring cost you pay on every request.
- Knowledge Distillation Knowledge distillation trains a small 'student' model to imitate a larger 'teacher' model, transferring much of the teacher's capability into a model that's cheaper and faster to run. It's a main way the strong-but-expensive becomes small-enough-to-ship.
- Knowledge Graph A knowledge graph stores information as entities and the relationships between them, rather than as loose documents or vectors. For AI, it gives a model structured, connected context it can traverse — which is one answer to grounding and hallucination.
- LoRA (Low-Rank Adaptation) LoRA is a parameter-efficient way to fine-tune a model: instead of updating all its weights, you train small add-on matrices and leave the original model frozen. You get most of the benefit of fine-tuning at a fraction of the compute and storage.
- Mixture of Experts (MoE) A mixture of experts is a model split into many specialized sub-networks, where a router sends each input to just a few of them. You get the capacity of a huge model while only running a fraction of it per request.
- Open Weights An open-weights model is one whose trained parameters are released publicly, so anyone can download, run, inspect, and fine-tune it. It's distinct from fully open source — the weights are open even when the training data and code aren't.
- Prompt Engineering Prompt engineering is crafting the instruction you give a model — the wording, examples, and output format — to get better results without retraining it. It's the most visible AI skill and, as models improve, increasingly table stakes rather than a moat.
- Quantization Quantization shrinks a model by storing its weights at lower numerical precision — say 4-bit integers instead of 16-bit floats. The model gets smaller and faster to run, usually with little quality loss, which is what lets large models fit on smaller hardware.
- Reasoning Models Reasoning models are LLMs trained to do extended step-by-step thinking before they answer, spending more compute at inference to work through hard problems. They trade latency and cost for accuracy on math, code, and multi-step logic.
- Retrieval-Augmented Generation (RAG) RAG is the pattern of fetching relevant documents at query time and feeding them to a model alongside the question, so the answer is grounded in real sources instead of the model's memory. It's how you put private or current data in front of a model without retraining it.
- RLHF (Reinforcement Learning from Human Feedback) RLHF is a training step that tunes a model toward what people actually prefer: humans rank model outputs, those rankings train a reward model, and the model is then optimized to score well against it. It's a big part of why chat models feel helpful instead of just fluent.
- Robotic Process Automation (RPA) RPA automates repetitive digital tasks with explicit, rule-based scripts — click here, copy this field, paste it there. It's deterministic and brittle: it does exactly what it's told and breaks when the screen or process changes, which is the contrast that defines AI agents.
- State-Space Models (Mamba) State-space models are a transformer alternative that process sequences by carrying a compact running state forward, rather than comparing every token to every other token. They scale linearly with sequence length instead of quadratically — cheaper on long inputs — with Mamba the best-known example.
- Temperature Temperature is the setting that controls how random a model's output is. Low temperature makes it pick the most likely next token almost every time (focused, repeatable); high temperature spreads the odds (varied, creative, less predictable). It's the main dial between consistency and creativity.
- Tokenization Tokenization splits text into the chunks a model actually processes: tokens, which are roughly word-pieces, not whole words. It's why model limits and pricing are counted in tokens, and why 'a few paragraphs' is a fuzzy unit while a token count, for a given tokenizer, is exact.
- Transformer The transformer is the neural-network architecture behind almost every modern large language model. Its key idea is attention: each token can look at every other token and weigh which ones matter, which is what lets the model handle context and long-range meaning.
- Vector Database A vector database stores embeddings — the numerical representations of your data — and is built to find the nearest ones to a query fast. It's the retrieval engine underneath most RAG systems.
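To make the Temperature entry above concrete, here is a minimal sketch of temperature-scaled softmax, the usual way raw next-token scores become probabilities. The function name and the three example scores are assumptions made for illustration; they are not taken from any particular model or library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the largest value before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, temperature=0.2)   # sharp distribution
high = softmax_with_temperature(logits, temperature=2.0)  # flatter distribution

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the low setting nearly all of the probability sits on the top-scoring token; at the high setting it is spread more evenly across the three, which is the consistency-versus-variety dial the entry describes.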
Security & attacks
How AI systems get attacked, and the defenses.
- AI Guardrails Guardrails are the checks that keep an AI system inside safe, intended behavior — filtering inputs, constraining what it can do, and validating outputs before they reach a user. They run outside the model, so they hold even when the model is wrong or manipulated.
- AI Red Teaming AI red teaming is deliberately attacking your own AI system before someone else does — probing it with adversarial inputs to find where it leaks data, breaks its rules, or fails dangerously, so you can fix those holes before launch.
- Backdoor Attack A backdoor attack plants a hidden trigger in a model during training, so it behaves normally until it sees a specific input — then it flips to attacker-chosen behavior. The model passes normal testing, which is what makes the backdoor dangerous.
- Data Poisoning Data poisoning is an attack that corrupts the data a model learns from — its training set, fine-tuning examples, or a knowledge base it retrieves from — so the model behaves the way the attacker wants while looking normal.
- Evasion Attack An evasion attack crafts an input designed to slip past a model's classifier or safety check at inference time — a spam message tweaked to read as legitimate, a malicious payload perturbed to look benign. The model isn't compromised; it's fooled by an input built to exploit its blind spots.
- Excessive Agency Excessive agency is giving an AI agent more capability, permission, or autonomy than its task needs — broad tool access, write permissions, the ability to act without approval. It turns a model mistake or a successful attack into real-world damage.
- Jailbreaking Jailbreaking is crafting a prompt that gets a model to bypass its own safety rules — producing content it was trained to refuse — usually through roleplay, hypotheticals, or obfuscation that talks the model around its guardrails.
- Membership Inference Attack A membership inference attack figures out whether a specific record was in a model's training data by probing how the model responds. It's a privacy leak: confirming someone's data was used can itself expose sensitive information.
- Model Denial of Service Model denial of service is making an AI system unavailable or ruinously expensive by flooding it with requests or crafting inputs that force maximum work — huge outputs, deep tool loops, giant context. Because each call costs real money, the financial version is sometimes called 'denial of wallet.'
- Model Inversion Attack A model inversion attack reconstructs sensitive training data by probing a model's outputs — recovering, for example, features of the records it was trained on. It's a privacy threat: the model itself can leak the data it learned from.
- Prompt Injection Prompt injection is an attack where malicious instructions hidden in the input — a user message, a web page, a document the agent reads — trick the model into ignoring its real instructions and doing the attacker's bidding instead.
- Token Leakage Token leakage is an AI system exposing secrets it shouldn't — API keys, credentials, or auth tokens — in its output, logs, or traces. It happens when secrets end up in the context window or tool results and the model repeats them, or when verbose logging captures them.
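As one illustration of an output-side guardrail against token leakage, the sketch below scans model output for secret-looking strings and masks them before the text is shown or logged. The two patterns and the sample key are hypothetical shapes chosen for the example; a real deployment would need patterns matched to the credential formats it actually uses, and pattern matching alone will not catch every secret.

```python
import re

# Hypothetical patterns, for illustration only.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),            # assumed API-key shape
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{16,}"),  # assumed bearer-token shape
]

def redact_secrets(text, placeholder="[REDACTED]"):
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

# Made-up model output containing a made-up key.
model_output = "Use the key sk-abcdefghijklmnop1234 to call the API."
safe_output = redact_secrets(model_output)
print(safe_output)
```

Because the check runs on the output rather than inside the model, it still applies when the model has been manipulated into repeating something from its context.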
Metrics
What the numbers in AI evaluation mean.
- Accuracy Accuracy is the share of predictions a model got right out of all predictions. It's the most intuitive metric and the most misleading — on imbalanced data, a model can score high accuracy while being useless, which is why it's rarely enough on its own.
- Answer Relevance Answer relevance measures whether a response actually addresses the question that was asked, rather than drifting into related-but-off material. It catches the failure where an answer is true and well-sourced but doesn't answer what the user wanted.
- AUC-ROC AUC-ROC measures how well a classifier separates two classes across every possible threshold, summarized as one number between 0 and 1: 0.5 is no better than random guessing, 1.0 is perfect separation, and a score below 0.5 means the ranking is worse than chance. Unlike accuracy, it doesn't depend on where you set the decision cutoff.
- BERTScore BERTScore compares generated and reference text by the similarity of their embeddings rather than exact word overlap. Because it works in meaning-space, it credits a correct paraphrase that BLEU or ROUGE would mark down.
- BLEU Score BLEU scores machine-generated text by how much its word sequences overlap with one or more human reference texts. It was built for machine translation, runs from 0 to 1, and is fast and cheap — but it rewards surface word-matching, not meaning.
- Cohen's Kappa Cohen's Kappa measures how much two raters agree beyond what you'd expect from random chance. It matters for AI because it's how you check whether your human labels — or an LLM judge against humans — are consistent enough to trust as ground truth.
- Context Relevance Context relevance measures whether the documents a RAG system retrieved actually bear on the question. It scores the retrieval step on its own, before the model writes anything — because the best model can't answer well from the wrong context.
- F1 Score The F1 score combines precision and recall into a single number — their harmonic mean. It's high only when both are high, which makes it a fairer summary than plain accuracy when the classes are imbalanced.
- Faithfulness Faithfulness measures whether an answer is actually supported by the source material it was given — every claim traceable to the retrieved context, nothing invented. It's the core anti-hallucination metric for RAG systems.
- Instruction Adherence Instruction adherence measures whether a model actually did what it was told — followed the format, honored the constraints, stayed within the rules of the prompt. A model can give a high-quality answer that ignores half the instructions, and this is the metric that catches it.
- Latency Latency is how long an AI system takes to respond. For LLMs it splits into time-to-first-token (how fast output starts) and total generation time, and it's a first-class product metric — a more accurate model that's too slow can still be the wrong choice.
- Mean Reciprocal Rank (MRR) MRR measures how high up the first correct result appears in a ranked list, averaged over many queries. If the right answer is usually near the top, MRR is close to 1; if it's buried, MRR drops. It's a core retrieval and search metric.
- METEOR METEOR is a text-generation metric that scores overlap with a reference more flexibly than BLEU — it credits synonyms and word-stem matches, not just exact words, and accounts for word order. It was designed to correlate better with human judgment on translation.
- Perplexity Perplexity measures how surprised a language model is by a piece of text — lower means the model found it more predictable. It's a quick intrinsic gauge of how well a model fits a dataset, but it says little about whether the model is actually useful or correct.
- Precision and Recall Precision and recall measure two different kinds of error. Precision asks: of the things the system flagged, how many were right? Recall asks: of the things it should have flagged, how many did it catch? They usually trade off against each other, so which one matters depends on whether false positives or misses cost you more.
- ROUGE ROUGE scores a generated summary by how much it overlaps with a human reference summary — leaning on recall, how much of the reference's content the output captured. It's the standard automatic metric for summarization.
- Word Error Rate (WER) WER measures speech-recognition accuracy as the share of words a transcript got wrong: the insertions, deletions, and substitutions needed to fix it, divided by the number of words in the reference transcript. Lower is better, and unlike most metrics it can exceed 100%, because insertions can outnumber the reference words.
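The Accuracy, Precision and Recall, and F1 Score entries above can be checked with a few lines of arithmetic. This is a minimal sketch on made-up, imbalanced labels (95 negatives, 5 positives); the function name and both sets of predictions are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced dataset: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that never flags anything.
always_negative = [0] * 100
lazy = classification_metrics(y_true, always_negative)

# A model with 2 false alarms that catches 4 of the 5 positives.
useful_preds = [1] * 2 + [0] * 93 + [1] * 4 + [0] * 1
useful = classification_metrics(y_true, useful_preds)

print(lazy)
print(useful)
```

The never-flag model reaches 95% accuracy with zero recall and zero F1, while the second model's accuracy is only slightly higher even though it finds most of the positives. That gap is why accuracy alone is rarely enough on imbalanced data.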
Governance & compliance
Running AI responsibly and within the rules.
- AI Governance AI governance is the set of controls that lets an organization deploy AI responsibly: knowing what AI systems are running, bounding what they can do, logging what they did, and naming who's accountable. It's how you earn the right to ship AI that takes real actions.
- AI Pilot (Proof of Concept) An AI pilot is a small, time-boxed test of an AI use case before a full rollout. The trap is that pilots are easy and production is hard — a demo that works with a few users often dies on the way to scale, which is why so many never show ROI.
- AI Safety AI safety is the work of keeping AI systems from causing harm — making them behave as intended, refuse dangerous requests, and fail gracefully. In practice for builders it means alignment, guardrails, evaluation for harmful behavior, and human oversight on consequential actions.
- Audit Trail An audit trail is a durable, reconstructable record of what an AI system did — the inputs, decisions, tool calls, and outputs — so you can later explain or investigate any action it took. It's what turns 'the agent did something' into 'here's exactly what it did and why.'
- EU AI Act The EU AI Act is the European Union's regulation of AI, which sorts systems by risk level and imposes obligations accordingly — banning a few uses outright, heavily regulating 'high-risk' ones, and adding transparency rules for general-purpose models. Like GDPR, its reach extends to anyone serving EU users.
- Human in the Loop Human in the loop means keeping a person in the decision path of an AI system — to approve high-stakes actions, review uncertain outputs, or label the cases the model got wrong. It's the practical way to deploy autonomy you don't fully trust yet.
- Model Risk Management Model risk management is the discipline of identifying, measuring, and controlling the risks a model poses to a business — that it's wrong, biased, misused, or drifts over time. It comes from regulated finance and now applies to AI: treat each model as a risk to be governed, not just a tool to be shipped.
- Shadow AI Shadow AI is employees using AI tools their organization hasn't approved or doesn't know about — pasting work into a consumer chatbot, wiring up an unsanctioned agent. It's where a lot of real AI adoption actually happens, and where the governance and data-leak risk lives.