AI Glossary — Chain of Thought

Concepts

The core ideas behind modern AI.

Agent Memory Agent memory is what an AI agent keeps and reuses beyond a single turn — working memory in the context window, plus longer-term stores it can write to and retrieve from later. It's how an agent stops starting every session from zero.
Chain-of-Thought Prompting Chain-of-thought prompting asks a model to work through its reasoning step by step before giving a final answer. Spelling out the intermediate steps often improves accuracy on multi-step problems like math, logic, and planning, with the largest gains typically seen on larger models.
Compound AI Systems A compound AI system solves a task with multiple components — several model calls, retrieval, tools, and control logic — rather than a single prompt to a single model. Most real production AI is compound, which is why reliability is a systems problem, not a model problem.
Context Engineering Context engineering is the discipline of deciding what information goes into a model's context window for a task — which documents, which history, which tool output — and how fresh and trustworthy it is. As models commoditize, it's where a lot of the durable advantage now lives.
Context Poisoning Context poisoning is when bad, stale, or excessive information in a model's context window degrades its reasoning — the agent gets buried in irrelevant tokens or misled by wrong data, and its answers get worse even though the model is fine.
Context Window The context window is the amount of text a model can take in at once — the prompt, the conversation so far, and any documents or tool output you include. Everything the model can 'see' for a given response has to fit inside it.
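One way to picture the "has to fit" constraint is a sketch that drops the oldest messages until the rest fits a budget. The helper names `count_tokens` and `fit_to_window` are hypothetical, and counting one token per whitespace-separated word is a loud simplification: real tokenizers count differently.

```python
def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # Real tokenizers split text into word-pieces and give different counts.
    return len(text.split())


def fit_to_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "hello there",
    "how can I help you today",
    "summarize this report please",
]
window = fit_to_window(history, budget=10)
```

With a budget of 10, the two newest messages fit (4 + 6 words) and the oldest one is dropped. Keeping the newest messages is only one possible policy; summarizing or retrieving older material are common alternatives.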
Embeddings Embeddings are numerical representations of text, images, or other data as vectors, where things with similar meaning land close together. They're what lets a system search by meaning instead of by exact keyword.
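A minimal sketch of "similar meaning lands close together", using cosine similarity as the closeness measure. The three-dimensional vectors below are made up for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

close = cosine_similarity(embeddings["cat"], embeddings["kitten"])
far = cosine_similarity(embeddings["cat"], embeddings["invoice"])
```

Here `close` comes out much higher than `far`, which is the property a meaning-based search relies on.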
Explainability Explainability is how well you can understand why an AI system produced a given output. It matters most where decisions need to be justified — lending, hiring, healthcare — and it's hard for large models, whose reasoning isn't transparent just because they can narrate a plausible-sounding rationale.
Fine-Tuning Fine-tuning continues training a pretrained model on your own examples so it gets better at a specific task, tone, or format. It changes the model's weights, unlike prompting or RAG, which change what you feed it.
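FlashAttention FlashAttention is an optimized way to compute a transformer's attention that's far more memory-efficient. It works through the attention computation in small tiles that fit in fast on-chip memory, so the huge intermediate attention matrix is never written out to the GPU's slower main memory. Because it computes exact attention rather than an approximation, it makes longer context windows and faster training practical without changing the model's results.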
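The idea that lets attention be computed without holding every score at once is a running ("online") softmax. The sketch below shows it for a single query in plain Python, under heavy simplifying assumptions: no GPU, no batching, no 1/sqrt(d) scaling. It is not the FlashAttention kernel itself, and the function names are hypothetical. It only illustrates that block-by-block processing gives the same answer as the all-at-once version.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(q, keys, values):
    """Reference version: compute every score first, then softmax, then mix values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Process keys block by block, keeping only a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_scores = [dot(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(block_scores))
        # Rescale what was accumulated under the old max to the new max.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for score, value in zip(block_scores, values[start:start + block_size]):
            weight = math.exp(score - new_max)
            running_sum += weight
            acc = [a + weight * v for a, v in zip(acc, value)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values)
```

The two results agree up to floating-point rounding, even though the tiled version never stores more than one block of scores.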
Frontier Model A frontier model is one of the most capable AI models available at a given moment — the latest flagship releases from the major labs that set the current ceiling on what's possible. The label moves: today's frontier model is next year's baseline.
Inference Inference is running a trained model to produce output: the part that happens every time a user sends a prompt. It's distinct from training (teaching the model in the first place). Training is a large, mostly up-front cost; inference is the recurring cost you pay on every request for as long as the model is in service.
Knowledge Distillation Knowledge distillation trains a small 'student' model to imitate a larger 'teacher' model, transferring much of the teacher's capability into a model that's cheaper and faster to run. It's a main way the strong-but-expensive becomes small-enough-to-ship.
Knowledge Graph A knowledge graph stores information as entities and the relationships between them, rather than as loose documents or vectors. For AI, it gives a model structured, connected context it can traverse — which is one answer to grounding and hallucination.
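A small sketch of "entities and relationships a model can traverse", storing facts as (subject, relation, object) triples and following them with a breadth-first search. Every entity and relation name here is invented for illustration, and `build_graph` and `find_path` are hypothetical helpers rather than any graph database's API.

```python
from collections import defaultdict, deque

# Hypothetical facts stored as (subject, relation, object) triples.
triples = [
    ("Acme", "acquired", "Widgetly"),
    ("Widgetly", "manufactures", "Sprocket X"),
    ("Sprocket X", "certified_under", "Safety Mark A"),
    ("Acme", "headquartered_in", "Harbor City"),
]


def build_graph(facts):
    """Index the triples by subject so outgoing edges are quick to look up."""
    graph = defaultdict(list)
    for subject, relation, obj in facts:
        graph[subject].append((relation, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search returning the chain of facts that links start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None  # no connection found


graph = build_graph(triples)
path = find_path(graph, "Acme", "Safety Mark A")
```

The returned path is a three-hop chain of explicit facts, which is the kind of connected context that could be handed to a model as grounding.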
LoRA (Low-Rank Adaptation) LoRA is a parameter-efficient way to fine-tune a model: instead of updating all its weights, you train small add-on matrices and leave the original model frozen. You get most of the benefit of fine-tuning at a fraction of the compute and storage.
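The "small add-on matrices" can be sketched as two thin matrices B and A whose product is added to the frozen weight W. The tiny sizes, the `scale` parameter, and the helper names are assumptions for illustration; real implementations operate on tensors inside a training framework.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, b, a, scale=1.0):
    """Effective weight: frozen W plus the low-rank update scale * (B @ A)."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
b = [[1.0], [0.0], [2.0]]                                # trainable 3x1
a = [[0.5, 0.0, 1.0]]                                    # trainable 1x3
w_eff = lora_weight(w, b, a)

full_count = 4096 * 4096                       # updating every weight
lora_count = trainable_params(4096, 4096, 8)   # rank-8 adapter
```

For a 4096 by 4096 layer, the rank-8 adapter trains 65,536 values instead of 16,777,216, which is where the compute and storage savings come from.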
Mixture of Experts (MoE) A mixture of experts is a model split into many specialized sub-networks, where a router sends each token to just a few of them. You get the capacity of a huge model while only running a fraction of its parameters per token, though all of the experts still have to be held in memory.
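A rough sketch of the routing step, assuming top-k selection over softmax gate scores. The "experts" are trivial stand-in functions, and the gate scores are passed in by hand; in a real model a learned router computes them from the input, and the experts are neural sub-networks.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, top_k=2):
    """Pick the top_k experts and renormalize their gate weights to sum to 1."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


def moe_forward(x, experts, gate_logits, top_k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    weights = route(gate_logits, top_k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


# Stand-in experts; hypothetical and deliberately trivial.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 1.5, -1.0]  # assumed router scores for one token

weights = route(gate_logits)
output = moe_forward(3.0, experts, gate_logits)
```

Only experts 1 and 2 run for this input; the other two are skipped entirely, which is the source of the compute savings.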
Open Weights An open-weights model is one whose trained parameters are released publicly, so anyone can download, run, inspect, and fine-tune it. It's distinct from fully open source — the weights are open even when the training data and code aren't.
Prompt Engineering Prompt engineering is crafting the instruction you give a model — the wording, examples, and output format — to get better results without retraining it. It's the most visible AI skill and, as models improve, increasingly table stakes rather than a moat.
Quantization Quantization shrinks a model by storing its weights at lower numerical precision — say 4-bit integers instead of 16-bit floats. The model gets smaller and faster to run, usually with little quality loss, which is what lets large models fit on smaller hardware.
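A minimal sketch of the round trip, assuming simple symmetric quantization with one shared scale for the whole list. Production schemes are more elaborate (per-channel or per-group scales, calibration data), and `quantize` and `dequantize` are hypothetical names.

```python
def quantize(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

The integers `[7, -3, 1, 0]` plus a single float scale stand in for the original values, and each restored weight is off by at most half a quantization step. That small, bounded error is the "little quality loss" being traded for size.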
Reasoning Models Reasoning models are LLMs trained to do extended step-by-step thinking before they answer, spending more compute at inference to work through hard problems. They trade latency and cost for accuracy on math, code, and multi-step logic.
Retrieval-Augmented Generation (RAG) RAG is the pattern of fetching relevant documents at query time and feeding them to a model alongside the question, so the answer is grounded in real sources instead of the model's memory. It's how you put private or current data in front of a model without retraining it.
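The fetch-then-feed pattern can be sketched in a few lines. The scoring function below (count of shared lowercase words) is a deliberately crude stand-in for a real retriever, the documents are invented, and no model is called: the sketch stops at building the prompt.

```python
def score(query: str, document: str) -> int:
    """Hypothetical relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Place the retrieved sources in front of the question."""
    sources = retrieve(query, documents, k)
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(sources))
    return f"Answer using only the sources below.\n\n{context}\n\nQuestion: {query}"


# Invented example documents.
docs = [
    "Refunds are issued within 14 days of a return.",
    "Our office is closed on public holidays.",
    "A return must include the original receipt.",
]
question = "How long do refunds take after a return"
prompt = build_prompt(question, docs)
```

The resulting prompt contains the two refund-related documents and leaves out the unrelated one. In practice the word-overlap scorer would usually be replaced by embedding search.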
RLHF (Reinforcement Learning from Human Feedback) RLHF is a training step that tunes a model toward what people actually prefer: humans rank model outputs, those rankings train a reward model, and the model is then optimized to score well against it. It's a big part of why chat models feel helpful instead of just fluent.
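The "rankings train a reward model" step is often implemented with a pairwise loss that pushes the preferred output's score above the rejected one's. The sketch below assumes that common formulation, -log(sigmoid(r_chosen - r_rejected)); the entry above does not specify a loss, so treat this as one plausible instance. The reward values are made-up numbers.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for a preferred and a rejected answer.
agrees_with_humans = preference_loss(2.0, 0.0)      # preferred answer scored higher
disagrees_with_humans = preference_loss(0.0, 2.0)   # preferred answer scored lower
undecided = preference_loss(1.0, 1.0)               # equal scores
```

The loss is small when the reward model agrees with the human ranking and large when it disagrees, so minimizing it teaches the reward model to reproduce human preferences.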
Robotic Process Automation (RPA) RPA automates repetitive digital tasks with explicit, rule-based scripts: click here, copy this field, paste it there. It's deterministic and brittle: it does exactly what it's told and breaks when the screen or process changes. That rigidity is the contrast AI agents are usually defined against, since an agent is expected to adapt when conditions change.
State-Space Models (Mamba) State-space models are a transformer alternative that process sequences by carrying a compact running state forward, rather than comparing every token to every other token. They scale linearly with sequence length instead of quadratically — cheaper on long inputs — with Mamba the best-known example.
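The "compact running state" can be sketched as a linear recurrence that touches each input exactly once. This scalar version with fixed constants `a`, `b`, and `c` is an illustration only: Mamba-style models use learned, input-dependent parameters and vector-valued states.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Scalar state-space recurrence: state = a*state + b*x, output = c*state."""
    state = 0.0
    outputs = []
    for x in inputs:  # one step per token, so cost grows linearly with length
        state = a * state + b * x
        outputs.append(c * state)
    return outputs


outputs = ssm_scan([1.0, 0.0, 0.0])
```

A single input of 1.0 keeps echoing through the state at 1.0, then roughly 0.9, then roughly 0.81. That is how earlier tokens influence later outputs without any token-to-token comparison.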
Temperature Temperature is the setting that controls how random a model's output is. Low temperature makes it pick the most likely next token almost every time (focused, repeatable); high temperature spreads the odds (varied, creative, less predictable). It's the main dial between consistency and creativity.
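Mechanically, temperature usually divides the model's raw scores (logits) before they are turned into probabilities. The sketch below assumes that standard softmax formulation; the logits are invented, and a temperature of exactly 0 is not handled (it would divide by zero and is typically treated as "always pick the top token").

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
focused = softmax_with_temperature(logits, 0.2)
varied = softmax_with_temperature(logits, 2.0)
```

At temperature 0.2 almost all of the probability lands on the top token; at 2.0 it is spread much more evenly across the three, so sampling produces more varied output.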
Tokenization Tokenization splits text into the chunks a model actually processes: tokens, which are roughly word-pieces, not whole words. It's why model limits and pricing are counted in tokens, and why 'a few paragraphs' is a fuzzy unit but a token count is exact for a given tokenizer.
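A toy illustration of "word-pieces, not whole words", using greedy longest-match against a tiny invented vocabulary. Real tokenizers learn their vocabularies from data (byte-pair encoding is one common approach) and will split the same text differently; `tokenize` is a hypothetical helper.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match: at each position take the longest known piece."""
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest candidate first
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens


vocab = {"token", "ization", "un", "believ", "able"}  # invented vocabulary
pieces = tokenize("tokenization", vocab)
more_pieces = tokenize("unbelievable", vocab)
```

One word becomes two tokens (`token`, `ization`) and the other becomes three (`un`, `believ`, `able`), which is why word counts and token counts diverge.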
Transformer The transformer is the neural-network architecture behind almost every modern large language model. Its key idea is attention: each token can look at the other tokens in the context (in most LLMs, only the ones that come before it) and weigh which ones matter, which is what lets the model handle context and long-range meaning.
Vector Database A vector database stores embeddings — the numerical representations of your data — and is built to find the nearest ones to a query fast. It's the retrieval engine underneath most RAG systems.
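The core operation can be sketched as a store that returns the ids of the vectors nearest a query. This toy `VectorStore` class is hypothetical and scans every stored vector, which is fine for a handful of items; real vector databases generally rely on approximate nearest-neighbor indexes to stay fast at scale. Euclidean distance and the two-dimensional vectors are assumptions made for brevity.

```python
import math


class VectorStore:
    """Toy in-memory store with brute-force nearest-neighbor search."""

    def __init__(self):
        self._items: dict[str, list[float]] = {}

    def add(self, item_id: str, vector: list[float]) -> None:
        self._items[item_id] = vector

    def search(self, query: list[float], k: int = 1) -> list[str]:
        """Return the ids of the k stored vectors closest to the query."""
        ranked = sorted(
            self._items,
            key=lambda item_id: math.dist(query, self._items[item_id]),
        )
        return ranked[:k]


store = VectorStore()
store.add("doc-cats", [0.9, 0.1])  # invented embeddings
store.add("doc-dogs", [0.8, 0.3])
store.add("doc-tax", [0.1, 0.9])
nearest = store.search([1.0, 0.0], k=2)
```

The query lands closest to `doc-cats`, then `doc-dogs`, and those are the ids a RAG system would go on to fetch.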

Security & attacks

How AI systems get attacked, and the defenses.

Metrics

What the numbers in AI evaluation mean.

Governance & compliance

Running AI responsibly and within the rules.
