Generative AI and Agents Glossary
This glossary covers the core concepts you'll encounter when working with generative AI and agents. The domain is moving at a blistering pace, so expect new terms and techniques to emerge regularly.
A
Action The step an agent takes during execution. This could be calling a tool, making an API request, or providing a final answer to the user. Actions follow from the agent's reasoning about what to do next.
Agentic RAG An evolution of basic RAG where autonomous agents actively refine searches based on reasoning. Instead of one static search, agents generate multiple query refinements, decompose complex queries into steps, and validate results before responding.
Agent An AI system that can reason about tasks, use tools, and take actions to accomplish goals. Agents combine a language model (for decision-making), tools (for external interactions), and an orchestration layer (for reasoning and planning).
AgentOps The operational practices for building, deploying, and maintaining agents in production. It extends DevOps and MLOps with agent-specific concerns like tool management, orchestration, memory, and task decomposition.
Approximate Nearest Neighbor (ANN) Fast search methods that find vectors similar to a query without checking every single vector. Used in vector databases to scale semantic search to billions of embeddings. Examples include ScaNN, HNSW, and LSH.
Attention Mechanism How transformers determine which parts of the input to focus on. Each token is projected into query, key, and value vectors. Scores computed from query-key dot products measure how relevant each token is to every other token. A softmax turns the scores into weights, and each token's output is the weighted sum of the value vectors.
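The attention calculation can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention, assuming the query, key, and value vectors have already been produced; the function names `softmax` and `attention` are chosen here for illustration, and real implementations use batched tensor operations.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score every key against this query, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

When two keys match a query equally well, the weights are equal and the output is the average of the two value vectors.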
Autoregressive Generation Generating text one token at a time, where each new token depends on all previous tokens. The model predicts the next token, adds it to the sequence, then predicts again. This is how most LLMs work.
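The predict, append, repeat loop can be sketched as follows. The lookup table `NEXT_TOKEN` and the helper `predict_next` are hypothetical stand-ins for a real model, which would condition on the whole sequence rather than only the last token.

```python
# Hypothetical toy "model": maps the last token to its most likely successor.
NEXT_TOKEN = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}

def predict_next(tokens):
    # A real LLM conditions on every previous token; this toy uses only the last one.
    return NEXT_TOKEN.get(tokens[-1], "</s>")

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "</s>":  # Stop at the end-of-sequence token.
            break
        tokens.append(nxt)  # The new token becomes part of the next input.
    return tokens
```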
B
Benchmark Standard tests that measure model or agent performance. Examples include BFCL for function calling, PlanBench for planning, and AgentBench for end-to-end agent capabilities. Helps compare different approaches objectively.
BERT (Bidirectional Encoder Representations from Transformers) An encoder-only transformer model trained to understand context by predicting masked words. It revolutionized embeddings but isn't used for text generation. Most modern generative LLMs are decoder-only models that share its transformer foundation rather than descending from BERT directly.
C
Chain-of-Thought (CoT) A prompting technique where you show the model how to break problems into steps. Instead of jumping to an answer, the model generates intermediate reasoning steps. Improves performance on complex tasks requiring multi-step logic.
Chunking Breaking documents into smaller pieces before generating embeddings. Good chunking keeps related information together while staying within size limits. Critical for RAG quality.
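One simple strategy is fixed-size chunks with overlap, so that text near a boundary appears in both neighbors. The sketch below is a minimal example that splits on whitespace; the function name `chunk_words` and its defaults are assumptions for illustration, and production systems often split on sentence or section boundaries instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```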
Context Length / Context Window The maximum number of tokens a model can process at once. Longer contexts let models handle more information but require more compute. Modern models range from about 4K to 128K+ tokens, with some supporting 1M or more.
Cross-Entropy Loss The function used to measure how wrong model predictions are during training. Lower loss means better predictions. The training process adjusts parameters to minimize this loss.
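For next-token prediction with a single correct token, the loss reduces to the negative log of the probability the model assigned to that token. A minimal sketch, assuming the model's output is already a list of probabilities:

```python
import math

def cross_entropy(predicted_probs, target_index):
    # Negative log of the probability assigned to the correct token.
    return -math.log(predicted_probs[target_index])

# A confident correct prediction yields a lower loss than an unsure one.
confident = cross_entropy([0.05, 0.9, 0.05], target_index=1)
unsure = cross_entropy([0.4, 0.3, 0.3], target_index=1)
```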
D
Data Store A tool type that gives agents access to external information through RAG. Includes vector databases, relational databases, and document repositories. Agents query these to ground responses in factual data.
Decoder The part of a transformer that generates output text. Most modern LLMs use decoder-only architectures. The decoder predicts the next token based on previous tokens and any encoder outputs.
Distillation Training a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's outputs, not just raw data. Reduces inference costs while preserving much of the performance.
Dual Encoder An embedding architecture with separate networks for queries and documents. Lets you optimize each side differently since questions and answers have different characteristics. Common in search applications.
E
Embeddings Numerical representations of text, images, or other data in vector space. Similar items map to nearby points. Critical for semantic search, classification, and RAG. Typical dimensions: 768 to 1536.
Encoder The part of a transformer that processes input into representations. It uses self-attention to understand relationships between tokens. Encoder-only models like BERT excel at understanding but don't generate text.
Encoder-Decoder The original transformer architecture with separate encoding and decoding. The encoder processes input, the decoder generates output. Good for translation and summarization. Most modern LLMs skip the encoder.
Evaluation Measuring how well a model or agent performs. For agents, this includes trajectory evaluation (did it take the right steps?), final response quality, tool usage accuracy, and operational metrics like latency.
Extensions Pre-built integrations that let agents interact with external services. The agent controls when and how to call the API. Examples include code interpreters, search engines, and database connectors.
F
Few-Shot Prompting Providing the model with a few examples of the task before asking it to perform. Shows the model the pattern to follow. More reliable than zero-shot for complex or unusual tasks.
Fine-Tuning Additional training on task-specific data to improve model performance. Much cheaper than pre-training. Types include supervised fine-tuning (SFT), instruction tuning, and safety tuning.
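Flash Attention An optimized attention implementation that minimizes data movement between GPU memory tiers. It processes attention in tiles and fuses the operations into a single kernel so intermediate results stay in fast on-chip memory. Can give 2-4x speedups while computing exact attention, so outputs match standard attention up to floating-point rounding.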
Function Calling The model's ability to generate structured requests for external functions. The model doesn't execute functions directly - it outputs which function to call and what arguments to use. The application handles execution.
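The split between the model proposing a call and the application executing it can be sketched as below. The JSON shape, the `get_weather` function, and the `FUNCTIONS` registry are all hypothetical; each provider defines its own request format.

```python
import json

def get_weather(city):
    # Hypothetical stand-in for a real API call.
    return f"Sunny in {city}"

# Registry of functions the application is willing to execute.
FUNCTIONS = {"get_weather": get_weather}

def execute_call(model_output):
    """Parse the model's structured request and run it on the application side."""
    call = json.loads(model_output)
    func = FUNCTIONS.get(call["name"])
    if func is None:
        raise ValueError(f"Unknown function: {call['name']}")
    return func(**call["arguments"])

# What a model might emit (assumed shape, not a specific vendor's format).
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = execute_call(model_output)
```

Looking the name up in an explicit registry keeps the application in control of which functions can actually run.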
G
Gradient The vector of partial derivatives of the loss with respect to the parameters. It points in the direction of steepest increase in loss, so the optimizer moves weights in the opposite direction to reduce it. Calculated during backpropagation.
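A one-parameter example shows how an optimizer uses the gradient. This is a minimal sketch of plain gradient descent on the toy loss (w - 3)^2, chosen here for illustration; real training computes gradients for all parameters via backpropagation.

```python
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # Step against the gradient, which points toward increasing loss.
        w = w - learning_rate * gradient(w)
    return w
```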
Greedy Search Always picking the token with highest probability. Simple but can produce repetitive text. Other sampling methods add controlled randomness for more natural outputs.
Grounding Connecting model outputs to factual sources. RAG is one grounding technique. Citation checking and fact verification also help. Reduces hallucinations by anchoring responses in real data.
H
Hallucination When a model generates plausible-sounding but incorrect information. Happens because models predict likely text, not truth. RAG, grounding, and evaluation help reduce hallucinations but don't eliminate them.
HNSW (Hierarchical Navigable Small Worlds) An ANN algorithm that builds a multi-layer graph for fast similarity search. Navigates from general to specific to find nearest neighbors quickly. Good balance of speed and accuracy.
I
Inference Running a trained model to generate outputs. More expensive for LLMs than traditional models because of autoregressive generation. Optimization techniques like prefix caching and speculative decoding reduce costs.
Instruction Tuning Fine-tuning a model to follow natural language instructions. Teaches the model to understand and execute commands like "summarize this article" or "write code to sort a list."
K
Key-Value (KV) Cache Stored key and value vectors from previously processed tokens. Avoids recomputing them at every generation step, so each new token only needs its own query, key, and value. Critical for inference speed during autoregressive generation.
Knowledge Cutoff The date beyond which a model has no training data. Models can't know events after this without external tools. Current models typically have cutoffs in 2023-2025.
L
Large Language Model (LLM) A neural network trained on massive text data to predict the next token. Modern LLMs have billions of parameters and can handle diverse tasks through prompting or fine-tuning.
Layer A set of parameters that transforms data in a neural network. Transformers have many layers, each with attention and feed-forward components. More layers generally mean more capability.
LoRA (Low-Rank Adaptation) A parameter-efficient fine-tuning method that adds small trainable matrices to frozen model weights. Much cheaper than full fine-tuning with similar performance gains.
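The savings come from the shapes of the added matrices. In the sketch below, a frozen weight matrix W of size d_out x d_in is adjusted by the product of two small trainable matrices, B (d_out x rank) and A (rank x d_in). The helper names and the `scale` argument are illustrative assumptions, not a specific library's API.

```python
def matmul(a, b):
    # Plain-Python matrix multiply for small illustrative matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, scale=1.0):
    """Effective weight W + scale * (B @ A); W stays frozen, A and B are trained."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def trainable_params(d_out, d_in, rank):
    # B holds d_out * rank values and A holds rank * d_in values.
    return rank * (d_out + d_in)
```

For a 4096 x 4096 layer at rank 8, this trains 65,536 values instead of the 16,777,216 in the full matrix.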
M
Masked Language Modeling (MLM) Training by hiding random tokens and having the model predict them. Used to train BERT and other encoder models. Teaches bidirectional understanding of context.
Memory (Agent) How agents store information across interactions. Short-term memory maintains conversation history within a session. Long-term memory persists across sessions for personalization and learning.
Multi-Agent System Multiple specialized agents working together on complex tasks. Each agent has its own role and context. They communicate and coordinate to achieve common goals.
Multi-Head Attention Running several attention mechanisms in parallel, each potentially focusing on different relationships. Outputs are combined to give richer representations. Key to transformer performance.
Multimodal Model A model that handles multiple data types - text, images, audio, video. Processes different modalities in a unified way. Examples include Gemini and GPT-4V.
O
Observation The result returned after an agent takes an action. Could be tool output, API response, or error message. Feeds back into the reasoning loop to determine next steps.
One-Shot Prompting Providing exactly one example before asking the model to perform a task. Middle ground between zero-shot and few-shot.
Orchestration Layer The component that manages agent reasoning, planning, state, and memory. Implements frameworks like ReAct or Chain-of-Thought to guide decision-making.
P
Parameters The learned weights in a neural network. More parameters generally mean more capability but also more compute cost. Modern LLMs range from under 1B to 1T+ parameters.
PEFT (Parameter Efficient Fine-Tuning) Techniques that fine-tune only a small subset of parameters. Reduces training costs dramatically. Includes methods like LoRA and adapter layers.
Positional Encoding Information added to embeddings about token positions in the sequence. Transformers need this because attention is order-agnostic. Helps the model understand word order.
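One well-known scheme is the sinusoidal encoding from the original transformer architecture, where each pair of dimensions uses a sine and cosine at a different frequency. The sketch below assumes that scheme; many current models use learned or rotary position methods instead.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        angle = position / (10000 ** ((i - i % 2) / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding
```

The resulting vector is added to the token embedding at that position, so identical tokens at different positions get different inputs.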
Prefix Caching Storing KV cache between requests to avoid recalculating attention for unchanged input. Saves compute on repeated prefixes like system prompts or uploaded documents.
Pre-training The initial training phase on massive unlabeled data. Teaches the model general language understanding. Most expensive training stage, taking weeks to months.
Prompt Engineering Designing input text to get desired model behavior. Includes techniques like few-shot examples, chain-of-thought, and system instructions. Critical for getting good results without fine-tuning.
Q
Query (in Attention) A vector that asks "which other tokens are relevant to me?" It is compared against key vectors in the attention calculation to determine relationships between tokens.
R
RAG (Retrieval Augmented Generation) Finding relevant documents from a knowledge base and adding them to the prompt before generation. Helps models give factual, current answers without retraining.
ReAct (Reasoning and Acting) A prompting framework where the model alternates between reasoning steps and actions. The agent thinks about what to do, takes an action, observes the result, and repeats until reaching an answer.
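The think, act, observe loop can be sketched with a scripted stand-in for the model. Everything named here (`calculator`, `TOOLS`, `scripted_model`, `react_loop`) is hypothetical; a real agent would prompt an LLM at each step and parse its output.

```python
def calculator(expression):
    # Hypothetical tool: adds two integers written as "a+b".
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def scripted_model(history):
    """Stand-in for an LLM: picks the next step from what it has observed so far."""
    observations = [h for h in history if h[0] == "observation"]
    if not observations:
        return ("action", "calculator", "2+3")
    return ("final", observations[-1][1])

def react_loop(model, max_steps=5):
    history = []
    for _ in range(max_steps):
        step = model(history)
        if step[0] == "final":
            return step[1], history
        _, tool_name, tool_input = step
        history.append(("action", tool_name, tool_input))
        # Run the tool and feed the observation into the next reasoning step.
        history.append(("observation", TOOLS[tool_name](tool_input)))
    return None, history  # No answer within the step budget.
```

The `max_steps` cap is a design choice that keeps a confused agent from looping forever.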
Reinforcement Learning from Human Feedback (RLHF) Fine-tuning using human preferences rather than just demonstration data. A reward model learns what humans prefer, then reinforcement learning optimizes the LLM to generate preferred outputs.
Reward Model A model trained on human preference data that scores outputs. Used in RLHF to guide the LLM toward more helpful, safe, or accurate responses.
S
ScaNN (Scalable Nearest Neighbors) Google's ANN algorithm using anisotropic vector quantization. Extremely fast similarity search at scale. Powers products like YouTube and Google Search.
Self-Attention The mechanism where each token attends to all other tokens in the sequence. Lets models capture relationships regardless of distance. Core innovation of transformers.
Semantic Search Finding results based on meaning rather than exact keyword matches. Uses embeddings to measure similarity. Can find relevant documents even with different wording.
Speculative Decoding Speeding up generation by using a small model to guess multiple tokens, then verifying with the large model. Maintains quality while reducing latency for memory-bound decode.
Supervised Fine-Tuning (SFT) Training on input-output pairs where each input has a target response. Examples include question-answer pairs or translation pairs. Improves task-specific performance.
System Prompting Instructions that set the model's behavior and persona. Usually placed at the start of the conversation. Examples: "You are a helpful assistant" or "Always cite your sources."
T
Temperature Controls randomness in generation. Lower values (0.1-0.5) make outputs more focused and deterministic. Higher values (0.7-1.0) increase creativity and diversity.
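Temperature is typically applied by dividing the model's raw scores (logits) by it before the softmax. A minimal sketch, with the function name chosen for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by a small temperature widens the gaps between logits.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

cold = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
warm = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
```

Here `cold` puts more probability on the top token than `warm`, which is why low temperatures read as more deterministic.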
Token The basic unit of text a model processes. Could be a word, word piece, or character. Most models use subword tokenization where common words are single tokens and rare words split into pieces.
Tokenization Breaking text into tokens. Different models use different tokenization schemes. The vocabulary size determines how many unique tokens exist.
Tool An external capability an agent can use. Includes extensions (agent-controlled APIs), functions (client-executed code), and data stores (for RAG).
Top-K Sampling Limiting the model to choose from the K most likely tokens. Prevents very low-probability tokens from being selected. Helps avoid nonsense while maintaining some variety.
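The filtering step can be sketched as follows, assuming the model's output is a list of probabilities. The function name `top_k_filter` is illustrative; sampling then draws from the renormalized result.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; the rest get probability 0."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [probs[i] / total if i in top else 0.0 for i in range(len(probs))]
```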
Top-P (Nucleus) Sampling Choosing from the smallest set of tokens whose cumulative probability exceeds P. Adapts to the situation - uses more tokens when uncertain, fewer when confident.
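The adaptive behavior is easy to see in code. This sketch keeps tokens in descending order of probability until the running total reaches P (implementations differ on whether the threshold is inclusive); the function name `top_p_filter` is illustrative.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

With P = 0.8, a confident distribution such as [0.9, 0.05, 0.05] keeps one token, while a flatter one such as [0.4, 0.3, 0.3] keeps all three.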
Trajectory Evaluation Assessing the steps an agent took to reach its answer. Checks if it used the right tools, followed good reasoning, and didn't get stuck. More insightful than just checking the final output.
Transformer The neural network architecture that powers modern LLMs. Uses self-attention to process sequences in parallel. Much faster to train than RNNs while capturing longer-range dependencies.
Tree of Thoughts (ToT) An extension of chain-of-thought that explores multiple reasoning paths. Like searching a tree of possibilities. Good for problems requiring strategic lookahead or exploration.
V
Vector Database A specialized database for storing and searching embeddings. Optimized for high-dimensional similarity search at scale. Examples include Vertex AI Vector Search, Pinecone, Weaviate, Qdrant.
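Vector Search Finding items with similar embeddings using distance metrics such as cosine similarity or Euclidean distance. With an ANN index, it is much faster than comparing the query against every stored vector. Core technology enabling semantic search and RAG.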
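As a baseline, exact search scores every stored vector against the query. The sketch below uses cosine similarity and brute force, which is the approach ANN indexes are designed to avoid at scale; the function names are illustrative and the vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, top_n=1):
    """Exact (brute-force) search: score every vector, return the best indices."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:top_n]
```

The cost grows linearly with the number of stored vectors, which is why large collections rely on approximate indexes.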
Z
Zero-Shot Prompting Asking the model to perform a task without examples. Relies purely on the model's pre-training and the instruction clarity. Works well for simple tasks or very capable models.