Generative AI
The Hitchhiker's Guide to Generative AI and Agents
All the basic concepts and terms you need to know about generative AI applications and agents

Deb RoyChowdhury
Every new technology brings its own language. Sometimes that language clarifies. Often it obscures.
Generative AI is no exception. In the past two years, we've been flooded with terms: LLMs, embeddings, RAG, agentic workflows, chain-of-thought reasoning, hallucinations, fine-tuning, vector databases. Some of these terms describe genuinely new concepts. Others are just marketing, renaming old ideas.
This guide cuts through the noise. No hype, no hand-waving.
Clear explanations of what these terms actually mean and why they matter when you're building AI applications.
What is Generative AI?
Generative AI creates new content (text, images, code, audio) based on patterns it learned from training data.
This is different from traditional AI, which classifies or predicts from existing options. A spam filter decides if an email is spam or not (classification). A recommendation engine picks from existing products (prediction). A stock price forecasting model predicts a numeric value based on regression.
Generative AI creates content that didn't exist before, based on learned patterns and the user's query or prompt.
The "generative" part matters. These systems don't just retrieve or categorize or forecast. They generate.
When you ask ChatGPT to write an email, it doesn't pull from a database of emails. It generates new text, word by word, based on patterns learned from billions of examples.
Large Language Models (LLMs)
Large Language Models are the engines behind most text-based generative AI applications. They're trained on enormous amounts of text, much of it from the internet, to predict what words should come next in a sequence.
The "large" refers to two things: the amount of training data (often trillions of words) and the number of parameters (the internal variables the model adjusts during training). GPT-4 reportedly has over a trillion parameters. Claude 3.5 Sonnet, hundreds of billions.
Here's what matters practically: LLMs are prediction machines. You give them text (called a prompt), and they predict what should come next, token by token, based on patterns they learned during training.
They don't "understand" in the way humans do. They're exceptionally good at recognizing and reproducing patterns. That difference explains both their capabilities and their limitations.
Tokens
Tokens are the basic units LLMs work with. Not quite words, not quite characters, somewhere in between.
A token might be a whole word ("cat"), part of a word ("un" + "bel" + "iev" + "able"), or even punctuation. English text typically breaks down to about 4 characters per token, so roughly 750 words equals 1,000 tokens.
Why this matters: LLMs are priced by tokens. Context windows (how much text the model can consider at once) are measured in tokens. Understanding tokens helps you understand costs and constraints.
When someone says "GPT-4 has a 128k token context window," they mean it can consider about 96,000 words of input text at once.
Prompts and Context
A prompt is the text you give an LLM to generate a response. It's both simpler and more complex than it sounds.
Simple version: "Write an email apologizing for a delayed shipment."
Complex version: A carefully structured prompt that includes instructions, examples, relevant data, conversation history, and constraints on the output format.
Context is everything the model sees when generating a response. This includes:
- Your current prompt 
- Previous messages in the conversation 
- Any additional information you've fed it (documents, data, instructions) 
- System messages that shape how it behaves 
Context management is crucial. LLMs are stateless: they don't remember past conversations unless you include that history in the current context. If you want the model to remember what the user said in a separate conversation three days ago, you have to send those messages again.
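Most chat APIs expose this as a list of messages you assemble on every request. Here's a minimal sketch of that shape; the role names follow the common system/user/assistant convention, and the helper function itself is hypothetical.

```python
def build_messages(system_prompt, history, user_input, retrieved_docs=None):
    """Assemble the full context the model will see for this turn."""
    messages = [{"role": "system", "content": system_prompt}]

    # The model is stateless: past turns only exist if we resend them.
    messages.extend(history)  # e.g. [{"role": "user", ...}, {"role": "assistant", ...}]

    # Extra information (documents, data) is just more text in the context.
    if retrieved_docs:
        docs = "\n\n".join(retrieved_docs)
        messages.append({"role": "user", "content": f"Reference material:\n{docs}"})

    messages.append({"role": "user", "content": user_input})
    return messages
```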
Embeddings and Vector Databases
Embeddings convert text into numbers: specifically, lists of numbers (vectors) that represent the meaning of the text.
Why this matters: similar meanings produce similar vectors. "dog" and "puppy" will have embeddings that are mathematically close. "dog" and "telescope" will be far apart.
This lets you do semantic search. Instead of matching exact keywords, you can find text that means similar things.
Vector databases store these embeddings and let you search them efficiently. When you need to find relevant information from thousands of documents, you convert your question into an embedding and search for similar embeddings in your database.
This is foundational infrastructure for many AI applications. You're essentially building a system that can find meaning, not just words.
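As a toy illustration of "close in meaning means close in vector space," here's cosine similarity over made-up 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and come from an embedding model, not hand-written numbers.

```python
import numpy as np

def cosine_similarity(a, b):
    """Close to 1.0 means similar direction; close to 0.0 means unrelated."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors just to show the idea.
dog       = [0.90, 0.10, 0.00]
puppy     = [0.85, 0.15, 0.05]
telescope = [0.00, 0.20, 0.95]

print(cosine_similarity(dog, puppy))      # high, ~0.99
print(cosine_similarity(dog, telescope))  # low,  ~0.02
```

A vector database does this comparison at scale: it stores millions of embeddings and returns the nearest ones to your query vector quickly.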
RAG (Retrieval Augmented Generation)
RAG combines retrieval (finding relevant information) with generation (creating a response).
Here's the pattern:
- User asks a question 
- Convert question to embedding 
- Search a database for relevant content 
- Retrieve relevant documents or data 
- Feed those documents to the LLM as context 
- LLM generates response using that information 
Why use RAG instead of just asking the LLM directly? Because LLMs only know what they were trained on. They don't know your company's internal documents, your product specifications, or yesterday's meeting notes.
RAG lets you give the model access to specific, current information without retraining it.
Most production AI applications use some form of RAG. It's the bridge between generic models and domain-specific knowledge.
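In code, the whole pattern is only a few steps. Here's a minimal sketch with the dependencies passed in as arguments; embed_text, vector_store, and call_llm are placeholders for your own embedding function, vector store client, and LLM client, not a specific library's API.

```python
def answer_with_rag(question, embed_text, vector_store, call_llm, top_k=5):
    # 1-2. Convert the question into an embedding.
    query_vector = embed_text(question)

    # 3-4. Search the vector store and pull back the most relevant chunks of text.
    chunks = vector_store.search(query_vector, top_k=top_k)

    # 5. Feed the retrieved text to the LLM as context.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 6. The model generates a response grounded in that information.
    return call_llm(prompt)
```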
Fine-tuning
Fine-tuning takes a pre-trained model and trains it further on your specific data.
Think of it like specialization. The base model learned general language patterns. Fine-tuning teaches it your specific vocabulary, style, or task.
When to fine-tune vs use RAG:
- RAG: When you need to incorporate changing information or specific facts 
- Fine-tuning: When you need consistent style, domain-specific understanding, or task-specific behavior 
Fine-tuning is more expensive and complex than RAG. Most teams start with RAG and only fine-tune when they hit clear limitations.
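Concretely, fine-tuning usually starts with a dataset of example inputs paired with the outputs you want. Below is a hedged sketch of one common format: JSON Lines of chat-style examples, similar to what several hosted fine-tuning APIs accept. The exact schema depends on your provider, and the Acme support examples are invented.

```python
import json

# Each line is one training example: a conversation plus the ideal assistant reply.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme. Be concise."},
            {"role": "user", "content": "My order hasn't arrived yet."},
            {"role": "assistant", "content": "Sorry about the delay. Here's what I can see about your order..."},
        ]
    },
    # ...hundreds or thousands more examples...
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```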
What AI Agents Actually Are
Here's where terminology gets messy. Everyone calls different things "agents."
At its core, an agent is an AI system that can take actions, not just generate responses. It doesn't just tell you what to do—it can do things.
Practical definition: An agent is an LLM that can:
- Reason about a task 
- Decide what actions to take 
- Use tools to take those actions 
- Observe results 
- Adjust its approach based on what happened 
The key word is "tools." Agents have access to functions they can call: search databases, run calculations, send emails, update records, call APIs.
A simple chatbot isn't an agent—it just talks. An agent that can check your calendar, find a meeting time, and send invites? That's an agent.
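Stripped down, most agents are a loop around an LLM call: reason, act, observe, repeat. Here's a minimal sketch under that framing; call_llm and tools are placeholders for your model client and your own tool functions, and the dict-shaped decision format is an assumption for illustration.

```python
def run_agent(task, call_llm, tools, max_steps=10):
    """Reason -> act -> observe loop. `tools` maps tool names to Python functions."""
    history = [f"Task: {task}"]

    for _ in range(max_steps):
        # Ask the model to reason about the task and pick the next action.
        decision = call_llm(history)  # e.g. {"action": "search_db", "args": {...}} or {"final": "..."}

        if "final" in decision:       # the model decided it's done
            return decision["final"]

        # Execute the chosen tool and feed the observation back into the context.
        result = tools[decision["action"]](**decision["args"])
        history.append(f"Observation from {decision['action']}: {result}")

    return "Stopped: step limit reached."
```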
Tools and Function Calling
Tools (also called functions or plugins) are capabilities you give an agent.
Example tools:
- Search a database 
- Calculate something 
- Fetch current weather 
- Send an email 
- Update a CRM record 
- Run code 
- Search the web 
Function calling is how the LLM tells your system which tool to use. The model doesn't execute the function—it specifies what to call and with what parameters. Your code executes it and returns results.
This is crucial infrastructure. It's how you extend LLMs beyond text generation into actual system integration.
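The usual contract: you describe each tool to the model (name, description, parameters), the model responds with a structured call, and your code dispatches it. Here's a hedged sketch of both sides; the schema shape mirrors the JSON-Schema style several providers use, but check your provider's docs for the exact format.

```python
# Tool description you send to the model alongside the prompt.
weather_tool = {
    "name": "get_weather",
    "description": "Fetch the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch_tool_call(tool_call, registry):
    """The model only *names* the function; your code actually runs it."""
    func = registry[tool_call["name"]]       # e.g. {"get_weather": get_weather}
    result = func(**tool_call["arguments"])  # arguments arrive as structured data
    return result                            # send this back to the model as new context
```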
Reasoning and Chain of Thought
Reasoning in LLMs refers to the model working through a problem step by step, not jumping straight to an answer.
Chain-of-thought is a technique where you prompt the model to "show its work"—to articulate intermediate steps before reaching a conclusion.
Without chain-of-thought:
- Question: "If I have 3 apples and buy 2 more, then give away 1, how many do I have?" 
- Answer: "4" 
With chain-of-thought:
- Question: "If I have 3 apples and buy 2 more, then give away 1, how many do I have? Think step by step." 
- Answer: "Starting with 3 apples. After buying 2 more, I have 3 + 2 = 5 apples. After giving away 1, I have 5 - 1 = 4 apples. Final answer: 4." 
The second approach is more reliable, especially for complex problems. You can also inspect the reasoning to see where it went wrong if the answer is incorrect.
Many newer models, including recent Claude and GPT versions, apply chain-of-thought style reasoning on their own, even when you don't explicitly ask for it.
Hallucinations
Hallucinations happen when an LLM generates information that sounds plausible but is factually wrong.
This isn't a bug—it's inherent to how these models work. They're trained to predict plausible text, not to verify truth. If making something up produces text that fits the pattern, the model will do it.
Examples:
- Citing academic papers that don't exist 
- Inventing product features 
- Making up statistics 
- Creating plausible but wrong historical facts 
Why this matters: You can't eliminate hallucinations completely. You can only reduce them through:
- Better prompting (asking the model to cite sources, admit uncertainty) 
- RAG (grounding responses in real documents) 
- Validation (checking outputs against known facts) 
- Human review (especially for high-stakes decisions) 
This is why "AI said so" isn't sufficient for critical applications. You need validation layers.
Temperature and Other Parameters
Temperature controls randomness in the model's output.
- Low temperature (0-0.3): More focused, deterministic, consistent. Same question will usually get similar answers. 
- High temperature (0.7-1.0): More creative, varied, unpredictable. Same question might get very different answers. 
When to use which:
- Low temperature: Data extraction, code generation, factual Q&A 
- High temperature: Creative writing, brainstorming, varied examples 
Other parameters you'll encounter:
- Top-p: Alternative way to control randomness 
- Max tokens: How long the response can be 
- Stop sequences: Text that tells the model to stop generating 
Most applications use temperature between 0 and 0.7. Higher values are rarely practical for production systems.
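Under the hood, temperature rescales the model's token probabilities before sampling. Here's a toy illustration with made-up scores (logits) for three candidate tokens; a real vocabulary has tens of thousands of entries, and providers typically treat temperature 0 as picking the top token outright.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

print(softmax_with_temperature(logits, 0.2))  # ~[0.99, 0.01, 0.00] -- near-deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.66, 0.24, 0.10] -- more varied
```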
Evaluation Metrics
How do you measure if an AI application is working? This is harder than it sounds.
Traditional metrics:
- BLEU/ROUGE: Compare generated text to reference text. Originally developed for translation and summarization. Limited usefulness for open-ended generation. 
- Perplexity: How "surprised" the model is by the next word. Lower is better. Useful for model comparison, not for production monitoring. 
Modern approaches:
- LLM-as-judge: Use another LLM to evaluate responses for quality, relevance, accuracy. Scales better than human review but adds cost and latency. 
- Human evaluation: Still the gold standard, but expensive and slow. 
- Task-specific metrics: Did the agent complete the task? Did the user get what they needed? 
The honest truth: evaluating generative AI is still evolving. There's no single metric that captures "is this good?" You usually need multiple approaches.
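In practice, evals often start as a plain test loop over a handful of known cases. Here's a sketch along those lines; run_app is a placeholder for whatever produces your system's answer, the test cases are invented, and the checks are simple substring assertions rather than a full grading framework.

```python
eval_cases = [
    {"input": "What is your refund window?", "must_contain": "30 days"},
    {"input": "Do you ship to Canada?",      "must_contain": "yes"},
]

def run_evals(run_app, cases):
    """Run each case through the app and report which expectations failed."""
    failures = []
    for case in cases:
        output = run_app(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures.append({"input": case["input"], "got": output})
    passed = len(cases) - len(failures)
    print(f"{passed}/{len(cases)} evals passed")
    return failures
```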
Context Windows and Memory
Context window is how much text an LLM can consider at once. Measured in tokens.
Current context windows:
- GPT-4 Turbo: 128k tokens (~96k words) 
- Claude 3.5 Sonnet: 200k tokens (~150k words) 
- Gemini 1.5 Pro: 2 million tokens (~1.5 million words) 
Larger windows let you include more conversation history, documents, or data. But they also cost more and can be slower.
Memory in AI applications means maintaining context across conversations. Since LLMs are stateless, you have to manage this yourself—storing conversation history and including relevant parts in each request.
Some systems implement "memory" by storing summaries of past conversations and retrieving them when relevant. This is really just managed context and retrieval, not memory in the human sense.
Guardrails and Safety
Guardrails are rules and checks you implement to keep AI systems safe and on-topic.
Types of guardrails:
- Input filtering: Block harmful or off-topic prompts 
- Output validation: Check generated content before showing it to users 
- Content moderation: Detect and filter inappropriate content 
- Behavioral constraints: Keep the AI within defined boundaries 
Example: A customer service agent might have guardrails preventing it from discussing politics, making promises about refunds beyond policy limits, or sharing internal company information.
Guardrails are implemented through:
- System prompts that define boundaries 
- Validation rules in your application code 
- Secondary models that check outputs 
- Human review for high-risk content 
Critical point: Guardrails aren't foolproof. Determined users can often find ways around them (called "jailbreaking"). Layer multiple approaches.
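A first layer of guardrails can be ordinary validation code that runs before a response reaches the user. Here's a simplistic sketch; the blocked topics and the 90-day refund rule are invented for illustration, and real systems usually add a moderation model on top of checks like these.

```python
import re

BLOCKED_TOPICS = ["politics", "internal roadmap"]  # invented examples

def validate_output(text):
    """Return (ok, reason). Block responses that break simple policy rules."""
    lowered = text.lower()

    for topic in BLOCKED_TOPICS:
        if topic in lowered:
            return False, f"mentions blocked topic: {topic}"

    # Block promises of refund windows longer than the (invented) 90-day policy.
    if "refund" in lowered:
        for match in re.finditer(r"(\d+)\s*-?\s*day", lowered):
            if int(match.group(1)) > 90:
                return False, "promises a refund window beyond the 90-day policy"

    return True, "ok"
```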
Observability, Monitoring, and Evals
These terms get used interchangeably but mean different things.
Observability is your ability to understand what's happening inside your AI system. Can you see:
- What prompt was sent? 
- What context was included? 
- How did the model reason? 
- What tools did it call? 
- What was the final output? 
Monitoring is watching your system over time for issues:
- Latency spikes 
- Error rates increasing 
- Cost trending up 
- Quality degrading 
Evals (evaluations) are tests you run to check if your system works:
- Does it handle edge cases correctly? 
- Does output match expected quality? 
- Does it follow your domain-specific rules? 
All three are necessary. Observability helps you debug. Monitoring catches problems. Evals prevent them.
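The simplest useful version of observability is logging a structured record of every model call. Here's a sketch using only the standard library; the field names are illustrative, not a particular tracing product's schema, and call_llm is a placeholder for your client.

```python
import json
import time
import uuid

def logged_llm_call(call_llm, prompt, context, log_file="llm_calls.jsonl"):
    """Wrap a model call so every request/response pair can be traced later."""
    start = time.time()
    output = call_llm(prompt, context)
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": start,
        "latency_s": round(time.time() - start, 3),
        "prompt": prompt,
        "context": context,
        "output": output,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return output
```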
Putting It Together
These concepts aren't isolated—they work together in real systems.
Here's a typical AI application architecture:
- User sends a query (prompt) 
- RAG system retrieves relevant documents (embeddings, vector database) 
- Prompt is constructed with retrieved docs and conversation history (context management) 
- LLM generates response (potentially using chain-of-thought reasoning) 
- Agent decides if tools are needed (function calling) 
- Output is validated (guardrails, evals) 
- Response is returned and logged (observability) 
- System monitors quality over time (monitoring) 
Each piece matters. Miss one and your system becomes unreliable.
What Actually Matters
You don't need to understand every detail of how LLMs work internally. You do need to understand:
- What they're good at: Pattern matching, text generation, reasoning through structured problems 
- What they're bad at: Math, factual accuracy without sources, consistency without constraints 
- How to work with them: Good prompting, proper context management, validation layers 
- What can go wrong: Hallucinations, context limits, cost overruns, quality degradation 
The technology is powerful but not magic. It's a tool. Like any tool, it works best when you understand what it's actually doing.
Keep Learning
This guide covers the fundamentals, but the field moves fast. New capabilities emerge. Best practices evolve. Terms shift meaning.
The concepts here (prompts, context, agents, evaluation) will remain relevant even as specific models and techniques change. They're the foundation everything else builds on.
When you encounter a new term or technique, ask:
- What problem does this solve? 
- How does it fit with what I already know? 
- Is this genuinely new, or an old idea renamed? 
Most importantly: build things. Reading about AI is useful. Building with it teaches you what actually matters.
The concepts in this guide will make more sense once you've debugged a misbehaving prompt, optimized token usage to reduce costs, or traced why an agent took an unexpected action.
Theory meets reality in production. That's where you learn what really works.