
What Is a Large Language Model (LLM)?

How LLMs work, how the major models compare, and how to build production systems on top of them.

The short answer

Large language models are the models behind ChatGPT, Claude, and Gemini. They are trained on massive text datasets — billions of documents, books, code repositories, and web pages — to learn the statistical patterns of language well enough to predict and generate it.

Everything from customer support chatbots to code assistants to document summarizers runs on some version of this technology. The "large" in large language model refers to the scale of both the model parameters (hundreds of billions in top-tier models) and the training data — not to any specific architectural property.

They are not databases, search engines, or reasoning engines in the formal sense. They predict plausible next tokens. That distinction has real consequences for how you design systems around them.

How LLMs work

The foundation is the transformer architecture: a neural network design that processes text as sequences of tokens (roughly 0.75 words per token on average) and learns which tokens attend to which other tokens — which parts of a sentence are most relevant to interpreting each other.
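As a rough illustration of what "attending" means, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real transformers first pass each token through learned query, key, and value projections and run many attention heads in parallel, whereas this sketch uses three made-up token vectors directly for all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors using those attention weights.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Three made-up 2-dimensional token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
```

Each row of `result` is a weighted blend of all the token vectors, with more weight given to the tokens whose keys best match that row's query.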

Training is built around a deceptively simple objective: next-token prediction. Given a sequence of tokens, predict what comes next. Run that task across trillions of tokens of internet-scale text (Meta reports that Llama 3 was pretrained on more than 15 trillion tokens) and the model learns grammar, facts, reasoning patterns, code syntax, and conversational structure as side effects of getting good at prediction.
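The next-token objective can be shown at toy scale with a bigram counter. This is a sketch, not how an LLM is trained: a real model learns billions of parameters by gradient descent over subword tokens, while this example only counts which word follows which in a three-sentence, made-up corpus. The function names are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        # A whitespace split stands in for a real subword tokenizer.
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            follow_counts[current][following] += 1
    return follow_counts

def predict_next(follow_counts, token):
    """Return the most frequently observed next token, or None if unseen."""
    candidates = follow_counts.get(token)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

corpus = [
    "the model predicts the next token",
    "the model learns patterns",
    "each next token is sampled",
]
counts = train_bigram(corpus)
print(predict_next(counts, "the"))   # most common follower of "the"
print(predict_next(counts, "next"))  # most common follower of "next"
```

Even this counter picks up regularities in its tiny corpus. Scaling the same objective to far more data and a far more expressive model is what produces the broader capabilities described above.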

The context window is the amount of text a model can process in a single request, measured in tokens. This matters in practice. A 4,000-token context window (typical of the GPT-3.5 era; the original GPT-3 was limited to about 2,000) could handle a short document and a few exchanges. A 200,000-token context window (Claude 3.5 Sonnet) can hold a 500-page book, a small-to-medium codebase, or hours of conversation history. Context window size determines what you can pass to the model at query time, which directly affects whether RAG architectures are necessary and how you structure prompts.
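A quick way to reason about context limits is to estimate token counts from word counts. The sketch below uses the rough 0.75 words-per-token average mentioned earlier. It is only an approximation: for real budgeting, use the provider's own tokenizer. The `reserved_for_output` parameter is a hypothetical allowance, assuming the model's reply counts against the same window.

```python
WORDS_PER_TOKEN = 0.75  # rough average; real tokenizers vary by language and content

def estimate_tokens(text):
    """Estimate the token count from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_context(text, context_window, reserved_for_output=1000):
    """Check whether the text fits while leaving room for the model's reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window

document = "word " * 6000  # stand-in for a 6,000-word document

print(estimate_tokens(document))
print(fits_in_context(document, 4_000))
print(fits_in_context(document, 200_000))
```

A 6,000-word document comes out at roughly 8,000 tokens, which overflows a 4,000-token window but is a small fraction of a 200,000-token one.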

Inference is the process of running the model to generate a response. The model produces one token at a time, each chosen probabilistically based on the input and all previously generated tokens. Temperature and other sampling parameters control how deterministic or varied the output is: a lower temperature concentrates probability on the likeliest tokens, and a higher temperature spreads it across more candidates.
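The effect of temperature can be sketched in a few lines. The vocabulary and raw scores (logits) below are made up for illustration. A real model produces scores over tens of thousands of tokens, and providers usually layer further controls such as top-p on top of this.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Draw one token at random according to the temperature-adjusted probabilities."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

cold = apply_temperature(logits, 0.2)  # nearly all probability on "cat"
hot = apply_temperature(logits, 2.0)   # probability spread more evenly

rng = random.Random(0)  # seeded so runs are repeatable
token = sample_token(vocab, logits, 0.7, rng)
```

At a temperature of 0.2 the top token takes almost all of the probability mass, so repeated calls rarely differ. At 2.0 the distribution flattens and the other tokens are sampled far more often.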

Most production LLMs go through additional training after the base pre-training step. Instruction tuning teaches the model to follow directions. RLHF (Reinforcement Learning from Human Feedback) aligns model outputs with human preferences for helpfulness, safety, and accuracy. These steps are why ChatGPT behaves like a useful assistant rather than a raw text-continuation engine.

Major LLMs compared

The market has consolidated around a few dominant models and a growing set of capable open-source alternatives. Here is how the most widely used ones compare:

Model	Provider	Context window	Best for	Access
GPT-4o	OpenAI	128k tokens	General purpose, vision, function calling	API
Claude 3.5 Sonnet	Anthropic	200k tokens	Long-form content, coding, instruction following	API
Gemini 1.5 Pro	Google	1M tokens	Multi-modal tasks, long-document analysis	API + Vertex AI
Llama 3.1	Meta	128k tokens	Self-hosted deployments, fine-tuning, no vendor lock-in	Open weights
Mistral Large	Mistral AI	32k tokens	European data residency, cost-sensitive workloads	API + self-hosted

Context windows and capabilities change frequently. Check each provider's documentation for current specs before committing to an architecture.

Three ways to adapt LLMs to your use case

You do not need to train a model from scratch to build useful products on LLMs. There are three techniques for making a model behave the way you need, each with different costs and trade-offs:

Prompting: Write better instructions. System prompts, few-shot examples, chain-of-thought instructions — all of this shapes model behavior without touching model weights. It is the cheapest approach and the right starting point. Limitations: you are constrained by the context window, and behavior consistency degrades on complex or unusual inputs.
RAG (Retrieval-Augmented Generation): Give the model access to your knowledge base at query time. Documents are chunked, embedded into vectors, and stored in a vector database. At query time, semantically similar chunks are retrieved and injected into the prompt alongside the user's question. The model answers from your data, not just its training knowledge. Good for dynamic or proprietary content — product docs, internal policies, customer records. The right approach for most enterprise knowledge applications.
Fine-tuning: Continue training the model on your own labeled examples to change its behavior, tone, output format, or domain vocabulary. This adjusts model weights — it produces a new version of the model. High cost to set up and maintain (retraining required when your data changes). Warranted when prompting and RAG cannot achieve the behavior you need — for example, enforcing a rigid output format across all queries, or handling very specialized domain jargon consistently.
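The RAG flow described above (chunk, embed, retrieve, inject) can be sketched end to end at toy scale. Everything here is a stand-in: the bag-of-words `embed` function substitutes for a learned embedding model, the in-memory list substitutes for a vector database, and the chunks and prompt wording are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, retrieved):
    """Inject the retrieved chunks into the prompt alongside the question."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available on weekdays from 9 to 5.",
    "Enterprise plans include single sign-on.",
]
question = "How many days do I have for refunds?"
top = retrieve(question, chunks)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the model. Swapping in a real embedding model and vector store changes the quality of retrieval but not the shape of the pipeline.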

Start with prompting. If prompting does not get you to your quality target, add RAG. Fine-tune only when you have identified specific, measurable gaps that RAG plus prompt engineering cannot close — and you have labeled training data to work with.

What LLMs can't do

LLMs are genuinely capable, but they have hard limits that affect every production system built on them. These are not edge cases — they are properties of how the technology works.

They predict text rather than reason: An LLM generates the most statistically plausible continuation of your input. It can produce outputs that look like careful reasoning and still be wrong. It does not verify claims. It does not have a model of the world it checks against. Outputs that sound confident are not more likely to be correct.
Hallucination: Models fabricate facts — names, citations, statistics, code behavior — with fluent confidence. This is not a bug being fixed; it is a property of the generation mechanism. Every production system needs output validation or retrieval grounding to catch it.
Stale knowledge: Training data has a cutoff. Events, products, prices, regulations, and people that changed after the cutoff date are either missing or wrong. For anything time-sensitive, RAG is the fix — retrieve current information at query time rather than relying on training knowledge.
No persistent memory by default: A fresh API call has no memory of previous conversations unless you explicitly include conversation history in the context. At scale, managing conversation context — deciding what to keep, trim, or summarize — is a meaningful engineering problem.
Expensive at scale: LLM inference is billed per token. A system running 100,000 queries per month at 2,000 tokens each consumes 200 million billable tokens every month. Cost modeling before you choose a model tier is not optional; it changes architecture decisions significantly.
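Two of the limits above lend themselves to a short sketch: trimming conversation history to a token budget, and estimating monthly token spend. The price below is a hypothetical placeholder, not any provider's actual rate, and real pricing usually differs for input and output tokens. The word-count tokenizer passed to `trim_history` is likewise a stand-in.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

def monthly_tokens(queries_per_month, tokens_per_query):
    """Total tokens consumed per month."""
    return queries_per_month * tokens_per_query

def monthly_cost(total_tokens, price_per_million_tokens):
    """Cost in the same currency as the per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million_tokens

history = ["hello there", "hi how can I help", "what is my order status"]
recent = trim_history(history, max_tokens=9, count_tokens=lambda m: len(m.split()))

total = monthly_tokens(100_000, 2_000)  # the example workload from the text
HYPOTHETICAL_PRICE = 5.00  # assumed USD per million tokens; check your provider's price sheet
cost = monthly_cost(total, HYPOTHETICAL_PRICE)
```

Dropping the oldest messages first is the simplest policy. Summarizing older turns instead of discarding them is a common alternative when early context still matters.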

How to build with LLMs

Building a production LLM application follows a consistent sequence. Skipping steps three and four is where most teams run into trouble.

1. Choose a model based on your requirements. Define your context window needs, acceptable latency, cost budget, and data residency requirements first. Then pick a model that clears those constraints. GPT-4o and Claude 3.5 Sonnet are strong defaults for most US business applications. If you need self-hosted for data sensitivity or cost reasons, Llama 3 is the current open-source benchmark.
2. Pick an orchestration framework. LangChain and LlamaIndex handle common patterns (RAG pipelines, tool calling, memory management, multi-step chains) so you do not build them from scratch. For simpler applications or teams with ML experience, direct API access (OpenAI SDK, Anthropic SDK) is cleaner and easier to debug. Avoid adding an orchestration framework just because it exists: it adds abstraction layers that make debugging harder.
3. Build your eval harness before your UI. Define what good output looks like for your use case. Create a labeled test set with representative inputs and expected outputs. Run every prompt change and model change against that test set before shipping. Teams that ship first and evaluate later spend months undoing production problems.
4. Monitor outputs in production. Log every input and output. Track quality metrics, latency, cost per query, and error rates. Model behavior can shift as input distributions change: a model that performs well on your test set may degrade on real user inputs you did not anticipate. Catching degradation early is only possible if you are measuring.
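An eval harness does not need to be elaborate to be useful. The sketch below runs a labeled test set through a model function and reports the pass rate and the failing cases. The `fake_model` function is a placeholder standing in for a real API call, and the substring check is one assumed scoring rule among many; both would be replaced in a real project.

```python
def run_evals(model_fn, test_cases, passes):
    """Run every test case through the model; return the pass rate and the failures."""
    failures = []
    for case in test_cases:
        output = model_fn(case["input"])
        if not passes(output, case["expected"]):
            failures.append(
                {"input": case["input"], "expected": case["expected"], "got": output}
            )
    pass_rate = 1 - len(failures) / len(test_cases)
    return pass_rate, failures

def contains_expected(output, expected):
    """Simple scoring rule: the expected answer appears somewhere in the output."""
    return expected.lower() in output.lower()

def fake_model(prompt):
    """Placeholder for a real model call, with canned answers for the demo."""
    canned = {"Capital of France?": "The capital is Paris.", "2 + 2?": "5"}
    return canned.get(prompt, "")

test_cases = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]
pass_rate, failures = run_evals(fake_model, test_cases, contains_expected)
```

Running the same `test_cases` after every prompt or model change turns "did this get better?" into a number you can compare, and the `failures` list shows exactly which inputs regressed.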

We build production LLM systems for US businesses across these layers — model selection, RAG architecture, eval frameworks, and deployment. See our custom LLM development and AI software development services for where we typically start with new clients.
