An outsider's introduction to developing a Large Language Model
Dr. Robert W. Malone
How to Train Your AI
A Plain-English Guide to Building, Teaching, and Safeguarding Artificial Intelligence
Artificial
intelligence is no longer the stuff of science fiction. It answers our
questions, writes our emails, and holds conversations that feel
startlingly human. But how does it actually work? How is an AI built,
taught, and kept from going off the rails? The answer is more
fascinating, and more human, than most people realize.
Part One: Building the Brain
Every
AI starts with a goal. Do you want it to recognize faces? Translate
languages? Answer questions? That goal determines everything that
follows. Once the goal is clear, the real work begins, and the first
ingredient needed is data. Enormous amounts of it.
For a Large
Language Model, which is the kind of AI behind chatbots and writing
assistants, that data is text. Trillions of words drawn from books,
websites, academic papers, and more. The goal is to expose the model to
as much of human language and knowledge as possible, because AI learns
from examples much as humans do: through exposure and repetition.
At
the heart of the AI is something called a neural network, a
mathematical structure loosely inspired by the human brain, made up of
layers of connected nodes that pass information to one another. The
network’s behavior is determined by billions of tiny numerical values
called “weights,” which represent the strength of connections between
those nodes. Training the AI is essentially the process of finding the
right weights.
Training works through a beautifully simple idea:
prediction. The model is shown a passage of text with the next word hidden,
and it tries to guess what that word is. It gets scored on how wrong it
was. Then a process called backpropagation works out how much each weight
contributed to the error, and each weight is nudged slightly in the
direction that would have made the guess better. Do this billions of
times across trillions of words, and something remarkable happens: the
model does not just learn grammar. It absorbs facts, reasoning patterns,
and context. It begins to understand language, or something that
functions very much like understanding.
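The predict, score, and adjust loop described above can be sketched in a few lines. This is a toy illustration, not how a production model is built: it assumes a made-up two-sentence corpus, a single table of weights rather than a layered network, and a hand-picked learning rate. With only one layer the error signal can be computed directly; backpropagation is the same idea carried backward through many layers.

```python
import math

# Toy corpus (an assumption for illustration; real models see trillions of words).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

# One weight per (previous word, next word) pair, all starting at zero.
weights = [[0.0] * V for _ in range(V)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean cross-entropy: how surprised the model is by each actual next word."""
    total = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        probs = softmax(weights[index[prev]])
        total -= math.log(probs[index[nxt]])
    return total / (len(corpus) - 1)

def train(steps=200, learning_rate=0.5):
    for _ in range(steps):
        for prev, nxt in zip(corpus, corpus[1:]):
            row = weights[index[prev]]
            probs = softmax(row)
            for j in range(V):
                # Error signal: predicted probability minus 1 for the true word, 0 otherwise.
                grad = probs[j] - (1.0 if j == index[nxt] else 0.0)
                row[j] -= learning_rate * grad  # nudge the weight slightly

def predict(prev):
    """Return the word the model considers most likely to follow `prev`."""
    probs = softmax(weights[index[prev]])
    return vocab[probs.index(max(probs))]

loss_before = average_loss()
train()
loss_after = average_loss()
```

After training, `loss_after` is lower than `loss_before`, and `predict("sat")` returns `"on"` because that is the only word that ever follows "sat" in the toy corpus.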
This phase, called
pre-training, is staggeringly expensive. It requires thousands of
specialized computer chips running for weeks or months, consuming vast
amounts of electricity. The result is a “base model” that is
extraordinarily good at generating fluent text, but also unpredictable
and sometimes problematic. It has learned from all of human writing,
which includes the full spectrum of human expression: the inspiring and
the offensive, the truthful and the false.
Part Two: Teaching It to Behave
A
raw, pre-trained model is a bit like someone who has read everything
ever written but has never been taught manners, ethics, or professional
conduct. The next phase of development is about instilling those
qualities, and it involves several overlapping techniques.
Fine-Tuning
After
pre-training, the model is trained again, this time on a much smaller,
carefully curated set of high-quality conversations and responses. This
teaches it to behave like a helpful, professional assistant rather than a
raw text predictor. The model’s weights shift gradually toward
producing the kinds of responses a thoughtful person would give.
Reinforcement Learning from Human Feedback (RLHF)
One
of the most powerful techniques used today is called Reinforcement
Learning from Human Feedback, or RLHF. The AI generates several
different responses to the same prompt, and human reviewers rank them
from best to worst. A separate “reward model” is trained to predict what
humans prefer. Then the main AI is trained to maximize that reward,
essentially learning to produce responses that real people find helpful,
accurate, and appropriate.
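The reward-model step can be sketched as follows. This is a minimal illustration under loud assumptions: the three response "features" and the four human comparisons are invented, and the reward is a simple weighted sum, whereas a real reward model is itself a large neural network that reads the full text. The sketch covers only learning the reward from preferences, not the later step of training the main AI against it.

```python
import math

# Hypothetical features for each response:
# (answers_question, is_polite, is_insulting)
comparisons = [
    # (features of the response the human preferred, features of the one rejected)
    ((1, 1, 0), (0, 1, 0)),
    ((1, 0, 0), (1, 0, 1)),
    ((1, 1, 0), (1, 0, 1)),
    ((0, 1, 0), (0, 0, 1)),
]

reward_weights = [0.0, 0.0, 0.0]

def reward(features):
    """Score a response as a weighted sum of its features."""
    return sum(w * x for w, x in zip(reward_weights, features))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(steps=500, learning_rate=0.1):
    for _ in range(steps):
        for preferred, rejected in comparisons:
            # Probability the model currently gives to the human's actual choice.
            p = sigmoid(reward(preferred) - reward(rejected))
            # Raise that probability: the gradient of log(p) is (1 - p) * (preferred - rejected).
            for i in range(len(reward_weights)):
                reward_weights[i] += learning_rate * (1.0 - p) * (preferred[i] - rejected[i])

train_reward_model()
```

Once trained, the reward model scores responses it has never seen compared: a polite answer such as `(1, 1, 0)` receives a higher reward than an insulting non-answer such as `(0, 0, 1)`.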
Through this process, guardrails, or
more formally safety mitigations and alignment measures, get woven
directly into the model’s weights. It is not that a rulebook gets
programmed in. It is that the model’s deeply ingrained tendencies are
shaped, through thousands of examples and feedback cycles, to steer away
from harmful outputs. Think of the difference between giving a child a
printed list of rules versus raising them with consistent guidance,
feedback, and example. The AI’s values, such as they are, develop
through the latter approach.
Constitutional AI
Some
companies go a step further, training the AI to critique its own
responses against a set of core principles that function essentially as a
constitution for the model’s behavior. The AI learns to ask itself
whether a response is honest and whether it could cause harm, then
revise accordingly before settling on a final answer.
System Prompts and Hard Filters
Layered
on top of the trained behavior are more traditional software tools.
System prompts are invisible sets of instructions given to the AI before
each conversation begins, telling it how to behave in a specific
context. Hard filters are conventional code sitting outside the model
that scan inputs and outputs for prohibited content and block them
before they reach the user. These act like a bouncer at the door, while
the trained behavior acts like the internalized conscience of the person
inside.
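A hard filter of this kind might look something like the sketch below. Everything named here is an assumption for illustration: the two blocked patterns, the refusal message, and `fake_model`, which stands in for a real model call. Production systems often use trained classifiers rather than simple pattern lists.

```python
import re

# Hypothetical blocklist, checked by ordinary code outside the model.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # looks like a US Social Security number
    re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that."

def violates_policy(text):
    """Return True if any blocked pattern appears in the text."""
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)

def fake_model(system_prompt, user_message):
    """Stand-in for a real model call (an assumption for this sketch)."""
    return f"You said: {user_message}"

def guarded_chat(user_message, model=fake_model,
                 system_prompt="You are a polite support assistant."):
    # Check the input before the model ever sees it.
    if violates_policy(user_message):
        return REFUSAL
    reply = model(system_prompt, user_message)
    # Check the output before the user ever sees it.
    if violates_policy(reply):
        return REFUSAL
    return reply
```

The filter runs twice, once on the way in and once on the way out, so it can block a harmful request and also catch a harmful reply the model produced from an innocent-looking one.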
System prompts can even include tiered access,
essentially passwords or keys that allow different users to unlock
different levels of AI capability. An administrator with the right key
might access features unavailable to a general user. However, this
approach has real limitations: because the AI processes system prompts
and user messages through the same mechanism, a clever user may be able
to extract or circumvent them. For high-stakes applications, true
security is better handled by the surrounding software rather than by
trusting the AI to enforce it.
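One way to keep that enforcement in the surrounding software is sketched below. The role names, tool names, and functions are all hypothetical. The key design choice is that the user's role comes from the application's own login system, never from anything typed into the conversation, so no clever phrasing can change it.

```python
# Hypothetical permission table, kept in ordinary code outside the model.
ROLE_PERMISSIONS = {
    "admin": {"search_docs", "delete_records"},
    "user": {"search_docs"},
}

def is_allowed(role, tool_name):
    """Check the permission table; unknown roles are allowed nothing."""
    return tool_name in ROLE_PERMISSIONS.get(role, set())

def execute_tool_request(role, tool_name):
    """Run a tool the model asked for, but only if the caller's role permits it.

    `role` must come from the authenticated session, not from the chat text.
    """
    if not is_allowed(role, tool_name):
        raise PermissionError(f"role {role!r} may not call {tool_name!r}")
    return f"ran {tool_name}"  # placeholder for the real tool
```

Even if a user talks the model into requesting `delete_records`, the request fails here unless the session itself belongs to an administrator.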
Part Three: Testing in the Sandbox
Before
any AI is released to the public, it goes through a critical phase of
testing in what is called a sandbox, which is a controlled, isolated
environment where the model can be probed and stressed without any risk
to real users or real systems. Think of it as a flight simulator for AI:
trainee pilots can crash the plane a hundred times without anyone
getting hurt.
In the sandbox, engineers can safely test dangerous
scenarios, observe unfiltered behavior, and experiment with new safety
measures before deploying them. The AI might be cut off from the
internet or sensitive systems, so even if it misbehaves, the damage is
fully contained. When AI is given tools such as the ability to browse
the web, run code, or interact with other software, those capabilities
are sandboxed first to understand what could go wrong.
A key part
of sandbox testing is something called red-teaming. Researchers,
sometimes humans and sometimes other AI systems, try their hardest to
make the model misbehave: to get it to say something harmful, reveal
restricted information, or bypass its guidelines through clever
phrasing, roleplay scenarios, or encoding tricks. This is ethical
hacking for AI. The vulnerabilities discovered through red-teaming are
addressed, typically through further training or additional filters,
before the model goes live.
Part Four: The Ongoing Challenge of Jailbreaking
One
of the most sobering truths about AI safety is that it is never
finished. Because guardrails are embedded in the model’s weights rather
than in explicit, readable code, they cannot be mathematically verified
the way traditional software can. You cannot read the weights and
confirm they are safe. You have to probe the model through testing and
observe how it behaves.
This creates what the industry calls a
jailbreaking problem. Users who are determined to get an AI to misbehave
can sometimes succeed by finding gaps in its training, asking questions
in roundabout ways, using fictional framing, switching languages, or
employing other creative techniques to make the model’s safety instincts
fail to activate. It is an ongoing arms race: researchers find
exploits, developers patch them, and new exploits emerge.
There is
also a fundamental tension that every AI developer grapples with:
guardrails that are too tight make the AI useless, refusing to discuss
anything remotely sensitive even for entirely legitimate reasons.
Guardrails that are too loose allow harm. Finding and maintaining the
right balance requires constant human judgment, ongoing monitoring of
real-world conversations, and regular retraining as new problems are
discovered.
Part Five: The Hallucination Problem
Of
all the challenges in AI development, hallucinations may be the most
insidious. Unlike a jailbreak, where a bad actor has to work
deliberately to extract harmful content, hallucinations happen on their
own, uninvited, in the middle of otherwise helpful conversations. And
they do so with complete confidence.
An AI hallucination is when
the model confidently states something that is factually wrong,
inventing people, citations, events, statistics, or details that simply
do not exist. The term is apt: the AI is not lying intentionally. It is
generating text that sounds plausible based on patterns in its training
data, even when no factual basis exists. It is the dark side of the same
fluency that makes these models so impressive.
The root cause
goes back to how LLMs work. They are trained to predict the most
statistically likely next word. They do not know facts the way a
database does; they have learned patterns associated with facts. When
asked something outside their confident knowledge, they do not naturally
say they do not know. They do what they were trained to do: generate
plausible-sounding text. The result can be a well-written, confidently
delivered, completely fabricated answer.
Retrieval-Augmented Generation (RAG)
One
of the most effective practical solutions is called Retrieval-Augmented
Generation, or RAG. Rather than relying solely on what the model
memorized during training, RAG connects the AI to an external knowledge
source, such as a database, a document library, or the internet, at the
moment a question is asked. The model retrieves relevant, current,
verified information first, then generates its answer based on that
retrieved content rather than pure memory. Think of the difference
between answering a question from memory versus being allowed to look it
up first. RAG dramatically reduces hallucinations on factual questions
because the model is working from real source material it can reference.
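The look-it-up-first idea can be sketched in a few lines. This is a simplified illustration: the three documents are invented, and retrieval here just counts shared words, whereas real systems usually compare numerical "embeddings" of meaning. The sketch stops at building the prompt; sending it to a model is left out.

```python
import re

# Hypothetical knowledge source (an assumption for this sketch).
DOCUMENTS = [
    "The return window for unopened items is 30 days from delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]

def tokenize(text):
    """Lowercase the text and split it into a set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Return the top_k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents=DOCUMENTS):
    """Place the retrieved passages in front of the question for the model."""
    passages = retrieve(question, documents)
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say you do not know.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )
```

For the question "How long does shipping take?", `retrieve` returns the shipping passage, and the model is then asked to answer from that text rather than from memory.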
Teaching the Model to Say It Does Not Know
One
of the most powerful behavioral interventions is teaching the model to
express uncertainty. Through fine-tuning and RLHF, models can be
specifically rewarded for acknowledging when they are not certain and
penalized for confidently stating things that turn out to be wrong. This
does not prevent the model from being wrong, but it makes it less likely to be
wrong with confidence, which is arguably the more dangerous form of
hallucination. A hedged wrong answer invites the user to verify. A
confident wrong answer does not.
Chain-of-Thought Reasoning
Instead
of jumping straight to an answer, models can be trained or prompted to
reason step by step, showing their work so to speak. This approach,
called chain-of-thought reasoning, tends to reduce hallucinations
because each reasoning step can catch errors in the previous one. It
also makes the model’s thinking visible, so users can spot where the
logic went wrong rather than simply receiving a confident wrong
conclusion.
Grounding, Citations, and Fact-Checking Layers
Models
can be designed to cite their sources, pointing to specific documents
or passages that support their claims. This forces the model to anchor
its answers in retrievable evidence rather than relying on statistical
intuition alone. If it cannot cite a source, it should say so. Many
enterprise AI systems build this in as a hard requirement.
Some
systems go further, adding a second AI on top of the first, one whose
sole job is to verify the claims made in the first model’s response
against a trusted knowledge base. If a claim cannot be verified, it gets
flagged or removed. A related technique called self-consistency
checking has the model generate multiple independent answers to the same
question and compare them. If all versions agree, confidence is higher.
If they contradict each other, the model flags uncertainty.
Hallucinations tend to be inconsistent across attempts, while true
knowledge tends to be stable.
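Self-consistency checking reduces to a simple vote. In the sketch below, the list of answers stands in for several independent runs of the same model, and the 60 percent agreement threshold is an arbitrary assumption chosen for illustration.

```python
from collections import Counter

def self_consistency(answers, min_agreement=0.6):
    """Compare several sampled answers to the same question.

    Returns (answer, agreement) when enough samples agree,
    or (None, agreement) to signal that the model should flag uncertainty.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    if agreement >= min_agreement:
        return top_answer, agreement
    return None, agreement
```

Five samples of "Paris", "paris", "Paris", "Lyon", "Paris" yield the answer "paris" with 80 percent agreement, while three different dates for the same event yield `None`, which is the cue to express doubt rather than pick one.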
Specialized Models and Controlled Creativity
Counterintuitively,
trying to make a model know everything can increase hallucinations. A
model trained specifically on medical literature, for example,
can hallucinate less on medical questions than a general-purpose model
trying to cover all of human knowledge. Specialized models have a
narrower but more reliable knowledge base.
There is also a
technical setting called “temperature,” applied when the model generates
text rather than stored in its weights, that controls
how random its word choices are. High temperature produces more
varied, imaginative responses, but also more hallucinations. Lower
temperature makes the model more conservative, sticking closer to
patterns it has seen before. For factual applications, dialing down the
temperature reduces the risk of the model wandering into invented
territory.
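The effect of temperature is easy to see with a small calculation. The model's raw scores for each candidate word are divided by the temperature before being converted into probabilities. The three scores below are made up for illustration.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert raw word scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next words.
scores = [2.0, 1.0, 0.1]

cool = softmax_with_temperature(scores, 0.5)  # low temperature
warm = softmax_with_temperature(scores, 1.5)  # high temperature
```

At the low temperature, the top-scoring word takes most of the probability, so the model almost always picks it. At the high temperature, the probabilities are closer together, so less likely words are chosen more often.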
The Human in the Loop
For
high-stakes applications in medicine, law, and finance, the most
reliable safeguard remains a human expert reviewing the AI’s output
before it is acted upon. AI handles the heavy lifting; a human catches
the errors. No current technique eliminates hallucinations entirely.
They are, to some extent, a fundamental consequence of how LLMs work.
The goal of current research is not perfection; it is making
hallucinations rarer, less confident, more detectable, and less
consequential.
Part Six: Can a Large Language Model Think?
This
is one of the most debated questions in all of artificial intelligence,
and depending on who you ask, the answer ranges from an emphatic yes to
an equally emphatic no. Can a Large Language Model actually think? The
honest answer is that it depends entirely on what you mean by the word.
On
the surface, the case against thinking seems straightforward. An LLM
does not reason the way a human does. It has no experiences, no
curiosity, no inner life. It does not sit quietly and ponder a problem.
What it does, at a mechanical level, is predict the next most likely
word based on patterns absorbed from vast amounts of human text. It is,
in that sense, an extraordinarily sophisticated pattern-matching engine.
Critics who hold this view often say that LLMs do not think at all;
they merely simulate thinking with enough skill to be convincing.
But
that view, while valid, leaves some important things unexplained. When
an LLM solves a novel logic puzzle it has never encountered before, is
it just matching patterns? When it catches an error in a legal argument,
translates irony between languages, or generates a metaphor that
genuinely illuminates an idea, what exactly is happening? The outputs
sometimes go well beyond what simple pattern retrieval would predict.
Something is being processed, recombined, and applied in ways that at
least resemble reasoning.
What the Research Suggests
Researchers
have found that large language models, particularly those trained at
scale, develop internal representations of concepts, relationships, and
even something resembling logical structure. They can perform multi-step
reasoning, draw inferences, and generalize from principles to new
situations. These are behaviors that, in humans, we would not hesitate
to call thinking.
At the same time, LLMs fail in ways that human
thinkers rarely do. They can be confidently wrong about simple
arithmetic. They can contradict themselves within the same conversation.
They can be fooled by rephrasing a question slightly differently, even
when the underlying logic remains identical. These failures suggest that
whatever is happening inside the model is not the same as human
reasoning, even when the outputs look similar.
The Chinese Room Problem
The
philosopher John Searle famously illustrated this tension with a
thought experiment called the Chinese Room. Imagine a person locked in a
room with a large rulebook for responding to Chinese characters.
Messages in Chinese are passed under the door; the person looks up the
appropriate responses in the rulebook and passes them back out; to
anyone on the outside, the exchange looks like a fluent conversation
with a Chinese speaker. But the person inside understands nothing. They
are just following the rules.
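Searle made this argument in 1980 about computer programs in general,
long before LLMs existed, but it is now widely applied to them. On this
view, an LLM is essentially that person in the room: producing outputs
that appear to reflect understanding without any actual comprehension
behind them. The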
counterargument, made by many AI researchers, is that the human brain
itself might be described as a very complex version of the same process,
and that understanding may simply be what sophisticated information
processing looks like from the inside.
Neither side has
definitively won that argument. It remains one of the genuinely open
questions at the intersection of philosophy, neuroscience, and computer
science.
A More Useful Way to Frame the Question
Rather
than asking whether LLMs can think, it may be more useful to ask what
kinds of thinking they can do and what kinds they cannot. They are
remarkably capable at synthesizing information, identifying patterns,
generating creative connections, and producing well-structured
arguments. They are considerably weaker at sustained logical chains that
require holding many variables in precise relationship, at grounding
their knowledge in real-world experience, and at knowing the limits of
their own knowledge.
In practical terms, LLMs think differently
from humans, rather than not at all. They process language with a kind
of breadth and fluency that no human could match, drawing on connections
across billions of words. But they lack the embodied experience, the
emotional grounding, and the genuine self-awareness that shape human
thought in ways that go far beyond language.
Perhaps the most
honest answer is this: a Large Language Model does something that is
genuinely impressive, genuinely useful, and genuinely worth taking
seriously. Whether it rises to the level of thinking in the fullest
sense of that word is a question that says as much about how we define
thinking as it does about what the model is actually doing. And that
question, for now, remains beautifully unsettled.
Conclusion: More Art Than Science
Building
and training an AI, especially one that is helpful, honest, and safe,
is as much an art as it is a science. The data, the architecture, the
training techniques, the safety measures, the sandboxing, the resistance
to jailbreaking, the ongoing battle against hallucinations, and the
still-unresolved question of whether any of this constitutes genuine
thinking all play a role in how we understand and develop these systems.
But underneath all the technical sophistication is something
surprisingly human: the attempt to pass on values, instill judgment, and
build something that tells the truth even when making something up
would be easier.
From all of this, you can easily understand why
all AIs reflect the politics and cultural biases of their creators. It
begins with decisions regarding system prompts and hard filters, is
reinforced during training (including the selection and editing of the
training datasets used), and proceeds through the entire sandbox and
hallucination guardrail development process. It is inevitable: each AI
is effectively a mirror of the internal cognitive and psychological
environment of those who have birthed it.
We cannot write a
comprehensive rulebook for every situation an AI might encounter, any
more than we could write one for a child. Instead, we shape its
instincts through experience, feedback, example, and correction, and we
test it rigorously before trusting it with real responsibilities. The goal is not a perfect machine. It is a reliable, well-intentioned one that keeps getting better.
In that sense, training an AI is not so different from training a child, or a dragon.

