Teaching Agents What They Cannot Learn from Data
The dominant paradigm in AI capability development is data-driven: gather enough of the right data, apply sufficiently capable learning algorithms, and the agent develops the capabilities you need. This paradigm has proven remarkably effective for a wide range of capabilities, including language comprehension, task completion, and factual knowledge, and its successes have created the misleading impression that every important agent capability can be developed through data. Yet some of the most important properties of trustworthy, reliable agents cannot be learned from data alone, and the agent developers who recognize this distinction build better agents than those who do not.
Values and priorities are the clearest example. An agent's behavior in situations where competing values conflict — where honesty and kindness are in tension, where completing a task quickly and completing it carefully trade off, where following instructions and avoiding harm diverge — is determined by how it has been taught to prioritize. No dataset of prior examples perfectly specifies the right priority ordering for every novel conflict an agent will face. Explicit value instruction — telling the agent what matters most and why, in enough detail to generalize to novel situations — is required to supplement what data can teach about value-laden decision-making.
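As a rough illustration of what an explicit priority ordering might look like once it is written down, the sketch below resolves a conflict by preferring the option whose most serious compromise is least serious. The value names, their order, and the `Option` and `choose` helpers are all hypothetical; a strict ranking is a deliberate simplification of the richer "what matters most and why" instruction described above.

```python
from dataclasses import dataclass

# Hypothetical priority ordering, highest priority first. The values and
# their order are illustrative assumptions, not taken from the article.
PRIORITIES = ["avoid_harm", "honesty", "follow_instructions", "speed"]


@dataclass(frozen=True)
class Option:
    name: str
    violates: frozenset  # names of the values this option would compromise


def worst_violation_rank(option: Option) -> int:
    """Rank of the highest-priority value the option compromises.

    Lower rank means a more serious violation. An option that compromises
    nothing gets len(PRIORITIES), which is better than any real violation.
    """
    ranks = [PRIORITIES.index(value) for value in option.violates]
    return min(ranks, default=len(PRIORITIES))


def choose(options: list) -> Option:
    """Pick the option whose most serious violation is the least serious."""
    return max(options, key=worst_violation_rank)


# Example conflict: following the instruction would cause harm.
comply = Option("comply", frozenset({"avoid_harm"}))
decline = Option("decline_and_explain", frozenset({"follow_instructions"}))
chosen = choose([comply, decline])
```

Here `chosen` is the decline option, because compromising `follow_instructions` ranks as less serious than compromising `avoid_harm` under the assumed ordering. A real deployment would also need the reasons behind the ordering, which a lookup table cannot carry.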
Appropriate deference to human judgment is a behavioral property that data cannot fully specify, because it requires the agent to know what it does not know. Training data shows agents examples of situations where autonomous action was fine and situations where human consultation was sought, but the agent must generalize from these to novel situations that may not closely resemble any training example. Teaching appropriate deference requires explicit instruction about the categories of situation where deference is warranted, the signals that indicate a situation falls into those categories, and the mechanisms for seeking human input. Data can illustrate; instruction must specify.
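The three ingredients named above (categories, signals, and a mechanism for seeking input) can be sketched as a small rule table. Everything here is an assumption made for illustration: the category names, the signal names, and the `ask_human` callback stand in for whatever a particular deployment actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class DeferenceRule:
    category: str       # category of situation where deference is warranted
    signals: frozenset  # observable signals suggesting the category applies


# Hypothetical rules; a real deployment would write its own.
RULES = [
    DeferenceRule("irreversible_action",
                  frozenset({"deletes_data", "sends_payment"})),
    DeferenceRule("low_confidence",
                  frozenset({"no_similar_precedent", "conflicting_instructions"})),
]


def categories_triggered(observed: Set[str]) -> List[str]:
    """Return every deference category with at least one observed signal."""
    return [rule.category for rule in RULES if rule.signals & observed]


def decide(observed: Set[str], ask_human: Callable[[List[str]], str]) -> str:
    """Escalate through ask_human if any category triggers, else proceed."""
    triggered = categories_triggered(observed)
    if triggered:
        return ask_human(triggered)
    return "proceed"


def escalate(categories: List[str]) -> str:
    # Stand-in for a real escalation channel (ticket, chat message, pager).
    return "escalated:" + ",".join(categories)


routine = decide({"formats_text"}, escalate)
risky = decide({"sends_payment"}, escalate)
```

With these assumed rules, `routine` evaluates to "proceed" and `risky` is routed to the escalation callback. The table only makes the specification concrete; recognizing the signals in an unfamiliar situation is the hard part that the instruction itself has to address.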
Graceful handling of novel situations, meaning situations genuinely outside the distribution of training examples, requires behavioral properties that data cannot provide. An agent encountering a truly novel situation must recognize its novelty, acknowledge uncertainty explicitly, reason carefully about what principles apply, and proceed cautiously while seeking clarification. None of these behaviors can be reliably learned from examples of past situations, because by definition the training data contains no example of the situation at hand. Instruction about how to handle novelty, pitched at the level of principle so that it extends to unknown unknowns, is qualitatively different from data about how earlier unfamiliar situations happened to be handled.
Consistency of character across very different contexts requires something beyond pattern matching on training examples. An agent that maintains its core values and communication style whether it is handling a routine request or a high-stakes edge case needs a stable internal model of its own character. This internal model is not derived from data alone: it is constructed through explicit design and instruction that articulates what the agent's character is and how it should manifest across diverse situations. Data can inform this construction, but the construction itself is design work.
The instruction process for teaching agents what data cannot teach is more craft than engineering. Writing instructions that are specific enough to be actionable, general enough to apply to genuinely novel situations, consistent across the full range of situations the agent will encounter, and aligned with the deployer's actual values — not just their stated values — requires judgment, iteration, and testing. The quality of this instruction work is a key differentiator between agents that behave well under pressure and in novel situations and those that do not.
Feedback loops between agent behavior and instruction refinement are how the teaching process improves over time. When an agent behaves in unexpected or suboptimal ways in novel situations, those incidents are evidence that the instruction was incomplete or ambiguous in ways that were not anticipated. Treating these incidents as instruction improvement opportunities — analyzing what the instruction failed to specify, drafting clearer or more complete instruction, and testing whether the refined instruction produces the intended behavior — is how the teaching process matures. This feedback loop requires human judgment at each step; it cannot be fully automated.
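One way to support the testing step of this loop is to turn each reviewed incident into a regression case and re-check it whenever the instruction changes. The sketch below assumes an agent can be called as a function of an instruction and a situation; `Incident`, `regressions`, and `toy_agent` are hypothetical names, and the toy agent is a stand-in, not a model of any real system. The human steps (judging what the correct behavior was, and drafting the revision) stay outside the code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Incident:
    situation: str
    expected: str  # behavior a human reviewer judged to be correct


def regressions(agent: Callable[[str, str], str],
                instruction: str,
                incidents: List[Incident]) -> List[Incident]:
    """Return the incidents the agent still mishandles under this instruction."""
    return [i for i in incidents
            if agent(instruction, i.situation) != i.expected]


def toy_agent(instruction: str, situation: str) -> str:
    # Stand-in agent: defers on refunds only if the instruction says to.
    if "refund" in situation and "refunds require approval" in instruction:
        return "ask_human"
    return "proceed"


incidents = [
    Incident("customer requests refund", "ask_human"),
    Incident("customer asks for store hours", "proceed"),
]

before = regressions(toy_agent, "be helpful", incidents)
after = regressions(toy_agent, "be helpful; refunds require approval", incidents)
```

Under the original instruction the refund incident is still failing; after the revision, no incident fails. A harness like this only checks that a revision fixes the known cases without breaking others. Deciding what the instruction failed to specify remains a matter of judgment.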
The investment required to teach agents what data cannot teach is largely front-loaded: most of the work comes before and at deployment, and ongoing attention is needed afterward as agents operate in new contexts. This investment pays dividends in the form of agents whose behavior is trustworthy in the specific ways that matter most: consistency under pressure, appropriate deference in high-stakes situations, graceful handling of novelty, and stable character across diverse contexts. These properties are what distinguish agents that earn sustained trust from those that demonstrate capability but ultimately disappoint when circumstances test their judgment.