Everyone talks about AI learning. Almost nobody ships it. I've sat through upwards of 50 AI vendor demos over the past two years. Every deck mentions "continuous learning" or "adaptive intelligence" or some variant thereof. Then you ask the question: "How does the system learn from production outcomes?" And you get vague hand-waving about retraining schedules and "future roadmap items" and "we're exploring approaches." Which is vendor-speak for "we don't do that but we'd like you to believe we might someday." The Dupoux-LeCun-Malik paper calls this out directly. Their System B—learning from action—requires closed-loop feedback. You do something. You observe the result. You adjust. Repeat until competent. Most AI systems break this loop at the observation step. They do things. They do not observe results in any meaningful way. They definitely do not adjust based on what they observe.
What the actual loop looks like, in our implementation:

1. Request arrives: a workflow needs to execute, context attached.
2. Execute: the system runs the workflow, with every step logged before execution. This is non-negotiable.
3. Outcome captured: the workflow completes with success, partial success, or failure, plus detailed metrics on timing, resource usage, and quality scores.
4. Learn: the outcome feeds back to the pattern layer. Successful executions increase pattern confidence; failures decrease it. The system extracts "what worked" and "what didn't" from the outcome data.
5. Adjust: confidence thresholds tune, pattern priorities shift, and future requests route differently based on accumulated learning.

This isn't a batch job that runs on Tuesday nights. It happens per execution. Every workflow teaches the system something. If my math is correct, we process something like 10,000 learning signals per day across deployments. That's 10,000 opportunities for the system to get marginally smarter, or 10,000 opportunities to reinforce bad patterns if you're not careful about outcome classification.
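The loop above can be sketched in a few lines. This is a hedged illustration, not our actual code: the names (`Outcome`, `PatternStore`, `handle_request`) and the confidence deltas are all invented for the example.

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)

@dataclass
class Outcome:
    status: str    # "success" | "partial" | "failure"
    metrics: dict  # timing, resource usage, quality scores

@dataclass
class PatternStore:
    confidence: dict = field(default_factory=dict)

    def update(self, pattern_id: str, outcome: Outcome) -> None:
        # Learn: successes raise pattern confidence, failures lower it.
        # The delta values here are placeholders, not tuned numbers.
        delta = {"success": 0.02, "partial": 0.0, "failure": -0.05}[outcome.status]
        current = self.confidence.get(pattern_id, 0.5)
        self.confidence[pattern_id] = min(1.0, max(0.0, current + delta))

def handle_request(pattern_id: str, execute, patterns: PatternStore) -> Outcome:
    logging.info("executing %s", pattern_id)  # log before execution, always
    outcome = execute()                       # run and capture the outcome
    patterns.update(pattern_id, outcome)      # learn per execution, not in batch
    return outcome
```

The point of the sketch is the shape, not the numbers: the update happens inside the request path, so there is no separate "learning job" to schedule or forget.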
Why most vendors skip this: building real feedback loops is expensive as hell. You need instrumentation everywhere—every step emitting structured telemetry, not just logging text that humans might read someday, but structured data you can process programmatically. You need outcome classification—raw results aren't useful, you need to categorize success versus partial versus failure, and for failures you need root cause analysis. Timeout? Bad data? Wrong approach? Payer rejection for reasons unrelated to your system? These have different implications for learning. You need attribution—when a five-step workflow fails at step four, which step actually caused the problem? A failure at step four doesn't necessarily mean step four was wrong; maybe step two passed bad data downstream. You need storage—every outcome record takes space, and at scale this is serious data volume. You need processing—something has to analyze outcomes and update the system continuously. This is not a side project. It's core infrastructure. Most vendors skip it because it's 10x the engineering work of a frozen model, and customers don't know to ask the right questions during demos.
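The "structured telemetry, not text logs" point can be made concrete. A minimal sketch, assuming invented field names; one machine-parseable event per step is what lets downstream classification and attribution work at all:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class StepEvent:
    workflow_id: str
    step_index: int
    step_name: str
    status: str         # "ok" | "error" | "timeout"
    duration_ms: float
    emitted_at: float   # Unix timestamp

def emit(event: StepEvent) -> str:
    # One JSON object per line (JSONL): downstream outcome classifiers
    # and attribution logic can process this programmatically, unlike
    # free-text log lines written for human eyes.
    return json.dumps(asdict(event))
```

Usage would look like `emit(StepEvent("wf-42", 3, "submit_claim", "error", 812.0, time.time()))`, producing a line any consumer can parse without regexes.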
Our system has dedicated infrastructure for outcome processing:

- OutcomeCapture records raw execution results.
- OutcomeClassifier categorizes each result as success, partial, or failure.
- FailureAnalyzer extracts failure causes and patterns. We track six root-cause categories: input quality, pattern mismatch, execution error, external rejection, timeout, and unknown.
- PatternUpdater adjusts pattern confidence based on outcomes.
- ThresholdCalibrator adjusts meta-thresholds based on aggregate performance.
- InsightGenerator produces learning recommendations.

This pipeline runs on every execution, with an average processing time of 100-200ms after workflow completion. The system doesn't wait to learn. It doesn't accumulate data for weekly analysis. It learns in production, from production, continuously. The paper discusses "trial-and-error with feedback." That's abstractly correct. The implementation is: capture everything, classify outcomes, analyze failures, update patterns, adjust thresholds, generate insights. It's not one step. It's six separate infrastructure components that have to work together without losing data.
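A toy sketch of how the six stages might chain. The stage names come from the text above; everything inside them, the dict-based records, the field names, the classification rules, is assumed for illustration:

```python
ROOT_CAUSES = {"input_quality", "pattern_mismatch", "execution_error",
               "external_rejection", "timeout", "unknown"}

def outcome_capture(raw: dict) -> dict:
    return dict(raw)  # record the raw execution result untouched

def outcome_classifier(rec: dict) -> dict:
    if rec.get("error") is None:
        rec["class"] = "success"
    elif rec.get("steps_completed", 0) > 0:
        rec["class"] = "partial"
    else:
        rec["class"] = "failure"
    return rec

def failure_analyzer(rec: dict) -> dict:
    if rec["class"] != "success":
        cause = rec.get("cause", "unknown")
        rec["root_cause"] = cause if cause in ROOT_CAUSES else "unknown"
    return rec

def pattern_updater(rec: dict) -> dict:
    # Only pattern-mismatch failures implicate the pattern itself; other
    # causes carry no signal about pattern quality, so the delta is zero.
    rec["confidence_delta"] = (0.02 if rec["class"] == "success"
                               else -0.05 if rec.get("root_cause") == "pattern_mismatch"
                               else 0.0)
    return rec

def threshold_calibrator(rec: dict) -> dict:
    rec["recalibrate"] = rec["class"] != "success"  # flag aggregate re-tuning
    return rec

def insight_generator(rec: dict) -> dict:
    rec["insight"] = f"{rec['class']}:{rec.get('root_cause', '-')}"
    return rec

# Stages run in order on every execution, not as a weekly batch job.
PIPELINE = [outcome_capture, outcome_classifier, failure_analyzer,
            pattern_updater, threshold_calibrator, insight_generator]

def process(raw: dict) -> dict:
    rec = raw
    for stage in PIPELINE:
        rec = stage(rec)
    return rec
```

The design point the sketch preserves: each stage consumes the previous stage's record and adds to it, so a dropped stage loses data for every stage after it, which is why the text insists the components "work together without losing data."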
Here's something the paper doesn't address in depth, and I get why because it's ugly: learning from failure is considerably harder than learning from success. Success is unambiguous. It worked. Move on. Failure has causes, and causes have dependencies, and untangling them is genuinely difficult. A workflow might fail because the input data was garbage (not our fault), because the chosen pattern was wrong for this case (our fault), because the execution environment had a transient issue (infrastructure fault), because the external system rejected for reasons unrelated to our approach (payer fault), or because something took too long (could be anyone's fault). We invest heavily in failure analysis because pattern confidence should only decrease for pattern mismatch failures. A network timeout doesn't mean the pattern was wrong—it means the network was slow. Without proper attribution, a network outage would tank confidence in patterns that had nothing to do with the failure. This is finicky, detail-oriented work that doesn't demo well and doesn't show up in feature lists. But it's the difference between a system that learns correct lessons and a system that learns superstitious associations.
If your AI vendor says "continuous learning" and can't explain this loop in specific detail, they don't have it. Ask: How are execution outcomes captured? How are failures classified? How does outcome data update the model? How quickly do updates take effect? Vague answers mean frozen model with aspirational roadmap. Specific answers mean real feedback loops. We built the loops first, before the fancy features, because without them nothing else matters. The paper validates why this is essential. The market will validate who actually has it.