
What I Have Learned from Years Working with "Black Boxes"

I spent my formative years as a neuroscientist deploying research pipelines from bash scripts. For loops chaining six or more disparate tools. Log files as the only window into what happened. When something crashed, you'd trace through messy outputs trying to reconstruct the state of a system that never wanted to be understood.

When there was a GUI, it was often worse: pull-your-hair-out bad. Bare minimum controls, no way to inspect what was actually happening, and a constant temptation to abandon it and return to the CLI. Nothing was reusable. Every project—sometimes every dataset—required rebuilding the wheel, if not reinventing it.

I've since built production systems across neuroimaging, real-time ML, robotics, and now simulation platforms with AI integration. The domains change; the ways you can screw up are constant. Whether I'm debugging a neuroimaging pipeline, a wearable ML classifier, or a simulated world with 100,000 entities, the same questions recur: What's actually happening? Why did it stop working? Why did it work last time? And on and on.

Over a decade of this, I've developed a set of principles. They're not novel—most good engineers arrive at similar conclusions. But I've found that stating them explicitly, and designing for them from day one, saves enormous pain later.

Build for inspection first

If you can't see what's happening, you can't fix it.

This sounds obvious. It's also the first thing that gets sacrificed when you're prototyping under pressure. "We'll add observability later." "This part is temporary anyway." "I just need it to work for the demo."

I've given and received all of these lines before.

In startup contexts, I watched fragile prototype pipelines accumulate technical debt specifically because inspection was ignored. The reasoning was that components would be replaced anyway, so why invest in visibility? But the prototypes never got replaced cleanly. They got extended, patched, and eventually became the permanent solution.

I formed the opinion that inspection shouldn't be sacrificed even in throwaway code, because code is rarely as throwaway as you think. And often, the tools you build for inspection become the most reusable parts of the project. The pipeline dies; the debugger lives on.

In MUSE, my current project, this principle shapes everything. Every system, every entity, every detail has an observable state. The simulation doesn't just run—it exposes what it's doing at every level of abstraction. When something behaves unexpectedly, I don't read logs. I open a visual inspector and watch the data flow. This relates directly to my prior post on Cognition and Developer Experience. Extra steps between the current task and debugging, even just a few clicks and some scrolling, are enough of a distraction to break your flow.

Make state observable by default

Hidden state is technical debt.

This is the corollary to building for inspection. It's not enough to have debugging tools—the system itself needs to be designed so that state is visible without heroic effort.

Research pipelines taught me this the hard way. Tools would maintain internal state that wasn't exposed anywhere. You'd run a pipeline, get unexpected results, and have no way to know whether the problem was in your input, your configuration, some cached intermediate result, or a bug in the tool itself. At times, the simplest debugging strategy was to delete everything, reinstall, and start over.

In MUSE, state observability is architectural and ubiquitous. The simulation engine doesn't have private variables that affect behavior invisibly. Every detail that matters is registered, typed, and inspectable. When I'm debugging why a simulated character is behaving strangely, I can trace the exact values of every relevant system—metabolism, stress, memory, social bonds—and see how they're influencing each other in real time. But it doesn't stop there. Inspection is even built into how MUSE Living Worlds users interact within the simulation. You can inspect any detail of the simulation, provided your character has the right tools and aptitude. No matter what, you can try. You may not get a correct answer back, but in this case, that's a feature.
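To make "registered, typed, and inspectable" concrete, here is a minimal sketch of the pattern—not MUSE's actual API. The names (`StateRegistry`, `ObservableState`) are hypothetical; the point is that systems declare their state up front, so an inspector can enumerate and sample everything without reaching into private variables.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ObservableState:
    name: str        # stable, inspectable identifier
    dtype: type      # declared type, so tools can render it sensibly
    read: Callable[[], object]  # how the inspector samples the live value

class StateRegistry:
    def __init__(self):
        self._states: dict[str, ObservableState] = {}

    def register(self, name: str, dtype: type, read: Callable[[], object]) -> None:
        self._states[name] = ObservableState(name, dtype, read)

    def snapshot(self) -> dict[str, object]:
        # One call gives an inspector a typed view of everything that matters.
        return {s.name: s.read() for s in self._states.values()}

# Usage: systems declare what matters; the inspector never guesses.
registry = StateRegistry()
stress = {"level": 0.3}
registry.register("npc.blacksmith.stress", float, lambda: stress["level"])
print(registry.snapshot())  # {'npc.blacksmith.stress': 0.3}
```

The design choice worth noting: the registry stores readers, not copies, so a snapshot always reflects live state rather than a stale cache.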

This also enables something I've come to value enormously: visual debugging. Not every system needs it, but complex, interconnected systems often demand visual tools to be understandable. When you have hundreds of interacting components, no amount of log output will give you the gestalt. You need to see the system as a system—flows, relationships, feedback loops.

Design the interface and the runtime together

The UI isn't a skin on the system. It shapes what's possible.

I design from both ends: what the user needs to see and control, and what the runtime needs to execute efficiently. The visual interface and the execution engine constrain each other. Solving one without the other doesn't work.

This is where I diverge from the common pattern of "build the backend, then add a frontend." That approach treats the interface as presentation layer—a view onto data that exists independently. But for complex systems, the interface isn't just showing you the system. It's your primary tool for thinking about the system. And if the interface can't express something, you often won't think to build it.

In MUSE, the Constructor is the clearest example. We design biological and psychological systems through visual graphs—dragging nodes, connecting data flows, defining mathematical relationships. The interface enforces type constraints and valid patterns. But here's the key: those visual graphs compile directly to GPU execution kernels.

The flow looks like this: designers create and connect nodes, define types and metadata, specify mathematical and biological models within heuristic nodes. The UI validates patterns in real time. Every change saves to a cloud database instantly—multiple developers can work on the same systems simultaneously. Then Gestalt, my C++ compiler, reads the stored schema and generates GPU source code, which compiles to binary.

When you reload the Constructor, existing systems rebuild visually from the database. One source of truth. The visual representation and the compiled kernel are interpretations of the same underlying schema—not parallel implementations that might drift apart.
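The "one source of truth, two interpretations" idea can be sketched in a few lines. This is an illustrative toy, not Gestalt itself: a stored schema dict stands in for the cloud database, and both the visual label and the generated kernel text are derived from it—never from each other.

```python
# Hypothetical schema record: the single stored truth for one system.
schema = {
    "system": "metabolism",
    "inputs": ["glucose", "activity"],
    "output": "energy",
    "expr": "glucose * 0.8 - activity * 0.2",
}

def render_node(s: dict) -> str:
    # One interpretation: what a visual editor might display for this node.
    return f"[{s['system']}] ({', '.join(s['inputs'])}) -> {s['output']}"

def generate_kernel(s: dict) -> str:
    # Another interpretation: the C-like source a compiler might emit.
    args = ", ".join(f"float {i}" for i in s["inputs"])
    return f"float {s['system']}({args}) {{ return {s['expr']}; }}"

print(render_node(schema))
print(generate_kernel(schema))
```

Because both functions read the same record, the visual and the kernel cannot drift apart—changing the schema changes both interpretations at once.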

This replaced what would traditionally be hand-crafted C++ classes. A whole class for metabolism, another for circadian rhythms, another for stress response—when in reality these systems are tightly intertwined and influence each other constantly. The visual approach lets you model that interconnection directly, and a mixture of user design and compiler logic figures out how to execute it most efficiently.

Schemas are contracts

Between people, code, and runtime. The database defines the truth; everything else is generated from it.

This is how we retain sanity across so many layers. The visual depiction operates on the same truth as the compiler that generates the kernels, so we don't waste time tracking down inconsistencies between representations.

When I want to change how systems are visualized in the Constructor, or optimize how GPU kernels handle certain operations, I don't change the underlying truth. I don't touch the scientific models. I change the interpretation of it. Visual representation and compiled kernels have no interdependence—they're both downstream of the schema.

This matters most when things go wrong. If the visual editor shows one thing and the runtime does another, you have a nightmare debugging scenario. Is the bug in the visualization? The code generator? The runtime? Some interaction between them? With a single source of truth, those questions are much simpler. The schema is correct by definition. If the visual is wrong, fix the visualizer. If the kernel is wrong, fix the generator. The system of record doesn't change.

Assume failure

Build replay, snapshots, and rollback from day one.

I assume I will fail. I assume the code will break, the data will corrupt, the design will need to change. Building recovery mechanisms isn't pessimism—it's realism.

In MUSE, this shows up in several ways. As we develop and modify entity designs, we can toggle whether changes save to production or dev schemas. When transitioning from dev to production, we port over the dev schemas with full backups. Nothing is overwritten without a safety net.
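The "nothing overwritten without a safety net" pattern is simple to sketch. Again, the names here (`SchemaStore`, `write`, `rollback`) are hypothetical, not MUSE's real tooling—the point is that every destructive write captures a backup first, so rollback is always possible.

```python
import copy

class SchemaStore:
    def __init__(self, data: dict):
        self.data = data
        self.backups: list[dict] = []

    def write(self, key: str, value) -> None:
        # Safety net first: snapshot the current state before mutating it.
        self.backups.append(copy.deepcopy(self.data))
        self.data[key] = value

    def rollback(self) -> None:
        # Restore the most recent snapshot, if one exists.
        if self.backups:
            self.data = self.backups.pop()

store = SchemaStore({"metabolism": "v1"})
store.write("metabolism", "v2")   # e.g. promoting a dev schema to production
store.rollback()                  # the previous state is always recoverable
print(store.data)  # {'metabolism': 'v1'}
```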

CLIO, my development workbench, takes this further. It tracks accomplishments across sessions and auto-generates commit messages based on what actually changed—which you can then edit before committing. This encourages pushing at reasonable intervals with useful messages, without forcing it. It's a nudge toward good habits rather than a mandate.

But the deeper version of "assume failure" is about memory. I assume I will forget. I assume I will misremember how I thought about a problem six months ago and lose the lessons, lose the explanations for design choices that only made sense under a different conception of the system.

CLIO maintains snapshots of accomplishments, decisions, architectural patterns, errors. Logs of Claude Code conversations. A full timeline of work done. Every plan for refactors. Design documentation and philosophy documents. This lets me see changes not only in code over months, but in thought processes and conceptual understanding. When I revisit a decision and think "why did I do it this way?", the answer is usually recoverable.

AI for translation, not decision-making

This is where I have the strongest opinion—and probably the most contrarian one in the current climate.

In MUSE, LLMs translate between natural language and simulation state. A user says "I want to talk to the blacksmith about his daughter." The LLM parses that intent, maps it to simulation commands, and later translates the simulation's response back into natural prose. But the LLM doesn't decide what the blacksmith says. The simulation does. The LLM is the interface and translator, not the engine.
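The shape of that split can be sketched as three functions. `llm_parse` and `llm_render` are stand-ins for real model calls (MUSE's actual interfaces are not shown here); what matters is that the simulation alone decides the outcome, and the language model only translates at the boundaries.

```python
def llm_parse(utterance: str) -> dict:
    # Stand-in: a real LLM would map free text to a structured intent.
    return {"action": "talk", "target": "blacksmith", "topic": "daughter"}

def simulate(command: dict, world: dict) -> dict:
    # The engine, not the LLM, decides what actually happens.
    npc = world[command["target"]]
    willing = npc["trust"] > 0.5 and command["topic"] not in npc["taboo"]
    return {"speaker": command["target"], "responds": willing}

def llm_render(event: dict) -> str:
    # Stand-in: a real LLM would turn the event into fluent prose.
    verb = "answers you" if event["responds"] else "turns away"
    return f"The {event['speaker']} {verb}."

world = {"blacksmith": {"trust": 0.7, "taboo": ["the war"]}}
command = llm_parse("I want to talk to the blacksmith about his daughter")
print(llm_render(simulate(command, world)))  # The blacksmith answers you.
```

Note where the hallucination risk lives: even if `llm_render` embellishes the prose, it cannot change what the blacksmith decided, because that decision was made in `simulate`.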

This is a deliberate architectural choice, and it goes against the grain of most AI-powered applications right now. The dominant pattern is to make LLMs the brain—the thing that makes decisions, maintains state, generates content. In interactive fiction specifically, most approaches are essentially "LLM with a wrapper": some context management, some prompt tuning, but fundamentally the language model is responsible for everything.

I think this is a mistake for anything complex or long-running.

LLMs are phenomenal at natural language. They excel at understanding intent, generating fluent prose, and translating between representations. But they have fundamental limitations for orchestrating complex systems:

They take shortcuts. When generating code or making decisions, LLMs optimize for plausible output, not correct output. This is fine when you're generating boilerplate that you (a knowledgeable engineer) will review anyway. It's not fine for autonomous production systems where correctness matters.

They hallucinate. This isn't a bug that will be fixed; it's inherent to how the technology works. LLMs will confidently make things up. They'll forget context. They'll contradict themselves. This is acceptable when you're not pushing them to the brink on context length and not expecting miracles. It's unacceptable when you need consistency over thousands of interactions. They will improve. They have, immensely, but...

They can't maintain complex state. LLMs don't have real memory—they have context windows. And they can't process thousands of formulae per entity across 100,000+ entities while maintaining deterministic, causal outputs. That's not a limitation to be worked around; it's a fundamental mismatch between the tool and the task. Ask an LLM to track the relationships, histories, and evolving states of thousands of entities over months of interaction, and it will fail. Not because it's bad at its job—because that's not its job. That's what simulation engines are for.

So in MUSE, I use LLMs where they reliably excel: natural language understanding and generation. They interface with, interpret for, and invoke other tools that excel at their own tasks. The simulation engine maintains state. The database enforces consistency. The compiler generates optimized code. The LLM talks to the user.

This is more work than just throwing everything at a language model. It requires building actual systems—simulations, compilers, databases. But the result is something that can run for years, maintain consistent state across thousands of entities, and produce emergent behavior that surprises even me. The LLM couldn't do that alone. But it can make that system accessible to users who just want to speak naturally. It can make it accessible to a wide range of ages, reading levels, languages, and so on.

The through-line

Looking back at these principles, they share a common thread: they're all about building systems you can trust.

Trust that you can see what's happening. Trust that the state is what you think it is. Trust that the interface reflects reality. Trust that the schema is the source of truth. Trust that you can recover from failure. Trust that the AI is doing what it's good at, not everything.

Research taught me that black boxes were going to make every last hair on my head gray. Startups taught me that shortcuts compile. Building MUSE has taught me that these lessons scale—that the same principles that make a neuroimaging pipeline debuggable also make a 100,000-entity simulation manageable.

It's not complicated, but also not universal. Build for inspection. Make state observable. Design interfaces and runtimes together. Use schemas as contracts. Assume failure. Use AI for what it excels at.

The hard part is actually doing them, from day one, when you're under pressure to just make it work. But "just make it work" creates systems you can't understand, can't debug, and can't trust: that will hurt users downstream much more than it will you, the creator.

Nathan Baune

Neuroscientist & engineer building simulated worlds. Founder Gothic Grandma & Chief Architect of the MUSE Ecosystem.