
From Research Pipelines to Production: Lessons from Both Worlds

I've spent a decade building software in research labs—neuroimaging pipelines, robotics, VR sensorimotor assessments, real-time EEG classification. I've also co-founded a startup that put wearable analytics in the hands of clinicians, and I'm now building a simulation platform that needs to work for researchers, creators, and consumers simultaneously.

These worlds operate differently. They optimize for different things, tolerate different failures, and break in different ways. But they also share patterns that aren't obvious until you've lived in both.

This is what I've learned building systems for scientists who have likely never heard of Docker and for users who will never read your instructions.

What Research Gets Wrong

Let me describe a codebase I inherited.

15,000 lines of MATLAB. GUI built in MATLAB's archaic App Designer from the early 2010s—Windows XP energy. Nothing could be generated from code. If you duplicated the project to test changes, everything broke. Comments like % TL 1/5/2010 - NOT SURE WHY THIS IS HERE BUT IT IS BEING USED BY SOMETHING. Hardware controllers from a company that no longer exists, with no documentation and no support.

It worked. Barely. If you ran it exactly right. Standing on one foot, not making eye contact with the monitor when you hit run—that kind of fragility.

This is research infrastructure at its worst. And it's common. Not because researchers are bad engineers (some are excellent), but because the incentive structure doesn't reward robustness. Papers get published on results, not on code quality. The lone wolf model means one person builds a system, uses it for their dissertation, and leaves. The next person inherits it or starts from scratch (scarred by trying to play archaeologist with someone else's code graveyard).

What gets neglected:

  • Error handling (non-existent)
  • Testing (run it and see if it crashes)
  • Interfaces (the worst you've ever seen—truly awful R and MATLAB GUIs hacked together in desperation; more often, no UI at all)
  • Handoff (there is no handoff; there's archaeology and 3-hole-punch tomes of printed code and APIs—I wish I was joking)

Research tolerates this because the optimization target is different: accuracy and validity in highly controlled conditions. If it works in the exact scenario the paper describes, that's enough. Robustness across edge cases isn't the goal. When something does break (and it always will), restart the computer and retry.

Production can't work this way. Production means users you've never met, running your software on machines you've never seen, in contexts you didn't anticipate. That requires a different kind of engineering entirely.

What Research Gets Right

Here's the thing: research infrastructure isn't all bad. In some ways, its gold standard is higher than industry's.

Documentation and logging in well-run labs can be meticulous. When you're publishing results that need to be reproducible, you track everything. Parameters, versions, data provenance, analysis decisions. The paper trail matters because peer reviewers will ask.

Planning and deliberation happen at a pace production rarely allows. We spend weeks reading papers before writing a line of code. We hold meetings just to discuss ideas—not sprint planning, not standups, just thinking together. The luxury of slow thought produces architecture that's considered rather than reactive.

Peer review applies not just to papers but to methods. Your approach gets scrutinized by people who know the domain deeply. From my experience, that's a form of quality control production environments don't match.

The problem isn't that research lacks rigor. It's that the rigor is applied unevenly—meticulous about scientific validity, careless about software engineering. The result is systems that are trustworthy in their conclusions but nightmarish in their implementation.

What Production Taught Me

When we built Proprio—a wearable activity recognition platform for stroke rehabilitation—I learned what breaks when you leave the lab. I wasn't allowed in the room during user testing with stroke patients. I wasn't even allowed to meet them.

You cannot write perfect instructions.

In research, you can hand someone a protocol. You can train them. You can supervise their first ten sessions. In production, your product has to be self-explanatory. When there are instructions, they need to do their job in seconds. Interfaces and concepts need to be grasped intuitively.

In research, we hide complexity behind the monitor and suffer through terrible UIs because we're motivated to figure them out. Real users aren't. They'll leave.

"Good enough" is sometimes the right answer.

We reduced Proprio's IMU sampling rate to 30Hz to save battery life—well below the rates used in virtually all published research on activity recognition from IMU data. The academic gold standard would have said that's unacceptable.

But the academic gold standard doesn't account for a device dying mid-session because the battery couldn't last a full day. And academic gold standards don't need to stream data from a watch to the cloud in real time. We made the tradeoff. It worked. Even at 30Hz, the classifier outperformed published results in the engineering literature by 20%, and the device actually got used.

This was a hard lesson. The academic instinct is to match what came before, and when you do iterate, to iterate microscopically. That's not to say production environments are all "move fast and break things". In many ways, legacy patterns and ways of thinking are hard to shake no matter where you go.

Where My Research Background Actually Helped

Not everything I learned in the lab was wrong. Some of it gave me advantages that typical production engineers lack.

Biological architectures as computational patterns.

I spent years studying how brains and bodies process information—sensorimotor integration, perception-action coupling, hierarchical motor control. These systems have patterns that are common in organisms but rare in computation, often only because they haven't been tried.

Traditional game AI uses brute force. Vision is ray casting. Physics is mesh collision. Pathfinding uses precomputed nav meshes. Actions are hardcoded classes and inheritance. These are computational solutions to computational problems.

But brains don't work that way. Organisms aren't optimal—they're efficient and just good enough; sometimes not even good enough. Our brains filter for salient information, not the entire world's contents. Every detail, every piece of context shapes processing and decision-making. That isn't captured in behavior trees or state machines.

When I designed MUSE, I named the computational nodes "heuristics" because they are modeled after the same heuristics our brains and bodies use. The system doesn't process every aspect of a being every frame while trying to hit 60 FPS—because that's not how biology works. It uses distributed closed-loop systems where one calculation influences another over time. Behavior emerges from feedback loops, not from sequential logic executing top to bottom every tick.
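To make the closed-loop idea concrete, here's a minimal sketch in Python. The `Heuristic` class and the hunger/foraging coupling are hypothetical illustrations of the pattern, not MUSE's actual node API:

```python
# Minimal closed-loop sketch: each node drifts toward its driving signal a
# little each tick, so influence accumulates over time instead of resolving
# in one pass of sequential logic.

class Heuristic:
    """A node whose value tracks its input gradually, tick by tick."""
    def __init__(self, value=0.0, rate=0.1):
        self.value = value
        self.rate = rate  # how quickly the node tracks its drive

    def step(self, drive):
        # Move a fraction of the way toward the driving signal.
        self.value += self.rate * (drive - self.value)
        return self.value

# Two coupled nodes: hunger drives the urge to forage, and successful
# foraging feeds back to suppress hunger.
hunger = Heuristic(value=0.8, rate=0.05)
foraging = Heuristic(value=0.0, rate=0.2)

for _ in range(200):
    f = foraging.step(hunger.value)        # foraging urge tracks hunger
    hunger.step(hunger.value - 0.5 * f)    # eating slowly reduces hunger
```

Run it and the two values settle together: foraging rises to meet hunger, hunger falls as foraging takes effect, and the "eat when hungry, stop when full" behavior emerges from the loop rather than from an if/else chain.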

This architecture came from neuroscience, not from game development tutorials.

Accessibility as a design constraint.

Years of working with patients—amputees, stroke survivors, people with Parkinson's, aging populations—taught me the range and variability in how humans approach problems. Accessibility isn't a feature you add later. It's a lens that changes how you design from the start.

That perspective shaped BABEL's entire philosophy: if natural language is the interface, then the interface meets users where they are, not where you wish they were. MUSE is designed with everyone in mind. If you can read or listen, if you can type or speak, and if BABEL can handle a language you are comfortable in (it likely can), then MUSE could be for you.

What I Had to Unlearn

The switch from research to product R&D was, if anything, a return to my natural way of being.

Research trained me for long cycles. Let's be honest, it forced me into them. Three years from ideation to published paper. Careful, deliberate, slow. Founding PlatformSTL was a breath of fresh air—finally, something that worked at my natural cadence. Rapid iteration. Ambiguity everywhere. Problems to solve around every corner.

I didn't have to learn to move faster. I did have to remind myself that there are other ways of being that better match my inclinations.

MR.Flow: Research Infrastructure Done Differently

When I built MR.Flow—a pipeline orchestration system for MRI processing—I tried to apply production thinking to a research problem.

The gold standard in neuroimaging preprocessing is fMRIPrep. It's rigorous, well-documented, and produces reproducible results. It also effectively requires Docker, command-line fluency, and enough systems knowledge to configure GPU parallelization (or go without). Most researchers don't have this. Many can manage bash scripts, though usually by copying and pasting commands into a terminal and then praying.

MR.Flow exists for them. Not power users who want every parameter exposed—researchers who just need high-end results without massive time sinks debugging environment issues.

I made deliberate choices:

  • Expose parameters that matter in 90% of cases
  • Use the default settings found in 99% of published studies and clearly explain why
  • Support the most common processing phases, not every variant
  • Hide Docker entirely behind a GUI

This is "good enough" engineering applied to research infrastructure. It sacrifices flexibility for accessibility. Power users might be frustrated. The researchers who actually need it can finally use the tools the field considers standard. I don't envision MR.Flow being the final destination for many users. I hope it gets them far enough through the door that they can be a part of the community and maybe find the motivation, or overcome the intimidation, to delve deeper into more advanced analyses.

MUSE: Where Both Worlds Collide

MUSE is strange because it demands both research-grade rigor and production-grade usability—simultaneously.

The simulation needs to be inspectable, traceable, scientifically grounded. PYTHIA (our research interface, still in development) will let researchers outside Gothic Grandma run FONT as a headless simulation, gather data, and conduct population-level studies over simulated decades. It needs to be a legitimate research instrument.

But MUSE Living Worlds—the consumer application—needs to be invisible. Players shouldn't know they're interacting with a biologically-inspired entity system. They should just experience a world where things make sense, where consequences feel real, where characters seem alive. All that complexity should be entirely hidden.

This is the tension I navigate constantly. Traditional game AI uses behavior trees and trial-and-error tuning. Shift a value, see what happens. That can't work here—the logic behind every entity is far too complex. We need research-grade statistical approaches to understand what's happening in our own simulation.

But we're a small team. Three people—two of us comfortable identifying as data scientists, all of us neuroscientists. This is exciting complexity for us. Just another Tuesday. But we want it invisible to consumers and even to content creators who might use our tools.

The same philosophy behind BABEL applies everywhere: translate intuitive everyday patterns into mathematical curves, models, and parameters. Meet users where they are. Hide the machinery.

The Lesson That Bridges Both Worlds

Research and production seem like opposites. One optimizes for validity under controlled conditions; the other optimizes for robustness under chaotic ones. One tolerates terrible interfaces; the other demands intuitive ones. One moves slowly and deliberately; the other ships and iterates.

But the deeper lesson is the same in both: build for the constraint that actually matters.

In research, that constraint is scientific validity. In production, it's user experience. In both, it's easy to optimize for the wrong thing—to build what's impressive rather than what's needed.

Proprio didn't need research-grade sampling rates. It needed battery life. MR.Flow didn't need every parameter exposed. It needed to be usable by people who don't know what a "container" is. MUSE doesn't need to show its complexity. It needs to produce worlds that feel alive.

"Good enough" isn't a compromise. It's a discipline. It means understanding what actually matters and having the judgment to stop before you over-engineer.

Biology taught me this before either research or production did. Organisms aren't optimal. They're efficient. They're just good enough—and that's why they work.

What's Next

This is the third post in a series about building the MUSE Ecosystem. The first covered FONT and the origin story. The second covered BABEL and the divide between deterministic simulation and black-box AI.

Next, I'll write about Living Worlds—the philosophy behind emergent narrative and what it means to build stories that happen rather than stories that are told.

This post is part of a series on building the MUSE Ecosystem. Follow for updates on the architecture, the philosophy, and the path toward Living Worlds.

Nathan Baune

Neuroscientist & engineer building simulated worlds. Founder Gothic Grandma & Chief Architect of the MUSE Ecosystem.