From Stateful Stream Processing to Stateful Sandbox

Five years ago, we started building RisingWave, an open-source streaming database. There was no AI writing code back then — every line was written by hand.
Technically, we were tackling one of the hardest problems in the database world: stateful computation — keeping a system running continuously while executing complex operators that hold state, sustaining high throughput, absorbing workload spikes, recovering in seconds from node failures, and keeping all of that state consistent. The system maintains a continuously correct, queryable state over streaming data. Aggregations, joins, materialized views — every operator holds state, and every piece of state must be reliably managed.
After five years of refinement, RisingWave is running in production at thousands of companies. Production environments are unforgiving — workload spikes, node failures, and storage hiccups happen constantly. What keeps things standing is the architectural direction we chose from day one: compute-storage separation, S3 as primary storage, fully stateless compute nodes.
On-Call and AI SRE
Streaming database users aren't running offline reports. They're running real-time fraud detection, payment settlement, and live monitoring alerts — all on the critical path. One minute of downtime means one minute of transactions halted, fraud checks disabled, alerts missed. On-call isn't "fix it tomorrow" — it's "fix it now." And distributed system failures don't respect working hours — 3 AM alerts, cascading failures across dozens of nodes, a SQL query hitting an unseen corner case that stalls checkpointing. This is everyday life.
So we started building AI SRE. The most painful parts of incident response — alert triage, log analysis, fault diagnosis — are largely pattern-based. An experienced SRE sees a certain alert and automatically runs through a mental playbook. These playbooks can be encoded for AI agents.
But a useful AI SRE agent can't just be a chatbot that reads logs and gives advice. It has to take action — connect to the meta node to check barrier status, run diagnostic scripts to scan shared buffer backlogs across compute nodes, replay problematic SQL in a staging environment to reproduce bugs, analyze crash dumps hundreds of megabytes in size. These are heavy operations, some potentially destructive. You don't want an AI agent running diagnostic scripts directly on production machines.
So the agent needs an isolated execution environment — a sandbox. Then we discovered that existing sandbox solutions simply don't work.
The Fundamental Problem with Ephemeral Sandboxes
The AI agent sandbox market is already quite hot — Firecracker microVMs, gVisor containers, V8 isolates — isolation technologies are flourishing. But if you look closely at the architecture of these solutions, they're all fundamentally designed around ephemeral execution: a sandbox is created, code runs, and when it finishes or times out, the sandbox is reclaimed — session limits range from tens of minutes to twenty-four hours, after which all state is destroyed. Some solutions aggressively reclaim idle sandboxes for resource efficiency, causing multi-second cold start delays.
For running a simple Python script, ephemeral sandboxes are fine. But AI agents work very differently.
When an AI SRE agent investigates an incident, it first spends several minutes setting up the environment — installing diagnostic tools, pulling logs, configuring production-matching connection info. Then it runs analyses, generates intermediate files, starts auxiliary processes. The sandbox accumulates significant state. Then the agent pauses — waiting for human approval, or simply because one round of LLM conversation has ended.
Under the ephemeral model, pause = state death. Next time, rebuild from scratch. This isn't just wasted time — the intermediate state accumulated during the previous exploration often can't be reconstructed, because the reasoning chain that produced it is gone.

This problem isn't unique to AI SRE. Coding agents need to persist development environments across sessions. Browser agents need to maintain browser context. RL training needs snapshot and restore. Everyone seriously building AI agents eventually hits the same wall: sandboxes need state.
From an Isolation Problem to a State Management Problem
Most sandbox solutions focus their energy on isolation — how to prevent untrusted code from escaping. This matters, but it's only half the problem. Once a sandbox needs to be stateful, you're no longer dealing with an isolation problem — you're dealing with a state management problem.
Specifically: state can't live only on local disk — that ties the sandbox to a single machine, and if the machine dies, the state is gone. Filesystem changes need to be persisted to object storage, with local disk serving only as cache.
Sandboxes need snapshots, and they must be incremental — snapshotting an entire filesystem each time is unusable in production. Any snapshot should serve as a rollback point — if the agent breaks the environment, revert to the last good state. From a snapshot, you should also be able to fork new instances, and forks must be copy-on-write so agents can explore multiple paths at low cost.
Compute-storage separation means idle sandboxes can release compute resources while retaining disk state, and be restored when needed. At scale, this directly determines cost.
Isolation still can't be compromised. AI agents execute untrusted code — each sandbox should be a micro-VM running its own Linux kernel with hardware-level isolation. Not container namespace isolation — breaking out requires a hypervisor exploit, not a kernel exploit.
If you've built any kind of long-running stateful system — databases, stream processing, distributed storage — you've seen all of these challenges before. They're universal state management problems, just in a different context.
State Management: From Stream Processing to Sandbox
We spent five years solving stateful computation's state management challenges in RisingWave. When we started thinking about how to build a stateful sandbox, we found that many of the core challenges are shared.
Persistence: S3 as source of truth. RisingWave chose S3-as-primary-storage from day one. Our custom storage engine Hummock writes all state as immutable SSTables to S3, organized by table ID and epoch, never doing in-place updates. This makes compute nodes truly stateless — if one dies, spin up a new one, pull state from S3, recover in seconds.
For sandboxes, the same principle applies: filesystem state needs to be persisted to object storage, with local disk serving only as cache. But the shape of sandbox state is different from a database — it's not KV pairs, but an entire filesystem (OS, packages, user files, intermediate data from running processes). How to efficiently sync filesystem changes to S3 is one of the core problems we're currently exploring.
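One natural approach (a toy sketch, not how BoxLite or Hummock actually implements it) is content-addressed incremental sync: hash each file, compare against a manifest from the previous sync, and upload only content the object store hasn't seen. Here the "object store" and "filesystem" are plain dicts standing in for S3 and the sandbox disk:

```python
import hashlib

def sync_incremental(local_fs, object_store, manifest):
    """Upload only files whose content hash changed since the last sync.

    local_fs: dict path -> bytes (stand-in for the sandbox filesystem)
    object_store: dict hash -> bytes (stand-in for S3, content-addressed)
    manifest: dict path -> hash recorded by the previous sync
    Returns the new manifest and the number of objects uploaded.
    """
    new_manifest, uploads = {}, 0
    for path, data in local_fs.items():
        digest = hashlib.sha256(data).hexdigest()
        new_manifest[path] = digest
        if manifest.get(path) != digest and digest not in object_store:
            object_store[digest] = data  # upload only unseen content
            uploads += 1
    return new_manifest, uploads

store = {}
manifest, n = sync_incremental({"/etc/app.conf": b"v1", "/var/log/a": b"x"}, store, {})
print(n)  # first sync uploads everything -> 2
manifest, n = sync_incremental({"/etc/app.conf": b"v2", "/var/log/a": b"x"}, store, manifest)
print(n)  # only the changed file is uploaded -> 1
```

Sync cost scales with the amount of change rather than the filesystem size, the same property the incremental checkpointing below relies on.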
Checkpointing: must be incremental. RisingWave implements epoch-based asynchronous checkpointing: the meta node injects a barrier into the data stream every second; upon receiving the barrier, operators asynchronously dump local state to a shared buffer, which is then uploaded to S3 in the background — checkpointing doesn't block data processing. The key is that checkpoints are incremental, only persisting changed data.
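The barrier mechanism can be reduced to a toy model (a sketch for intuition, not RisingWave's actual code): each operator tracks which keys changed since the last barrier, and on a barrier it dumps only that delta to a shared buffer tagged with the epoch:

```python
class Operator:
    """Toy stateful operator that tracks keys dirtied since the last barrier."""

    def __init__(self):
        self.state, self.dirty = {}, set()

    def process(self, key, value):
        # e.g. a running SUM aggregation over the stream
        self.state[key] = self.state.get(key, 0) + value
        self.dirty.add(key)

    def on_barrier(self, epoch, shared_buffer):
        # Dump only the changed keys; the real system uploads the
        # buffer to S3 in the background, off the processing path.
        delta = {k: self.state[k] for k in self.dirty}
        shared_buffer.append((epoch, delta))
        self.dirty.clear()
        return delta

buffer = []
op = Operator()
op.process("a", 1)
op.process("b", 2)
op.on_barrier(1, buffer)          # epoch 1 checkpoints {"a": 1, "b": 2}
op.process("a", 5)
delta = op.on_barrier(2, buffer)  # epoch 2 checkpoints only the changed key
print(delta)                      # {'a': 6}
```

The second checkpoint carries one key, not the whole state — the property that makes per-second checkpointing affordable.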
Sandbox snapshots face the same constraint — full snapshots of an entire filesystem are too slow and too expensive. A natural approach is to leverage copy-on-write disk formats (such as QCOW2's overlay mechanism): each snapshot freezes the current layer and creates a new overlay that only records subsequent writes. This way, snapshot cost scales with the amount of change, not the total filesystem size.
Rollback and Fork. When a RisingWave node fails, it loads the latest checkpoint from S3, replays the last few seconds of data, and recovers in seconds. For sandboxes, if each snapshot is an overlay layer, rollback means discarding overlays after the target point; fork means creating multiple independent overlay chains from the same snapshot.

The agent can fork a branch at snap-2 to try plan B while the original sandbox continues with plan A, keeping whichever result is better. If neither works, roll back to snap-1 and start over. Each fork shares all previous overlay data — only new writes consume additional space.
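The overlay-chain mechanics of snapshot, fork, and rollback can be sketched with Python's `ChainMap`, treating each overlay as a dict layered over shared, frozen layers below (a toy model of copy-on-write, not QCOW2 itself):

```python
from collections import ChainMap

def snapshot(chain):
    """Freeze the current layer and start a new writable overlay (copy-on-write)."""
    return chain.new_child()  # empty dict in front; layers below are shared

base = ChainMap({"kernel": "6.1", "tool": "v1"})  # snap-1: the base image
snap2 = snapshot(base)
snap2["tool"] = "v2"          # writes land only in the top overlay

# Fork: both branches share every layer at and below snap-2; only new writes diverge.
plan_a, plan_b = snapshot(snap2), snapshot(snap2)
plan_a["result"] = "A"
plan_b["result"] = "B"
print(plan_a["result"], plan_b["result"], plan_a["tool"])  # A B v2

# Rollback: discard the overlays above the target snapshot.
rolled_back = snap2.parents   # back to snap-1's view
print(rolled_back["tool"])    # v1
```

Reads fall through the chain to the nearest layer holding the key, writes touch only the top layer, and a fork costs one empty dict — the same asymptotics a COW disk format gives you at block granularity.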
Compute-storage separation and elasticity. RisingWave's compute nodes are stateless — scaling means adding or removing compute nodes without migrating data. LSM-tree compaction is offloaded to dedicated compactor nodes, avoiding resource contention with computation. For sandboxes, compute-storage separation means idle sandboxes can pause — releasing CPU and memory while disk state stays in the persistence layer — and cold-start back when needed. In large-scale agent deployments, most sandboxes are idle at any given moment, making this a direct cost driver.
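The pause/resume lifecycle reduces to a simple invariant, sketched below with dicts standing in for compute and the S3-backed persistence layer (hypothetical names, not BoxLite's API): a paused sandbox holds no compute resources, only a reference to its persisted disk state.

```python
class SandboxPool:
    """Toy scheduler: paused sandboxes keep disk state, not compute."""

    def __init__(self, persistence):
        self.persistence = persistence  # stand-in for the S3-backed layer
        self.running = {}               # sandbox_id -> in-memory state

    def pause(self, sid):
        # Release CPU/memory; disk state moves to the persistence layer.
        self.persistence[sid] = self.running.pop(sid)

    def resume(self, sid):
        # Cold start: re-attach compute to the persisted disk state.
        self.running[sid] = self.persistence[sid]
        return self.running[sid]

pool = SandboxPool(persistence={})
pool.running["sb-1"] = {"cwd": "/root", "files": ["report.txt"]}
pool.pause("sb-1")
print(len(pool.running))  # 0 -- no compute held while paused
state = pool.resume("sb-1")
print(state["files"])     # ['report.txt'] -- state survives the pause
```

In a real deployment the interesting part is what this sketch elides: how fast `resume` can be, which is exactly where the incremental, S3-backed snapshot design pays off.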
Isolation. This is a dimension unique to sandboxes. Stream processing operators run in a trusted environment and don't need strong isolation. But AI agents execute untrusted code — each sandbox should be a micro-VM running its own Linux kernel, with hardware-level isolation via KVM or Hypervisor.framework. Not container namespace isolation — breaking out requires a hypervisor exploit, not a kernel exploit. BoxLite implements this layer using libkrun — a lightweight KVM-based VMM library that provides near-container startup speed and resource overhead, but with VM-level isolation strength.
Side by side:

  Challenge        RisingWave                           Stateful sandbox
  Persistence      Immutable SSTables on S3             Filesystem synced to object storage
  Checkpointing    Incremental, barrier-based epochs    Incremental copy-on-write overlays
  Rollback / fork  Replay from the latest checkpoint    Discard or branch overlay chains
  Elasticity       Stateless compute nodes              Pause and resume sandboxes
  Isolation        Trusted environment                  Hardware-isolated micro-VMs
Sandboxes Should Be Embedded
Everything above is about cloud-level architecture. But I believe sandboxes should first be embedded.
Look at the database world. The hottest databases of recent years aren't the most feature-rich cloud databases — they're SQLite and DuckDB — because they're embedded, just import and use. This doesn't mean cloud doesn't matter. It means that trying an idea or validating a scenario on your own machine is always the most natural, fastest way. Get it working locally, understand it, then decide whether to move to the cloud.
BoxLite applies this philosophy to sandboxes. pip install boxlite, three lines of code to spin up a hardware-isolated micro-VM locally. No daemon, no root, no complex deployment. When a sandbox becomes a library you can import, every AI agent developer can use it directly — start locally, connect to cloud-based S3 persistence and elastic scaling when needed. Local-first, cloud-ready.
The Future of Agent Infra
BoxLite was started by my friend Dorian Zheng in mid-2025. When I began exploring the possibilities of sandboxes with him late last year, we increasingly realized two things: first, stateful sandboxes are a severely underestimated developer pain point; second, the cloud-native stateful system experience we accumulated at RisingWave is almost directly transferable to this domain. This alignment wasn't planned — it emerged from doing the work.
Agentic AI is just getting started. Today, attention is still focused on model capabilities — smarter reasoning, longer context, better tool use. But when agents truly start running at scale, the bottleneck will inevitably shift from the model layer to the infrastructure layer. Agents need reliable execution environments, state management, and secure isolation. These problems don't have good answers yet.
This is the best time to build agent infrastructure.
BoxLite is open source on GitHub: github.com/boxlite-ai/boxlite