Build Diary

Devlog

Notes from the process of building something from nothing. Unpolished, chronological, honest.

shipped

Block storage hits production

Six months of SPDK wrestling, NVMe-oF debugging at 3am, and one complete rewrite of the snapshot logic. It's live. Thick-provisioned volumes with online expand, tiered IOPS presets, synchronous mirroring. The monitoring dashboard finally shows green across all zones.

The hardest part wasn't the storage engine — it was the integration surface. Getting IAM policy evaluation to work correctly at the volume-attach boundary took three attempts. Lesson: auth at the data plane is a different beast than auth at the control plane.

learned

RDMA is not your friend (at first)

Spent a week debugging intermittent IO errors on the NVMe-oF path. Turned out to be an MTU mismatch on one switch port causing RDMA retransmissions that looked like target-side errors in our logs. The fix was one line in a network config.

Takeaway: when working below TCP, you lose the diagnostics you're used to. Build your own observability from day one, not after the first incident.

idea

Snapshot cloning could be instant

Current clone-from-snapshot does a full copy. But if we track snapshot blocks as CoW references, cloning becomes a metadata operation. The volume appears instantly and only diverges as writes come in. Need to think about GC implications — orphaned snapshot chains could get expensive.
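
A rough sketch of what that could look like — the Snapshot/Volume types, the allocate callback, and the block-map layout are made up for illustration, not the current engine's structures:

```go
package storage

// Sketch of clone-as-CoW-reference (types and names are hypothetical):
// a clone holds a pointer to its parent snapshot instead of copying data.
// Reads fall through to the snapshot; writes allocate a private block.

type Snapshot struct {
	Blocks   map[uint64]uint64 // logical block -> physical block
	RefCount int               // live clones; the input GC needs for orphaned chains
}

type Volume struct {
	parent  *Snapshot
	private map[uint64]uint64 // blocks that have diverged from the snapshot
}

// CloneFromSnapshot is pure metadata: no block is read or copied,
// so the new volume appears instantly.
func CloneFromSnapshot(s *Snapshot) *Volume {
	s.RefCount++ // keeps the snapshot chain alive until the clone goes away
	return &Volume{parent: s, private: make(map[uint64]uint64)}
}

// Read resolves a logical block: private copy first, then the snapshot.
func (v *Volume) Read(lba uint64) (pba uint64, ok bool) {
	if pba, ok = v.private[lba]; ok {
		return pba, true
	}
	pba, ok = v.parent.Blocks[lba]
	return pba, ok
}

// Write diverges a block on first write; the snapshot's copy stays untouched.
func (v *Volume) Write(lba uint64, allocate func() uint64) uint64 {
	if pba, ok := v.private[lba]; ok {
		return pba
	}
	pba := allocate()
	v.private[lba] = pba
	return pba
}
```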

shipped

IAM v2: OIDC federation

The IAM service now supports OpenID Connect for external identity providers. Took the AWS approach of assume-role via web identity tokens. The tricky bit was the JWKS caching and key rotation — you can't call the IdP on every request at the volume attach path.

Ended up with a local JWKS cache that refreshes on signature failure, with a circuit breaker. Simple, robust, boring. Which is what auth infrastructure should be.
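
Roughly this shape, as a sketch rather than the production code — the key type, the fetch function, and how signatures get verified are placeholders for whatever JWT/JWKS library is actually in use:

```go
package authn

import (
	"errors"
	"sync"
	"time"
)

// Sketch of a refresh-on-failure JWKS cache with a circuit breaker.
// fetch pulls the JWKS document from the IdP; keys are indexed by kid.

type jwksCache struct {
	mu        sync.Mutex
	keys      map[string]any // kid -> public key
	fetch     func() (map[string]any, error)
	failures  int       // consecutive failed refreshes
	openUntil time.Time // circuit breaker: no IdP calls before this instant
}

var errCircuitOpen = errors.New("jwks refresh circuit open")

// keyFor returns the key for a kid, refreshing from the IdP only when the
// kid is unknown (i.e. after a lookup/signature failure), never per request.
func (c *jwksCache) keyFor(kid string) (any, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if k, ok := c.keys[kid]; ok {
		return k, nil
	}
	// Unknown kid: probably key rotation. Refresh, unless the breaker is open.
	if time.Now().Before(c.openUntil) {
		return nil, errCircuitOpen
	}
	fresh, err := c.fetch()
	if err != nil {
		c.failures++
		// Back off harder with each consecutive failure so a flapping IdP
		// can't take the volume-attach path down with it.
		c.openUntil = time.Now().Add(time.Duration(c.failures) * 5 * time.Second)
		return nil, err
	}
	c.failures = 0
	c.keys = fresh

	if k, ok := c.keys[kid]; ok {
		return k, nil
	}
	return nil, errors.New("unknown key id after refresh")
}
```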

fixed

The ARP storm incident

Our ARP proxy had a subtle bug: when a VM migrated between hosts, the proxy on the old host would keep responding to ARP requests for ~30 seconds (stale cache TTL). During that window, some packets were routed to the old host, where the VM no longer lived.

Fix: gratuitous ARP broadcast on migration + immediate cache invalidation via the control plane. Also added a "conflicting ARP responses" metric, which is now our canary for future issues.
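
For reference, the gratuitous ARP itself is just a 42-byte broadcast frame whose sender IP equals its target IP. A minimal sketch of building one (sending it still needs an AF_PACKET or equivalent raw socket, and the helper name is made up):

```go
package netctl

import (
	"encoding/binary"
	"net"
)

// Build the gratuitous ARP frame the new host broadcasts when migration
// completes, so switches and neighbors relearn the VM's location
// immediately instead of waiting out a stale cache TTL.
func gratuitousARP(vmMAC net.HardwareAddr, vmIP net.IP) []byte {
	frame := make([]byte, 42)
	broadcast := net.HardwareAddr{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}

	// Ethernet header: broadcast destination, VM's MAC as source, EtherType ARP.
	copy(frame[0:6], broadcast)
	copy(frame[6:12], vmMAC)
	binary.BigEndian.PutUint16(frame[12:14], 0x0806)

	// ARP payload: sender IP == target IP is what makes it "gratuitous".
	binary.BigEndian.PutUint16(frame[14:16], 1)      // HTYPE: Ethernet
	binary.BigEndian.PutUint16(frame[16:18], 0x0800) // PTYPE: IPv4
	frame[18] = 6                                    // HLEN
	frame[19] = 4                                    // PLEN
	binary.BigEndian.PutUint16(frame[20:22], 1)      // OPER: request
	copy(frame[22:28], vmMAC)                        // sender MAC
	copy(frame[28:32], vmIP.To4())                   // sender IP
	// bytes 32-37 (target MAC) stay zero
	copy(frame[38:42], vmIP.To4())                   // target IP == sender IP

	return frame
}
```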

shipped

GitHub Actions on Firecracker

CI runs on ephemeral microVMs now. Each job gets a fresh Firecracker instance — boots in ~125ms, runs the job, gets destroyed. No state leaks between jobs, no "works on the runner" bugs from accumulated cruft.
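
The boot sequence is just a few PUTs against Firecracker's API socket. A sketch of the idea, assuming a firecracker process is already listening on its --api-sock and using placeholder kernel/rootfs image paths:

```go
package ci

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"net/http"
)

// bootMicroVM configures and starts a fresh job VM via Firecracker's
// REST API, which lives on a Unix domain socket rather than TCP.
func bootMicroVM(sockPath string) error {
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", sockPath)
			},
		},
	}

	put := func(path, body string) error {
		req, err := http.NewRequest(http.MethodPut, "http://localhost"+path,
			bytes.NewBufferString(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("PUT %s: %s", path, resp.Status)
		}
		return nil
	}

	// Kernel, root filesystem, then start — three PUTs and the VM is up.
	steps := []struct{ path, body string }{
		{"/boot-source", `{"kernel_image_path":"/images/vmlinux","boot_args":"console=ttyS0 reboot=k panic=1 pci=off"}`},
		{"/drives/rootfs", `{"drive_id":"rootfs","path_on_host":"/images/runner-rootfs.ext4","is_root_device":true,"is_read_only":false}`},
		{"/actions", `{"action_type":"InstanceStart"}`},
	}
	for _, s := range steps {
		if err := put(s.path, s.body); err != nil {
			return err
		}
	}
	return nil
}
```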

The satisfying part: going from "push to GitHub" to "microVM booted and running" in under 2 seconds. That feedback loop changes how you work.

learned

DNS is deceptively simple

Building an authoritative DNS server from scratch teaches you that DNS is 10% protocol and 90% operational edge cases. Zone transfers with TSIG seem straightforward until you handle: incremental vs. full transfers, serial number wrapping, multiple secondaries with different sync states, and the fun case where your primary restarts mid-transfer.
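
Serial number wrapping alone is a good example: SOA serials compare under RFC 1982 sequence-space arithmetic, so 1 counts as newer than 4294967295. A minimal sketch of that comparison (a secondary would kick off a transfer only when it returns true):

```go
package dns

// serialNewer reports whether candidate is newer than current under
// RFC 1982 serial number arithmetic, as used for the SOA SERIAL field.
// The serial space is 32 bits and wraps, so plain > is wrong near the edge.
func serialNewer(current, candidate uint32) bool {
	if current == candidate {
		return false
	}
	// candidate is newer if it sits "ahead" of current by less than 2^31.
	return (candidate > current && candidate-current < 1<<31) ||
		(candidate < current && current-candidate > 1<<31)
}
```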

The spec is clear. Reality is not. Every RFC has an "and then implementations diverge" gap.