PROJECT · 2022 · LEAD

Bytro: Modernizing a Legacy Multiplayer Game Backend

Led modernization of Bytro's PHP game backend into event-driven microservices with CQRS, cutting game-state query latency ~35% while keeping live players in active matches.

Screenshot of bytro.com — Bytro Labs multiplayer strategy games studio

Bytro builds real-time strategy games. Supremacy 1914. Conflict of Nations: WW3. Browser-based, real-time, massively multiplayer. Players coordinate global war campaigns over days or weeks. Matches have hundreds of players. State — unit positions, resource counts, diplomatic agreements, battle outcomes — must be consistent and visible to every participant at all times.

That’s the system I was asked to modernize.

What “legacy” actually meant

The original backend was PHP. Not PHP 8 with modern async handling and well-typed contracts — legacy PHP, years of it, with the accumulated decisions of a codebase that had grown to serve its players faithfully but had reached the point where every new feature was a negotiation with technical debt.

The problems weren’t subtle. State mutations for game events — a unit moving, a battle resolving, a diplomatic message being sent — were happening through a tangle of synchronous calls with unclear ownership. When multiple events arrived concurrently (which in a live game with hundreds of players is always, not sometimes), the system’s consistency guarantees were load-dependent. A quiet server handled it. A loaded server had races.

Read load was also a problem. Game state queries — every player refreshing their map, every client polling for updates — were hitting the same database path as writes. There was no caching strategy that could survive match peaks cleanly.

Latency at p99 was visibly affecting player experience. At the scales Bytro operated, that’s a retention problem, not just an engineering inconvenience.

The architectural bet: CQRS + event sourcing + Kafka

The core decision was to separate command handling from query handling at the architecture level, not just in code organization.

Commands — “unit moves from province A to province B,” “player declares war on faction X,” “resource trade is executed” — go through a command handler that publishes a domain event to Kafka. The event is the record of truth. The command handler does not write application state directly.

State is derived from events. Read models — the materialized projections that players query when they look at their map — are built by event consumers that update PostgreSQL read replicas and Redis caches. A read request never touches the command path. A write never touches the read path. They scale independently.
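The command/query split above can be sketched in a few lines. This is an illustrative minimal version, not Bytro's actual code: a Python list stands in for the Kafka topic and a dict stands in for the Redis projection, and all names are made up for the example.

```python
import json

# In-memory stand-ins: a list for the Kafka topic, a dict for the Redis
# projection. Event shapes and names here are illustrative only.
event_log = []        # the record of truth
map_read_model = {}   # what players query when they look at the map

def handle_move_command(unit_id, src, dst):
    """Command side: validate, then publish a domain event.
    Note it never writes application state directly."""
    event = {"type": "UnitMoved", "unit_id": unit_id, "from": src, "to": dst}
    event_log.append(json.dumps(event))  # "publish" to the log
    return event

def project_events():
    """Query side: a consumer folds events into the read model."""
    for raw in event_log:
        event = json.loads(raw)
        if event["type"] == "UnitMoved":
            map_read_model[event["unit_id"]] = event["to"]

handle_move_command("u-17", "Berlin", "Munich")
project_events()
print(map_read_model["u-17"])  # -> Munich
```

The point of the shape: the command handler and the projection consumer share nothing but the event log, which is what lets the two sides scale and fail independently.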

Event sourcing meant the game state at any point in time was reconstructible from the event log. That’s not just an architectural nicety — it’s the answer to “what happened to my unit” disputes, which are a real category of player support ticket in a game where decisions are consequential and players pay close attention.
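Answering a "what happened to my unit" ticket is then a fold over the log up to a timestamp. A minimal sketch, with invented event shapes:

```python
# Replaying the event log up to a timestamp reconstructs the disputed
# state. Events and field names are illustrative, not the real schema.
events = [
    {"ts": 100, "type": "UnitMoved", "unit": "u-1", "to": "A"},
    {"ts": 200, "type": "UnitMoved", "unit": "u-1", "to": "B"},
    {"ts": 300, "type": "UnitDestroyed", "unit": "u-1"},
]

def state_at(ts):
    """Fold events with timestamp <= ts into a unit-position snapshot."""
    units = {}
    for e in events:
        if e["ts"] > ts:
            break
        if e["type"] == "UnitMoved":
            units[e["unit"]] = e["to"]
        elif e["type"] == "UnitDestroyed":
            units.pop(e["unit"], None)
    return units

print(state_at(250))  # the unit was alive and in province B at ts=250
print(state_at(350))  # after the battle resolved: gone
```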

Keeping live players alive through the migration

This is the part that looks obvious in retrospect and is miserable to live through.

You cannot take Supremacy offline for a migration weekend. Players are mid-match. Some matches run for weeks. You cannot say “matches started before the cutover will be migrated to the new system; matches started after will run on the new system” — the number of in-flight game states makes that operationally impossible without a dedicated migration team that doesn’t exist.

The migration strategy was event-driven strangler fig: new functionality was implemented as event-producing services from day one. The legacy code path remained live and authoritative. We ran dual-write for the transition period — new events were published to Kafka, legacy state was still updated synchronously — which let us validate that the event-derived read models were consistent with the legacy source of truth before cutting over read traffic.

The dual-write period is where you find every assumption the legacy system made that wasn’t in the code. Every implicit ordering guarantee. Every race that the legacy system’s single-threaded execution accidentally prevented. Finding those was not fun. Not finding them in production was the point.
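The dual-write shape, reduced to its essentials: the legacy store stays authoritative, every mutation is also published as an event, and a reconciliation pass compares the event-derived projection against legacy before read traffic cuts over. A sketch with in-memory stand-ins and invented names:

```python
legacy_state = {}    # legacy synchronous store, still authoritative
event_log = []       # Kafka stand-in

def dual_write_move(unit_id, dst):
    """Transition-period write: legacy path and event publish, side by side."""
    legacy_state[unit_id] = dst
    event_log.append({"type": "UnitMoved", "unit_id": unit_id, "to": dst})

def build_projection():
    """What the event consumers would materialize into the read model."""
    proj = {}
    for e in event_log:
        if e["type"] == "UnitMoved":
            proj[e["unit_id"]] = e["to"]
    return proj

def reconcile():
    """Keys where projection and legacy disagree; empty means the
    read-model cutover is safe for this slice of state."""
    proj = build_projection()
    return {k for k in legacy_state.keys() | proj.keys()
            if legacy_state.get(k) != proj.get(k)}

dual_write_move("u-9", "Paris")
print(reconcile())  # empty set: no divergence
```

In practice the divergences this check surfaces are exactly the implicit ordering assumptions described above, which is why the reconciliation pass runs continuously during the transition rather than once.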

The ~35% latency reduction

The number comes from p95 and p99 game state query latency before and after the read model migration. Read paths hitting Redis materialized projections are not the same category of operation as read paths hitting a contested PostgreSQL table that’s also absorbing writes. This is not a surprise. The surprise would have been if it hadn’t improved.

The more interesting number is write-path latency, which improved less — Kafka publish latency is real, event consumer lag during peaks is real, and the command path is now asynchronous where it was previously synchronous. Players who were used to seeing their unit move immediately after clicking were now seeing a short async delay. That’s a UX tradeoff that required careful handling — the client-side optimistic update pattern covered most of it, but calibrating the timeout-and-reconcile behavior for cases where the event consumer was temporarily behind took iteration.
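The optimistic-update-with-reconcile pattern mentioned above looks roughly like this on the client side. A simplified sketch, with an invented timeout and class name: the move renders immediately, the server-confirmed projection wins when it arrives, and a missed confirmation rolls the unit back.

```python
import time

class ClientMap:
    """Client-side optimistic state: illustrative, not the real client."""
    def __init__(self):
        self.positions = {}
        self.pending = {}  # unit_id -> (previous_position, deadline)

    def optimistic_move(self, unit_id, dst, timeout=2.0):
        """Show the move immediately, remember how to undo it."""
        self.pending[unit_id] = (self.positions.get(unit_id),
                                 time.monotonic() + timeout)
        self.positions[unit_id] = dst

    def on_server_update(self, unit_id, confirmed_pos):
        """The event-derived read model is authoritative: server wins."""
        self.positions[unit_id] = confirmed_pos
        self.pending.pop(unit_id, None)

    def tick(self, now=None):
        """Roll back optimistic moves whose confirmation never arrived."""
        now = now if now is not None else time.monotonic()
        for unit_id, (old, deadline) in list(self.pending.items()):
            if now > deadline:
                self.positions[unit_id] = old
                del self.pending[unit_id]

m = ClientMap()
m.positions["u-3"] = "A"
m.optimistic_move("u-3", "B", timeout=0.0)
m.tick(now=time.monotonic() + 1)  # confirmation never came: rolled back
print(m.positions["u-3"])  # -> A
```

The calibration work described above lives in the `timeout` value and in `tick`: too short and a briefly lagging consumer makes units visibly snap back, too long and a genuinely lost command leaves a ghost unit on the map.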

Multi-squad coordination

The Bytro migration involved multiple squads: a platform squad handling infrastructure (Kafka, Kubernetes, deployment pipelines), domain squads handling individual game systems (combat, diplomacy, economics), and a client squad handling the frontend state synchronization changes.

Being Lead Developer across those squads meant managing the contract between them. The event schema was the contract. When the combat squad needed to add a field to the battle-resolution event, that was a schema migration that the client squad needed to handle, the analytics pipeline needed to handle, and the read-model consumers needed to handle — all without a flag day. We versioned events. This sounds obvious. Implementing it in a codebase that had never done it before is three weeks of work that nobody wants to do and everyone is glad you did.
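One common shape for the no-flag-day part is upcasting: consumers bring any older event version up to the newest schema at read time, so downstream code handles exactly one shape. A sketch with invented field names, not Bytro's actual schema:

```python
# Upcasting sketch: v2 of BattleResolved added a 'terrain' field. Old v1
# events get a default instead of forcing every producer and consumer to
# migrate on the same day. All names here are illustrative.

def upcast_battle_resolved(event):
    """Bring any stored version of BattleResolved up to v2."""
    version = event.get("version", 1)
    if version == 1:
        event = {**event, "version": 2, "terrain": "unknown"}
    return event

old = {"type": "BattleResolved", "version": 1, "winner": "p-1"}
new = {"type": "BattleResolved", "version": 2, "winner": "p-2",
       "terrain": "urban"}

print(upcast_battle_resolved(old)["terrain"])  # -> unknown (defaulted)
print(upcast_battle_resolved(new)["terrain"])  # -> urban (untouched)
```

The multi-squad discipline is in the rule this encodes: new fields must be additive with a sensible default, so the combat squad can ship its producer before the analytics pipeline and client have caught up.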

Kubernetes: the right tool, applied carefully

Autoscaling event consumers on Kubernetes during match peaks — tournament events, major updates, the weekend spike that always catches you if you’re not watching — was the right call. It was also the first time this codebase had been run on Kubernetes, which meant the stateless/stateful distinction that Kubernetes requires you to make explicit had to be applied retroactively to a codebase that had made that distinction implicitly and inconsistently.
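For a stateless consumer, the autoscaling piece can be as plain as a HorizontalPodAutoscaler on the consumer deployment. The fragment below is illustrative, with hypothetical resource names; scaling on Kafka consumer lag rather than CPU requires an external metrics adapter (KEDA or prometheus-adapter are common choices), which is a reasonable next step once lag, not CPU, is your real saturation signal.

```yaml
# Illustrative HPA for a stateless event-consumer deployment.
# Names are hypothetical; thresholds are examples, not recommendations.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: combat-consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: combat-event-consumer
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

One constraint worth remembering: the useful replica ceiling for a consumer group is the topic's partition count, so `maxReplicas` and the Kafka topology have to be designed together.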

PHP sessions stored in local memory on a single instance are not Kubernetes-native. We already knew this. Working through every place the legacy code had made that assumption was the unsexy prerequisite to everything else working.
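The fix for that assumption is always the same shape, whatever the language: session state moves out of process memory into a shared store keyed by session id, so any pod can serve any request. A minimal Python sketch of the idea, with a dict standing in for the Redis backend and invented names throughout:

```python
class SessionStore:
    """Externalized session state: illustrative interface only."""
    def __init__(self, backend):
        self.backend = backend  # a dict here; a Redis client in production

    def load(self, session_id):
        return self.backend.get(session_id, {})

    def save(self, session_id, data):
        self.backend[session_id] = data

shared = {}  # one shared backend, reachable from every pod
pod_a = SessionStore(shared)
pod_b = SessionStore(shared)

pod_a.save("sess-42", {"player": "p-7"})
print(pod_b.load("sess-42")["player"])  # a different pod sees the session
```

Once sessions live behind an interface like this, the consumer pods are genuinely stateless and the Kubernetes story, rescheduling, rolling deploys, autoscaling, stops fighting the application.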

What I owned

  • Architecture decisions for the CQRS/event sourcing model and Kafka event topology
  • Migration sequencing and dual-write strategy for live game state
  • Event schema design and versioning contracts across squads
  • Read model design (PostgreSQL projections, Redis cache layers)
  • Multi-squad coordination: platform, domain, and client engineering
  • Kubernetes deployment design for stateless event consumers

Games are load-variable in ways that most enterprise software isn’t. A tournament announcement at 14:00 on a Saturday is not in your capacity planning spreadsheet. Building a backend that can absorb that spike without leaving players staring at a spinner is a different class of problem than handling a predictable B2B request curve. I learned things about event consumer backpressure and lag alerting at Bytro that I use in every distributed system I design now.

Conflict of Nations: WW3 is still running. Supremacy 1914 is still running. The backend that serves them is meaningfully different from the one I inherited. That’s the outcome.