Maintaining Cross-Platform Functionality and Scalability While Gaining Gamers
I’ve been working on a massively multiplayer game for about half a year now, with some very high expectations attached to it. In doing so, we’re trying an approach in our technology stack that we haven’t used before. Our goal is to enable a massive number of clients to connect to the game simultaneously while maintaining cross-platform functionality and scalability.
The big picture
frontend –(via WebSockets)-> Node.js –> Redis –> PHP workers –> Persistence via MySQL / PostgreSQL / MongoDB –aaand-> back!
We began with a single HTML5 front end driving all our gamers’ experiences. However, we ran into some severe performance problems, so we decided to explore a service-oriented architecture. Our new system supports several frontends, which connect via WebSockets. The frontend technology is not important as long as it speaks WebSockets; common frontends include JavaScript clients, iOS apps, and Android apps. In this setup, the frontend sends simple JSON messages to a Node.js application. Those messages say who is addressed (the controller) and what has to be done (the action), along with some data. The Node.js application registers and stores the WebSocket connection and pushes the message onto a Redis queue.
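To make that concrete, a client message might look something like the sketch below. The controller and action names and the data fields are purely illustrative, not our actual protocol:

var msg = {
    controller: "player",          // who is addressed
    action: "move",                // what should happen
    data: { x: 12, y: 7 }          // payload for that action
};
socket.send(JSON.stringify(msg));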
We use our Redis box as a simple queue and take advantage of some of the more advanced Redis features such as PubSub. The Node backend stores the socket’s ID along with the message and enqueues it in Redis:
receive: function (socket, msg) {
    // Tag the message with the socket's identity so the response can find its way back later.
    msg.client = socket.identity;
    // Push the JSON-encoded message onto the Redis list that the PHP workers consume.
    pub.rpush(config.push_list, JSON.stringify(msg));
}
All the heavy lifting is done by PHP workers listening on that Redis queue:
while (!$this->terminate && $this->maxRequests > 0) {
    if ($message = $this->queue->listen()) {
        $logger->logWorker($message);
        try {
            // Abort the job if it runs longer than the allowed execution time.
            pcntl_alarm($this->max_execution_time);
            $dispatcher->route($message);
        } catch (\Exception $e) {
            $this->handleException($e, $message, $logger);
        }
    }
    // Clear the alarm once the message has been handled.
    pcntl_alarm(0);
}
Due to the open nature of our messaging infrastructure, we could just as well use something else here, Ruby for example. The worker really has a simple job to do: take the incoming data, validate it, react, and put the corresponding response back into Redis. Since the serious heavy lifting is done on the client side, the service stays fast, available, responsive, and scalable.
Every game-logic-related action happens in that validation and reaction phase. While PHP is not necessarily built to run as a daemon, that is exactly how our PHP workers run. To make the system resilient to errors, we monitor the daemons and, since they use only our queue and the database to store state, it is safe to simply restart one whenever it dies. This also lets us introduce more advanced monitoring techniques, like restarting workers that leak memory and grow past a set RAM boundary, or randomly killing workers to make sure the error recovery works as expected.

While PHP itself has no particular problem with memory consumption, leaks in your own code are not unlikely. As long as you monitor your workers, that is not a problem either: whenever a worker’s consumption exceeds a specific threshold, the daemon in question gets killed. The monitor then notices that the desired number of daemons is no longer running and simply spawns a new one.
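A minimal watchdog along those lines might look like the following sketch in Node. This is not our actual monitor; the worker command, pool size, memory limit, and the use of ps to read resident memory are all assumptions for illustration:

var childProcess = require("child_process");

var WORKER_CMD  = "php";            // assumed worker entry point
var WORKER_ARGS = ["worker.php"];   // hypothetical script name
var MAX_RSS_KB  = 256 * 1024;       // assumed limit: 256 MB per worker
var POOL_SIZE   = 4;                // assumed number of daemons to keep alive

function startWorker() {
    var child = childProcess.spawn(WORKER_CMD, WORKER_ARGS, { stdio: "inherit" });

    // Poll the worker's resident memory via `ps`; kill it once it grows past the limit.
    var timer = setInterval(function () {
        childProcess.execFile("ps", ["-o", "rss=", "-p", String(child.pid)], function (err, stdout) {
            if (err) { return; }                                   // process already gone
            if (parseInt(stdout, 10) > MAX_RSS_KB) { child.kill("SIGTERM"); }
        });
    }, 10000);

    // Whenever a worker exits (crash, kill, or memory cull), replace it.
    child.on("exit", function () {
        clearInterval(timer);
        startWorker();
    });
}

for (var i = 0; i < POOL_SIZE; i++) { startWorker(); }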
By the way, we don’t rely on any framework for the daemonization of PHP. All we do is more or less a call to pcntl_fork().
In this configuration, a worker can do its job within 2-5 milliseconds when it only has to read data and send it back. When the database result is cached, the job is done even faster.
The response travels back through Redis to Node, which has subscribed to those events and sends the results on to the clients. If multiple clients are connected for the same gamer, Node fans the results out to all of them, so you can play on your iPad and your phone at the same time and see the same game.
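A minimal sketch of that response path, assuming the workers publish their results on a Redis channel; the channel name and message fields are illustrative, and socketsFor() stands in for the connection lookup described under Scaling below:

var redis = require("redis");
var sub = redis.createClient();

sub.subscribe("responses");                        // hypothetical channel name
sub.on("message", function (channel, payload) {
    var response = JSON.parse(payload);
    // Fan out: every socket registered for this gamer gets the same update.
    socketsFor(response.client).forEach(function (socket) {
        socket.send(payload);
    });
});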
Scaling
This system can be scaled by adding Node servers as needed. Because they simply function as gatekeepers and messengers between the game and the clients, their load is not very high; if they ever become a bottleneck, we can scale them horizontally. The same is pretty much true for the PHP workers.
Scaling the Redis queue might get a bit trickier. Redis’ PubSub performance is bound by the number of subscribers rather than publishers, and published messages are propagated transparently to all slaves, so this is solvable by adding more slaves to the Redis cluster.
Our first, simple implementation looked up connections by looping through the whole pool of clients, and it did this for each incoming request, which turned out to be a major bottleneck. The easy fix was to store connections in a hash table keyed by the gamer’s identity. Each incoming request now does a single lookup in the hash and immediately has the right connection at hand.
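Such a registry can look roughly like the sketch below; the helper names are ours, for illustration only:

var connections = {};   // identity -> array of open sockets for that gamer

function register(socket) {
    var id = socket.identity;
    (connections[id] = connections[id] || []).push(socket);

    // Drop the socket from the registry again once it closes.
    socket.on("close", function () {
        connections[id] = (connections[id] || []).filter(function (s) {
            return s !== socket;
        });
        if (connections[id].length === 0) { delete connections[id]; }
    });
}

// One hash lookup instead of scanning the whole pool of clients.
function socketsFor(identity) {
    return connections[identity] || [];
}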
When it comes to optimizing performance, there is still plenty of room in our own code. Right now the various clients hammer the backend with requests that could be condensed and intelligently bundled. We could also add functionality to the Node layer to split such combined messages back up, so that the actual game logic in PHP would not even notice that bundled messages are coming in from the client.
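If we went that route, the split in Node could be as simple as the following sketch; the bundle format is hypothetical:

// Turn one bundled client message into individual messages for the PHP workers.
function unbundle(bundle) {
    return bundle.actions.map(function (entry) {
        return {
            client:     bundle.client,
            controller: entry.controller,
            action:     entry.action,
            data:       entry.data
        };
    });
}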
As a last resort, we could also add more caching. Node could answer many of the incoming requests from the clients directly: large parts of the game data change infrequently and are perfect candidates for caching. Not going through the whole queueing process and hitting a PHP worker would free up lots of resources down the line.
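A sketch of how such a cache could sit in front of the queue in Node, reusing the pub and config objects from the enqueue snippet above; the handle() name, cache key, and TTL are assumptions:

var cache = {};                       // key -> { value: ..., expires: timestamp }
var TTL = 30 * 1000;                  // assumed 30-second lifetime

function handle(socket, msg) {
    var key = msg.controller + ":" + msg.action;
    var hit = cache[key];
    if (hit && hit.expires > Date.now()) {
        socket.send(hit.value);       // answer straight from Node, skip the queue
        return;
    }
    msg.client = socket.identity;
    pub.rpush(config.push_list, JSON.stringify(msg));
}

Whenever a response for a cacheable key comes back, Node would store it with cache[key] = { value: payload, expires: Date.now() + TTL }.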
All in all, we are pretty happy with our current setup. It is unlike anything we have done before for games in production, but so far things are working smoothly. The combination of WebSockets, Node.js, and PHP workers is superfast and reliable.
Share your thoughts with @engineyard on Twitter