02-19-2011, 03:51 PM
Originally Posted by Rehpic
The number of people logged in at 11:30am PST today (the time things started going bad in each of the last two weeks) was about the same as it was in those weeks. The difference is that we fixed several bugs.

The main problem was a bug that caused mission map instances to stick around for a few minutes after the player(s) exited, when they should have closed immediately. That led to the shard getting overloaded (both CPU and RAM), which triggered a couple of other bugs. First, there was a problem with how the load balancer dealt with machines that were getting close to full, which caused already overloaded machines to get even more overloaded. Second, there was a problem with our error recovery when a machine started running out of memory: it required manual intervention by an ops person to clean up, rather than cleaning up automatically.
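
To illustrate the load balancer point, here is a rough sketch of a placement rule that stops sending new map instances to machines that are close to full. It is not the actual server code; `Machine`, `pick_machine`, `close_instance`, and the 0.85 high-water mark are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    """A hypothetical shard machine that hosts map instances."""
    name: str
    capacity: int        # assumed: max map instances the machine can host
    instances: int = 0   # map instances currently running

    @property
    def load(self) -> float:
        # Fraction of capacity in use, from 0.0 (empty) to 1.0 (full).
        return self.instances / self.capacity


def pick_machine(machines: List[Machine],
                 high_water: float = 0.85) -> Optional[Machine]:
    """Return the least-loaded machine below the high-water mark, or None.

    Machines at or above the mark are skipped entirely, so a nearly full
    machine never receives additional instances.
    """
    candidates = [m for m in machines if m.load < high_water]
    if not candidates:
        return None  # everything is near full: caller should queue or reject
    return min(candidates, key=lambda m: m.load)


def close_instance(machine: Machine) -> None:
    """Release an instance as soon as the last player leaves."""
    machine.instances = max(0, machine.instances - 1)
```

The point of the sketch is only that a nearly full machine gets filtered out before the least-loaded choice is made, and that closing an instance frees capacity right away instead of a few minutes later.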

Those problems have all been fixed, and things are running very smoothly today.
Some friends took bets on when the server would go down today, but I announced that I thought you guys had finally gotten it all sorted out. My gold-pressed latinum and I are glad to see our trust in you guys was well placed.