Unexpected Downtime (March 11, 2012)
03-11-2012, 07:34 PM
In case you missed this the other day, from STO Lead Programmer Rehpic:
Originally Posted by
The free-to-play launch of STO was supported by a substantial upgrade to the hardware used to run the live game, and that hardware is more than adequate to support our current user load, which is substantially higher than before the F2P launch.
We have had 6 database crashes, which have brought the game down, over the last month. In 3 of those cases we deployed an already scheduled patch and were able to cancel a scheduled downtime.
The crashes are occurring because of a software bug which is corrupting a data structure used to keep track of large chunks of memory that are being kept around for future use (the term of art among programmers is "Heap Corruption"). These types of bugs are often the most subtle and difficult to track down in large software systems. The crash only occurs after 3-4 days of running under heavy load, and despite our best efforts we have been unable to reproduce it in the lab.
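For anyone curious what that kind of bug looks like, here is a minimal toy sketch in Python. It is not Cryptic's code and says nothing about their actual allocator. The `ToyHeap` class, the `MAGIC` marker and the block layout are all made up for illustration. The idea it shows: an allocator keeps bookkeeping data right next to the memory it hands out, so one write that runs a few bytes too long can silently damage that bookkeeping, and nothing goes wrong until the damaged data is looked at later.

```python
import struct

MAGIC = 0xC0FFEE42              # hypothetical marker stored in every block header
HEADER = struct.Struct("<II")   # header layout: (magic, payload size), 8 bytes


class ToyHeap:
    """Toy allocator over one bytearray; each block is a header plus payload."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.top = 0        # offset of the next unused byte
        self.blocks = []    # header offsets, in allocation order

    def alloc(self, size):
        start = self.top
        if start + HEADER.size + size > len(self.mem):
            raise MemoryError("toy heap exhausted")
        HEADER.pack_into(self.mem, start, MAGIC, size)
        self.blocks.append(start)
        self.top = start + HEADER.size + size
        return start + HEADER.size   # offset of the payload

    def write(self, payload_offset, data):
        # Deliberately unchecked, like a raw memory copy: nothing stops a
        # caller from writing past the end of its own block.
        self.mem[payload_offset:payload_offset + len(data)] = data

    def check(self):
        """Return the header offsets whose marker no longer matches."""
        bad = []
        for off in self.blocks:
            magic, _size = HEADER.unpack_from(self.mem, off)
            if magic != MAGIC:
                bad.append(off)
        return bad


heap = ToyHeap(64)
a = heap.alloc(8)             # payload occupies bytes 8..15
b = heap.alloc(8)             # header occupies bytes 16..23
heap.write(a, b"x" * 8)       # fits exactly, nothing is damaged
clean = heap.check()
heap.write(a, b"x" * 12)      # 4 bytes too long: overwrites b's marker
corrupted = heap.check()
```

The overlong write itself raises no error. The damage only shows up when `check()` walks the headers afterwards, which is roughly why these bugs surface long after the code that caused them has finished running.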
We have had a dedicated team working on this problem since the first crash, including most of our infrastructure programming team, myself, Stephen D. (back in the CTO chair!) and other programmers we pull in as needed. We have been trying to reproduce the crash, studying all code changes from the last couple of months, spending days examining crash dumps, and adding new code to help diagnose the problem in every patch.
With every crash we gather more information and get closer to a solution. Today's downtime ran long because we decided to debug the live database when it crashed, rather than examining a crash dump after the fact. That let us gather more information and test some theories in ways a dump does not allow, which gets us to a solution faster.
I just wanted to let you all know that we are working hard on the problem (it is almost midnight here!) and are making progress towards a solution. We have the resources we need, namely the people who know the code the best.