Lt. Commander
Join Date: Dec 2007
Posts: 120
# 1
03-09-2012, 01:19 AM
The free-to-play launch of STO was supported by a substantial upgrade to the hardware used to run the live game, and that hardware is more than adequate to support our current user load, which is substantially higher than it was before the F2P launch.

We have had six database crashes over the last month, each of which brought the game down. In three of those cases we were able to deploy an already scheduled patch at the same time and cancel a scheduled downtime.

The crashes are occurring because of a software bug which is corrupting a data structure used to keep track of large chunks of memory that are being kept around for future use (the term of art among programmers is "Heap Corruption"). These types of bugs are often the most subtle and difficult to track down in large software systems. The crash only occurs after 3-4 days of running under heavy load, and despite our best efforts we have been unable to reproduce it in the lab.
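
To make that more concrete, here is a minimal sketch of the failure mode (the struct layout, names, and sizes are made up purely for illustration; this is not our actual allocator code). A small out-of-bounds write silently tramples the bookkeeping header of the next cached block, and nothing goes wrong until that bookkeeping is used again much later:

[code]
#include <cstddef>
#include <cstdio>
#include <cstring>

// Invented free-list bookkeeping: each cached block starts with a header
// linking it to the next block being kept around for future use.
struct BlockHeader {
    BlockHeader* next;   // next cached block
    std::size_t  size;   // usable payload bytes that follow this header
};

int main() {
    // Two adjacent blocks carved out of one slab, as an allocator might do.
    alignas(BlockHeader) unsigned char slab[2 * (sizeof(BlockHeader) + 32)];
    BlockHeader* a = reinterpret_cast<BlockHeader*>(slab);
    BlockHeader* b = reinterpret_cast<BlockHeader*>(slab + sizeof(BlockHeader) + 32);
    a->next = b;       a->size = 32;
    b->next = nullptr; b->size = 32;

    // The bug: writing 40 bytes into a 32-byte payload overruns block A and
    // overwrites block B's header. The heap bookkeeping is now corrupt...
    unsigned char* payloadA = slab + sizeof(BlockHeader);
    std::memset(payloadA, 0xFF, 40);

    // ...but nothing crashes here. The crash only happens much later, when
    // the allocator walks the list and follows the trashed 'next' pointer.
    std::printf("A->next = %p (still valid)\n", static_cast<void*>(a->next));
    std::printf("B->next = %p (garbage; following this later is the crash)\n",
                static_cast<void*>(b->next));
    return 0;
}
[/code]

The nasty part is that the bad write itself is perfectly "legal" as far as the hardware is concerned, so the symptom only shows up wherever the corrupted bookkeeping happens to be used next, which is why the crash can take days of heavy load to surface.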

We have had a dedicated team working on this problem since the first crash, including most of our infrastructure programming team, myself, Stephen D. (back in the CTO chair!), and other programmers we pull in as needed. We have been trying to reproduce the crash, studying all code changes from the last couple of months, spending days examining crash dumps, and adding new code to every patch to help diagnose the problem.
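
As an example of the kind of diagnostic code that can ship in a regular patch (again, an invented sketch rather than the live server's actual code): stamp a canary value into each cached block's header and verify the whole list periodically, so corruption is caught near where it happens instead of days later when the bad pointer is finally followed.

[code]
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Invented example of a shippable heap diagnostic: each cached block's header
// carries a canary that should never change; a cheap walk of the free list
// flags the first node whose canary has been overwritten.
struct BlockHeader {
    std::uint64_t canary;   // must always equal kCanary
    BlockHeader*  next;     // next block being kept around for future use
    std::size_t   size;
};

constexpr std::uint64_t kCanary = 0xFEEDFACECAFEBEEFULL;

// Returns the index of the first corrupted node, or -1 if the list looks clean.
// On a live server this would log details and capture a dump, not just print.
int ValidateFreeList(const BlockHeader* head) {
    int index = 0;
    for (const BlockHeader* cur = head; cur != nullptr; cur = cur->next, ++index) {
        if (cur->canary != kCanary) {
            std::fprintf(stderr, "heap check failed at free-list node %d\n", index);
            return index;
        }
    }
    return -1;
}

int main() {
    BlockHeader b{kCanary, nullptr, 32};
    BlockHeader a{kCanary, &b, 32};
    std::printf("clean list   -> %d\n", ValidateFreeList(&a));   // prints -1

    b.canary ^= 0xFFu;                                           // simulate corruption
    std::printf("corrupt list -> %d\n", ValidateFreeList(&a));   // prints 1
    return 0;
}
[/code]

The idea is to trade a little CPU for catching the corruption closer to the code that caused it.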

With every crash we gather more information and get closer to a solution. The reason for the extended downtime today was that we decided to debug the live database when it crashed, rather than examining a crash dump after the fact. This let us gather additional information and test some theories in ways a dump alone does not allow, which should get us to a solution faster.

I just wanted to let you all know that we are working hard on the problem (it is almost midnight here!) and are making progress towards a solution. We have the resources we need, namely the people who know the code the best.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 2
03-09-2012, 01:41 AM
Quote:
Originally Posted by Rehpic

I just wanted to let you all know that we are working hard on the problem (it is almost midnight here!) and are making progress towards a solution. We have the resources we need, namely the people who know the code the best.
Interesting problem. We were all thinking there weren't enough hamsters, but that's not the case. I hope you guys are able to figure out the corruption issues soon.

Until then, Engage the Emergency Caffeine Generators! Sounds like you need it.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 3
03-09-2012, 01:44 AM
Quote:
Originally Posted by Rehpic
The free-to-play launch of STO was supported by a substantial upgrade to the hardware used to run the live game, and that hardware is more than adequate to support our current user load...
Thanks for the update and letting us know what's going on.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 4
03-09-2012, 01:53 AM
Quote:
Originally Posted by Rehpic
The free-to-play launch of STO was supported by a substantial upgrade to the hardware used to run the live game, and that hardware is more than adequate to support our current user load...
All your hard work is appreciated. Thanks for the update.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 5
03-09-2012, 02:10 AM
I had a feeling that it was not a hamster issue.

I knew you guys kept them well fed.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 6
03-09-2012, 02:25 AM
My suggestion would be to check all employees' ears to see if they are pointy.

Thanks for the update, it's good to know what's going on.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 7
03-09-2012, 02:50 AM
Thank you for the update, Rehpic. We appreciate all the hard work and effort you guys put into the game. Good luck finding that bug (have you checked the Gamma Quadrant?)!
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 8
03-09-2012, 03:03 AM
Quote:
Originally Posted by Rehpic
The free-to-play launch of STO was supported by a substantial upgrade to the hardware used to run the live game, and that hardware is more than adequate to support our current user load...
This makes it easier to be patient. I thank you for all your work on this matter and I really appreciate your team's long hours. Hope you squash this bug soon so you can get back to a more normal schedule.
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 9
03-09-2012, 03:04 AM
Quote:
Originally Posted by Rehpic
The crashes are occurring because of a software bug which is corrupting a data structure used to keep track of large chunks of memory that are being kept around for future use (the term of art among programmers is "Heap Corruption"). These types of bugs are often the most subtle and difficult to track down in large software systems. The crash only occurs after 3-4 days of running under heavy load, and despite our best efforts we have been unable to reproduce it in the lab.
Let us always remember these eternal wisdoms:
  • Memory Management is too important to be left to the computer
  • Memory Management is too important to be left to the programmer
Lt. Commander
Join Date: Dec 2007
Posts: 120
# 10
03-09-2012, 03:16 AM
Quote:
Originally Posted by Rehpic
The free-to-play launch of STO was supported by a substantial upgrade to the hardware used to run the live game, and that hardware is more than adequate to support our current user load...
For once these guys actually listened and took the server down. The situation just before this was getting a bit frustrating for a number of players, and the overall reliability has been very poor. Hopefully this whole issue can go away soon; perhaps take the server offline once more to fully diagnose it once the cause is identified. It takes no guessing to know you guys are working on the issue. For once, thanks for getting around to it.