August 2007 - Posts

13 disasters for production website and their solutions

When we first went live with Pageflakes back in the year 2005, most of us did not have experience with running a mass consumer high volume web application on the Internet. In our first year of operation, we went through all types of difficulties a web application can face as it grows. Frequent problems with software, hardware, and network were part of our daily life. We have overcome a lot of obstacles and established ourselves as one of the top most Web 2.0 applications in the world. From a thousand user website, we have grown to a million user website over the years. We have learnt how to architect a product that can withstand more than 2 million hits per day and sudden spikes like 7 million hits on a day. We have discovered under the hood secrets of ASP.NET 2.0 that solves many scalability and maintainability problems. We have also gained enough experience in choosing the right hardware and Internet infrastructure which can make or break a high volume web application. In this article, you will learn about 13 disasters than can happen to any production website anytime. These real world stories will help you prepare yourself well enough so that you do not go through the same problems as we did. Being prepared for these disasters upfront will save you a lot of time and money as well as build credibility with your users.

We have gone through many disasters over the years. Some of them are:

  1. Hard drive crashed, burned, got corrupted several times
  2. Controller malfunctions and corrupts all disks in the same controller
  3. RAID malfunction
  4. CPU overheated and burned out
  5. Firewall went down
  6. Remote Desktop stopped working after a patch installation
  7. Remote Desktop max connection exceeded. Cannot login anymore to servers
  8. Database got corrupted while we were moving the production database from one server to another over the network
  9. One developer deleted the production database accidentally while doing routine work
  10. Support crew at hosting service formatted our running production server instead of a corrupted server that we asked to format
  11. Windows got corrupted and was not working until we reinstalled
  12. DNS goes down
  13. Internet backbone goes down in different parts of the world

This article of mine explains all these disasters and gives you the solutions:

http://www.codeproject.com/install/13disasters.asp

Please vote for the article if you like it.

Posted by omar with 10 comment(s)
Filed under: