Whole farm is down because timer jobs are not running
One of my clients managed to take his entire farm offline this week by upsetting the timer service. First, a little background: currently they are scrambling to get SharePoint back to a happy state. Why? Well, as happens with lots of customers, SharePoint is too successful. When we originally set up their farm and upgraded from SPS2003 to MOSS 2007, they had about 20 GB of content that was growing at a very controlled pace. Fast forward a little more than a year and their content database is about 320 GB. YIKES! Even scarier, most of their data is in one site collection. This is bad, very bad! Typical guidance is that your content databases should be less than 100 GB.
This growth has forced them to move the databases to different drives, and another issue required a database restore. Well, anytime you want to move SharePoint databases around you should run the command stsadm -o preparetomove, as documented by Cory Burns in the post "Detaching databases in MOSS". If you don't, you will start getting sync errors once an hour, such as:
Failure trying to synch web application 09a21da5-4485-4b00-8268-772aea7fea12, ContentDB 65301403-c277-4b4c-ad5a-e822572d10ea: A duplicate site ID 3b3a4372-aa91-4e0c-ba57-2567958d81bb(http://portal/sites/test1) was found. This might be caused by restoring a content database from one server farm into a different server farm without first removing the original database and then running stsadm -o preparetomove. If this is the cause, the stsadm -o preparetomove command can be used with the -OldContentDB command line option to resolve this issue.
Cory then goes on to explain how to fix it using stsadm -o sync. This is where my client was. He ran this command, but for some reason (possibly because he specified the wrong switches and accidentally deleted a content db) the command hung for a long period of time, and the portal users were unable to access the environment. So he killed the stsadm process. From that point all hell broke loose.
For several hours they attempted a lot of fixes found on the web. One of the fixes had them rename the folder located at C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\<guid>\. This was a bad option. The folder contains XML files for all of the timer job definitions that need to be run. The idea was that renaming the folder would cause SharePoint to create a new, empty copy of the folder, start creating the XML files again, and get back to work. Nope, that isn't how it works. What they needed to do was delete all of the XML files and leave the folder alone. Then, when they restarted the timer service, the proper XML files would have magically reappeared.
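If you would rather script that cleanup than do it by hand, here is a minimal sketch in Python. It is only an illustration, not something we ran at the client: the function name clear_config_cache_xml is made up, and it assumes you pass in the full path to the <guid> folder yourself and that the timer service is stopped while it runs. It removes only the *.xml files sitting directly in the folder and leaves the folder and everything else in it alone.

```python
from pathlib import Path


def clear_config_cache_xml(config_dir: Path) -> int:
    """Delete the *.xml files directly inside config_dir.

    Hypothetical helper for illustration only. The folder itself and any
    non-XML files in it are left untouched. Returns the number of files removed.
    """
    if not config_dir.is_dir():
        raise NotADirectoryError(f"Not a folder: {config_dir}")

    removed = 0
    for entry in config_dir.iterdir():
        # Match the extension case-insensitively; skip subfolders entirely.
        if entry.is_file() and entry.suffix.lower() == ".xml":
            entry.unlink()
            removed += 1
    return removed
```

You would call it with the Config\<guid> folder for your own farm, then restart the timer service and watch for the XML files to come back.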
Hope this helps you
Shane
SharePoint Consulting