100% CPU, 100% IO, a near death experience for SQL Server 2005 and us

For last two weeks we were having pretty hard time at Pageflakes. The database server was having 100% CPU, and 100% IO usage most of time. It was running hot, almost about to go to a coma. It's a 64 bit Dual Core Dual Xeon Dell server with 2 GB RAM running around 30 GB database on 4 SCSI drives. So, it's more or less the best hardware money can buy (except for RAM ofcourse). But still the performance counter looked like this:

Pretty horrible situation isn't it? Users are having connection timeout on their browser. Event Log is full of "SQLConnection: Timeout". User's can't see their page. Those who can see their pages, have serous poor performance and slow response from the server. You can imagine the rest. Email flood, phone calls, management screaming on your ears etc etc.

After a lot of diagnostics, we came to conclusion that SQL Server 2005 is the culprit. So, we ran SQL Profiler on a different server (don't ever run SQL Profiler on the same server where you database is running). We saw there's an SP which is taking thousands of read and cpu cycles.

So, we were pretty much sure that SP was the culprit. So, we took one of the long executing ones and ran it through Execution Plan to see what goes wrong.

So, we first did this in order to see IO usage during the SP execution:


set statistics IO on

GO

Here's the output after we run the SP:

Table 'RSSItem'. Scan count 1, logical reads 40, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Nothing special. Runs pretty quick, very low read count. Looks like no read count at all, completely cached and returned from memory. So, SQL Server 2005 is doing pretty good caching on it. After this, we verified whether the index was being used properly or not:

'Clustered Index Seek' the best possible thing. If it was Clustered Index Scan, then we would have thought there's something wrong. So, we were making best use of Index also. We were pulling our hairs out.

So, I decided to look at the SP in order to see if I can find out something. I stared at this for half an hour:


ALTER PROCEDURE 
[dbo].[prcRSSItemGetByChannelIDPageSize]

    @ChannelID 
int
,

    @StartIndex 
int
,

    @PageSize 
int

AS
 

WITH 
SelectedRSSItems(
[ID],[ChannelID],[Hash],...)

AS

(

    
SELECT


        [ID],

        [ChannelID],

        [Hash],

        [Title],

        [Guid],

        [Description],

        [EncodedContent],

        [Link],

        [PublishDate],

        [XML],

        [SavedCount],

        ROW_NUMBER() 
OVER

(
ORDER
BY 
[PublishDate]

DESC
)

AS

[RowNumber]

    
FROM

        dbo.[RSSItem]

    
WHERE

        [ChannelID] = @ChannelID 

)
 

SELECT 

        *

FROM

        SelectedRSSItems

WHERE

        RowNumber 
BETWEEN

(@StartIndex
+ 1) 
AND

(@StartIndex
+ @PageSize)

ORDER BY 
PublishDate

desc

Those who are wondering what's with the "WITH" block. This is called Common Table Expression (CTE). This is a new feature in SQL Server 2005 which is the best so far for paging. Previously we had to page rows by doing any of the following:

  1. Copy primary key in a temporary table after sorting and use IDENTITY column to generate row number. Then select rows by joining primary key from temporary table to the actual table. The paging is done based on row number on the temporary table. Very expensive. You end up creating temporary table on every call to the SP. Too many IO on tempdb.
  2. Use subquery. Very complicated SQL. Not so good result.
  3. Use cursor to skip rows and select only those which fall within the page. It was the best solution so far.

Now you have the mighty ROW_NUMBER() function which we all have been waiting for 10 years!

Back to our disaster, you see the SQL is nice and clean, nothing suspicious. But SQL Server 2005 is drowning taking us and a thousands of users all over the world with it.

Finally, it struck my mind! SQL Server is doing something on the PublishDate field because it needs to sort the rows based on it.

        ROW_NUMBER() OVER ( ORDER BY [PublishDate] DESC ) AS [RowNumber]

What if I make a new Non-Clustered index on ChannelID and PublishDate? The query filters using ChannelID and sorts using Publishdate. So, it should get everything it needs from the index.

I created a brand new index on ChannelID and PublishDate. It took 12 mins to create the index. Then the CPU usage looked like this:

We are saved! We are back in business!

So, what we learn from this experience? We learn nothing.

This blog post is dedicated to my friend Shahed who wrote the above SP two weeks ago and gave us a lot of fun all these days.
Published Tuesday, August 15, 2006 1:47 PM by omar
Filed under:

Comments

# re: 100% CPU, 100% IO, a near death experience for SQL Server 2005 and us

Tuesday, August 15, 2006 10:03 AM by Shafqat Ahmed
Excellent find! Saved the day.

# re: 100% CPU, 100% IO, a near death experience for SQL Server 2005 and us

Friday, August 18, 2006 11:16 AM by Dennis Gorelik
1) You learned from this experience (and I from your post) how properly create indexes for "SELECT WITH" queries.
2) How long did it take to write this post after you found the problem?

Leave a Comment

(required) 
(required) 
(optional)
(required)