Primary Keys and GUIDs

Published Fri, Jan 13 2006 1:26 | GregLow

This morning, we've been having a discussion on the Aus Dot Net mailing list about Primary Keys and GUIDs.

Rocky asked my thoughts on this. Greg Linwood made a good post on it and my thoughts were pretty much in line with his.

However, I also place significant value on the ability for the GUID values to be generated in a middle-tier. The removal of the need for a database traversal can in many cases be very beneficial to performance.

But, you sure don't want a clustered primary key that's a GUID. The advice on what makes a good clustering key has changed over the years.

Many people would remember the days when the SQL Server recommendation was for wide-duplicated values (like names) used in range queries. That was because the product didn't have row-locking and they were specifically trying to avoid "hot-spots" caused by identity-style columns, where everyone was trying to insert on the same page at the same time.

Since row-locking has been in the product, such a "hot-spot" is now a good thing as it means the page that's being worked on will very likely already be in memory. So, currently, the performance preference is small and monotonically increasing clustering keys. Identity columns work well for that.  (See Kim Tripp's sessions on indexing from PASS if you have access to them).

However, the more I head into distributed systems territory, the keener I am on GUIDs. Identity columns tend to be more of a single-server solution, unless you implement some form of dispenser proc and that usually slows things down quite a bit.

There was discussion on providing a new data type in SQL Server 2k5 that would have been a monotically increasing GUID. It appears to have got the bullet but I wouldn't doubt it might reappear soon. They'd sure have to be pretty careful with that sort of algorithm.

Others have implemented variations on this, like Jimmy Nilsson's oft-quoted article:

http://www.informit.com/articles/article.asp?p=25862&rl=1

where he talks about the cost of GUIDs as primary keys and proposes a variation he calls a COMB. Worth reading.

If disk space usage is a big concern for you, the extra 8 or 12 bytes per row might concern you. I'm much less concerned about that or its overhead in processing than I used to be. I usually find bigger problems in the data model that will have bigger impacts than that :-)

If space doesn't concern you so much, another approach that I've seen used successfully to speed join performance when using GUIDs in foreign key references is to have a hash value (of the GUID) generated as a persisted calculated column and then joins are done using both the hash and the GUID. In most cases, the optimiser does most of the join work on just the hash value.

But then the queries look odd to me.