For those of you who read my blog using RSS, you probably won’t even see this post, but either way.

I’ve decided to move my blog to sqlblog.com. I have reposted some of my recent posts, but will not be posting any further content here at msmvps.com.

Both the msmvps.com feed and the sqlblog.com feed use feedburner.com, so the old one will continue working, but I recommend changing to the new one in case my blog disappears from msmvps one day.

So farewell, msmvps.com (and thanks Susan!), it’s been fun. I’m sure I will still have many posts which link back here…

Posted by Rob Farley | with no comments

Lots of blog posts for this month, for the first T-SQL Tuesday to leave the safe-haven of Adam Machanic’s blog. Some people obviously missed out, probably because they don’t read this blog, but I guess that’s the nature of the meme. I don’t know who is hosting next month yet, but I’ll be looking out for Adam to post something about it in early March.

All the posts had to appear as trackbacks or comments on the invitation to participate, but this post is a short summary for posterity.

As the second week of February involves Valentine’s Day (and a few days earlier, my wedding anniversary), I thought the topic of Relationships would be a nice one for this event. There was a good range of topics too, which I have ordered by the type of relationships chosen.

Special mention:

It wasn't quite on the right day, but a big congratulations to Michelle Ufford who almost managed to time the birth of her new baby Chloe for T-SQL Tuesday.

We have “purely technical” posts:

Brad Schulz is keeping the bar way too high, and if you haven’t read his pieces from previous Tuesdays, then I recommend you go through the history of posts on his blog. This month he has written a letter of disappointment to the FROM clause. At some point Brad will likely be asked to compile these posts into a book, but until that happens, you’ll have to follow the link to his blog. It’s entertaining, but still fits in the “purely technical” category.

John Dunleavy demonstrated (complete with screenshots – something I should put more of in my posts) how foreign keys can be made so easily using the Diagram part of Management Studio. It’s not something I do much of, but I have to admit that reading posts like John’s can often inspire me to change my ways.

Bryan Smith also talked about database diagrams, and how they can be used to discover relationships in a system.

Michael Coles officially missed the deadline, but I’m going to link to him anyway; he demonstrates a nice trick for creating a Product aggregate based on the relationship between a number and its logarithm.

Allen White wrote about the fact that any RDBMS should have relationships to really be considered relational. Great reminder of some of the basics.

Marco Russo timed a piece on relating tables in DAX amazingly well, and only realised that it qualified for T-SQL Tuesday after he had initially posted it. Useful piece, which will appear in search engine results for years to come I’m sure.

Rob Farley wrote a pile of rubbish… hey, that’s me! I wrote about the importance of relationships in a database system to help the Query Optimizer do its job. I also surmised that Foreign Keys using candidate keys (rather than the primary key) might help many queries do away with needing joins. If you have thoughts on that I’d still like to hear them.

Some posts that were about less technical relationships:

Jen McCown wrote about how she left her husband Sean, in a piece called ‘I love you, I quit’. Actually, she was just leaving a job, but it’s a nice piece about the degradation of an employer-employee relationship. It rings too true for all of us I think, and I hope that as an employer I manage to ‘keep the mystery’ for my employees to stop them going through that same experience.

Jason Brimhall wrote about some of the different relationships in his life, particularly how he wants to make sure that the Parent-Child relationship in his life doesn’t become a Foreign relationship. Nicely done Jason.

Allen Kinsel recommends deepening relationships with professional organisations, and points out that this can be very beneficial. I’m sure he’s right. I have a tendency to get extra-involved in groups, and I hope Allen’s sentiments are heard by many.

The link to Steve Jones’ article must’ve changed. The link that I followed today didn’t work, even though I’d read it successfully a couple of days ago. Still, I managed to find it, and I can thoroughly recommend reading about the relationships between Steve and his colleagues. He’s most definitely correct in pointing out that any effort in developing personal relationships with your colleagues will help you get things done!

And some “combination” posts:

Mike Walsh provided the first piece of the day, with an excellent run down of various types of relationships that are important, including a recommendation to read up on database design. Great advice, Mike. Thanks.

With Kalen Delaney’s post we almost got two for the price of one. A brilliant prelude talking about some of the personal relationships that have enhanced her SQL career, followed up by covering how foreign key relationships have developed through the versions of SQL Server. Kalen’s blog posts are always worth reading, as I’m sure everyone who is reading this post appreciates.

Stuart Ainsworth’s piece was on Maslow, drawing parallels between the hierarchy of needs of a person and the hierarchy of needs of a database system. It’s thought-provoking, and something that I feel could be made into a poster for database developers’ walls.

In summary

I’d like to thank everyone who has taken part, and to thank Adam for having introduced the T-SQL Tuesday concept to us. Keep your eye on his blog to find out what’s going on next month. If your name isn’t listed here, then I encourage you to write something for March.

Also remember that lots of these people are on Twitter and are very much followable. Look at the hashtag #tsql2sday for related posts, and make sure you follow the people who post blogs for these events.

Posted by Rob Farley | with no comments

I’m feeling the pressure for this month’s T-SQL Tuesday, probably because I’m also the host. I’ll be posting a roll-up for it soon too, which I’m sure will be great fun researching.

Given that the topic is relationships, and the main SQL Server database engine is a relational one, relationships are incredibly relevant to databases. The idea behind an RDBMS is that keys are used to refer to entities. This leads to foreign keys, where a particular column (or set of columns) is constrained to values which appear in another table, so that referential integrity is enforced. And yet there are so many database designs out there that do not have relationships defined between tables. I shudder when I find them, but that doesn’t make them any less commonplace.

In data warehouses, designed primarily for reporting systems rather than transactional systems, tables are often designed in a more denormalized manner, to avoid needing to perform so many joins to access the data required for reports. This involves having tables with many extra columns, containing data that would otherwise be stored in other tables. Fewer tables are used, and therefore the system has fewer relationships.

I sometimes wonder how this should affect relational database design. I have written before about the fact that the Query Optimizer can leverage foreign key relationships to simplify queries, noticing that joins are redundant and removing them from plans, but to summarise:

A foreign key relationship is between a column (or set of columns) in a table to a unique key (typically the primary key) in another table (which could even be the same one). Because of this uniqueness, the relationship can map to at most one record on the other side. And because the foreign key relationship enforces referential integrity, it must map to exactly one record on the other side (a caveat being that the foreign key column(s) could be configured to allow NULLs, which won’t map). Therefore, a join that doesn’t actually select data from any of the columns in the second table might be able to be simplified out of the query completely, as if the query didn’t have the join at all. But read my other post for more on that.
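
A minimal sketch of that idea (dbo.Sales and dbo.Product here are hypothetical names, invented for illustration rather than taken from any particular database):

-- Assumes dbo.Sales.ProductID is NOT NULL and has a trusted foreign key to dbo.Product(ProductID)
SELECT s.SalesID, s.Quantity
FROM dbo.Sales AS s
JOIN dbo.Product AS p ON p.ProductID = s.ProductID;
-- No columns from dbo.Product are returned, and the foreign key guarantees exactly one match,
-- so the Query Optimizer can remove the join from the plan entirely (check the execution plan).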

Thinking about many of the queries that I’ve written over the years, I know that I often only want one field from the table I’m joining. For example, I might want to get a ProductName when I have a ProductID, or I might want the Login from a UserID. A standard Lookup situation, which in a data warehouse would often be handled by storing the Name in a dimension table rather than the ID.

So my surmising leads me to this question:

If there is a unique index on the field that I typically want to lookup from a table, does this make it a better candidate for foreign key relationships than the primary key, so that the system can avoid needing as many joins?

This screams against everything I ever learned about relational databases. It would obviously make for a larger row, but if this were offset by performance gains in querying, could it be worthwhile? Maintaining referential integrity based on a string field may be more costly than on an integer field, but I’m wondering if the impact on SELECT queries through the reduction of the number of joins required might not make it a worthy consideration.

Please note: I’m not saying it’s necessarily a good idea – I’m surmising here, and would love to hear comments about whether or not other people have tried it, what circumstances they were trying to handle, and whether or not the idea worked.

But back to the Query Optimizer…

The QO needs information to be able to work out how to run your query. It needs statistical information about the columns that are filtered; it needs to be able to figure out which parts of the query can utilise indexes effectively (see my recent posts about SARGability); but also very significantly, it needs to have information about the relationships between tables.

Any time you write a query that involves a join (which I imagine most of your queries do), the system can use information about the relationship between the two tables. If it is defined as a foreign key relationship that doesn’t allow NULL values, then the system knows that each record in the ‘child’ table must match exactly one record in the other table. The Query Optimizer can use this information, and it should be available. Any time you’re thinking of not putting enforced relationships in your system, consider the fact that you’re blinding the Query Optimizer and ultimately making it do more work.

Put your relationships into your database design. Put constraints where they can apply, mark items as unique. The Query Optimizer will thank you for it, expressing its thanks as better performance.

Posted by Rob Farley | 4 comment(s)

It’s time for the third of Adam Machanic’s T-SQL Tuesdays, and this time, I’m the host. The first one, last December was on the topic of date/time, and the second was on Puzzling Situations. Check them both out, along with the round-ups that Adam wrote about them. Lots of great topics, which is starting to make me anticipate the content that comes out on the second Tuesday of each month.

As an early volunteer to host, I have been given the honour of being the first person other than Adam chosen to host one. I’d like to claim that this implies some sort of special relationship between myself and the SQL community as a whole, but it’s actually just a “first-in, best dressed” policy – although the ‘best dressed’ analogy is lost on me.

Theme

Valentine’s Day is coming up. Hopefully I don’t need to tell you that it’s on February 14th, but if you’ve read this far into the post then perhaps you’re involved with databases for some reason and may need reminding. Shopping centres around the world have signs up reminding us to buy flowers for our loved ones, but I know many people in IT circles who don’t tend to go to such places, lurking in dark corners of houses until all hours of the night, surviving on pizza. Hopefully this theme will not only prompt some interesting posts, but also prompt people to go out and invest in the meaningful relationships in their own lives. Actually, if you don’t know that Valentine’s Day is February 14th, I’m guessing you don’t have anyone in your life worth buying for. ;)

For me, Valentine’s Day is only three days after my wedding anniversary, so I can’t forget either – as if I would.

So the theme for this month’s T-SQL Tuesday is Relationships.

There are a massive number of options you could go with for this theme. You could talk about Foreign Keys in the relational world. You could wax lyrical about the benefits of attribute relationships in cube design. You could write a poem for your loved one, apologising for all those hours spent in front of a Management Studio window, trying to tune a query, rather than tuning your guitar to serenade her.

Other ideas include: Relationships between Devs & DBAs, Clients & Vendors, Entities, data types, concepts (eg: Report Model & Cube), and more… if you’re struggling to think of something, drop me a line (twitter, Msgr, email, whatever – a list of contact options is over on the left) and I can help.

But so long as you can loosely tie your post to both the theme and some aspect of SQL Server, that’s fine. Be creative, informative, reflective, and hopefully relevant.

Rules

Please note that the time zone for this Tuesday is UTC. For me, that means between 10:30am Tuesday and 10:30am Wednesday. For you, it might mean some time on Monday afternoon to some time on Tuesday afternoon. If you’re lucky (read ‘English’), then you can publish your post any time on Tuesday. It’s about when it’s published though, not when you write it. I encourage you to write your post in advance, in case you’re busy on the day.

So the rules are:

  • Your post must go live after 00:00:00 UTC on Tuesday, February 9, 2010, but before 00:00:00 UTC on Wednesday, February 10, 2010
  • Your post must link back to this one, and it's recommended that you clearly identify the post as a T-SQL Tuesday post
  • You are responsible for ensuring that a trackback or comment appears here so that I can find the posts

Follow the rules, and your post will be included in the roundup to be posted a day or two later. Don't follow the rules, and it won't show up there. Simple as that!

Twitter

Follow the event on Twitter by watching for the #TSQL2sDay hash tag. (The ‘2’ refers to the fact that it’s the second Tuesday of the month, and nothing to do with the pronunciation of the word, which is more like “Choose-day” where I come from.)

Additional Notes

Please make sure you put a link in your post to this one, and post a comment here if the trackback doesn’t appear. I normally moderate comments to my blog (to avoid spam), but may consider changing that policy for 24 hours next week. I have noticed people trying to take part, but failing to make sure that a trackback/comment has appeared. So please check this.

Please check the time zone. It’s something that we’re used to here in Australia, but I know a few people missed out last month not realising that the event had shifted to UTC.

Let Adam Machanic know if you want to host. You can contact him in a variety of ways, as he mentioned last time. To host, you must have participated in two previous T-SQL Tuesday events, and your blog must have had at least one post a month for the prior six months. Let him know you’re keen even if you don’t meet these criteria, as I believe there’s a list, and you might be able to get those blog posts sorted before your turn is up. When I volunteered to host, I hadn’t participated in any yet…

If you want to host but don’t have the faintest clue on what topic to use, ask Adam for some ideas. I’ve suggested a few to him, and I think he’s putting a list together to maintain the longevity of all this. Similarly, let him know if you have some ideas – you might suggest something that catches people’s imagination like never before.

Feel free to contact Adam or myself if you have any comments or ideas for all this. I’ll probably refer you to Adam if it’s a general thing, or happily do my best to answer you if it’s about this month’s event in particular.

Have lots of fun! I look forward to reading your posts!

Edited: The round-up is at http://msmvps.com/blogs/robfarley/archive/2010/02/13/t-sql-tuesday-003-relationships-the-round-up.aspx

Posted by Rob Farley | 18 comment(s)

A recent discussion on Twitter about a query that Denny Cherry was looking at led to this post by Josef Richberg: http://josef-richberg.squarespace.com/journal/2010/1/28/is-a-case-statement-considered-sargable.html, and I thought it might be worth going through a few points on the topic of SARGability. Particularly given that I wrote a related post recently myself.

If something is SARGable, it means it can be used as a Search Argument – that is, it can be used effectively in conjunction with an index.

To repeat a common analogy, trying to find “Farley” in the phonebook is a Seek (I can pretty much go straight there), whereas trying to find names which end in “arley” is not. Those names could be just about anywhere.

The way that an Index Seek works is that the system can go to a record and tell whether the row(s) it's looking for is to the left of the current value or to the right. Looking for "F" is easy. I open the book at "M" and know that I must go left. I open the book at C and know that I must go right. It's SARGable.

As Josef writes, applying a function to a field stops it being SARGable. If we wanted to say WHERE RIGHT(Surname,5) = 'arley' we would quickly see a performance problem, as finding these records would mean starting with the Aardvarks (in case there was an ‘Aardvarkarley’), and ending with the ZZzzzs (in case there was a ‘ZZzzarley’). On the other hand, looking for WHERE Surname LIKE 'Farl%' is quick, because a wildcard that doesn’t appear at the start of the pattern doesn’t affect the SARGability.
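
In query form, against a hypothetical dbo.PhoneBook table with an index on Surname (the names are just for the sketch):

-- Not SARGable: RIGHT() has to be evaluated against every row, so the whole index is scanned
SELECT Surname FROM dbo.PhoneBook WHERE RIGHT(Surname, 5) = 'arley';

-- SARGable: the leading characters are known, so this becomes a range seek on the index
SELECT Surname FROM dbo.PhoneBook WHERE Surname LIKE 'Farl%';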

At first glance, an ON clause such as:

ON g.GroupId = CASE
             WHEN c.GroupId <> l.GroupID
               AND l.GroupID <> @AccGroupID
             THEN l.GroupID
             ELSE c.GroupID
             END

…would seem to be nasty for SARGability. We’re not just looking for a row where GroupId = AParticularGroupID, we’re looking for something that is the result of a CASE expression, which is going to change for every combination of rows in the tables ‘c’ and ‘l’.

But actually, it’s a lot more complicated than that, and yet in some ways, a lot more simple.

Let’s think about the term SARGable again. It means that we’re able to use it as a Search Argument – but what is the ‘it’ we’re looking at? What is the thing that is SARGable? (This next bit is the key – so I’ll write it again in a moment, in bold) The thing that is SARGable is the term that is indexed – in other words, g.GroupId. It doesn’t matter how much work is required to work out that value – it’s whether or not you can find it in the index.

SARGability is about the thing that is (hopefully) indexed in the table (or set) that you’re introducing into the query.

In the phonebook, RIGHT(Surname,5) isn’t in the index. No index on Surname is going to help match RIGHT(Surname,5) (unless you want to index the result of that function). But if you wanted to say WHERE Surname = REVERSE('yelraF') then there’s no problem. Despite how nasty REVERSE is, it doesn’t affect what we’re looking up in the book. We can evaluate REVERSE, and then (using the index) find the Surname that matches that. Surname = something is SARGable.

So the CASE clause above is largely irrelevant to finding something by GroupId. The result of the CASE can be easily worked out, and then a seek done on an index on the ‘g’ table.

That was the simple aspect of it.

Consider that you have a diary, and you want to mark a week before some birthdays, so that you can go shopping. Suppose you have Jul 11, Feb 6, Dec 10. You take one of the birthdays, work out the date that is a week before it, then do an Index Seek to find that date in the diary. Then you repeat it twice more. This query might be along the lines of:

FROM people p JOIN diary d ON d.diarydate = dateadd(week, -1, p.birthday)

You'll notice that we happily did an Index Seek on the diary. The dateadd function on p.birthday had no effect on us. The SARGability applies on whichever table/set we're thinking about at the time, which in this scenario is the diary.

But consider that we wrote the query like this, which essentially means the same, looking for birthdays and diary dates that are a week apart, but applying the function to d.diarydate instead:

FROM people p JOIN diary d ON p.birthday = dateadd(week, 1, d.diarydate)

Logically, this means the same. But this is much more suited to finding the birthday once we know the diarydate, rather than the other way around. This might be fine if we had thousands of people to consider, and we wanted to look up a mere 365 diary dates in our list of people (indexed by birthday)... but we'd still be looking up 365 dates. The function still kills the SARGability of the column; it's just a question of which column we want to be considered indexable.

Repeating the matching process over and over like this is known as a Nested Loop. As quick as it might be to find the record because of SARGability, we'd still be doing it 365 times. Things would be much better if we could index both sides in this scenario. Then, we might be able to utilise a different kind of join, such as a Merge Join, which involves running through two ordered sets, comparing the values to find matches in a single pass. But for this to apply, both sides must be SARGable. Consider two tables with numbers in them... one with 1, 1, 5, 10, 10, 200, and one with 3, 4, 5, 6, 7, 8, 8, 10. To do a Merge Join between them, the system will see that the first table starts with 1, while the second one starts with 3. It will skip through the 1s in the first table to the 5, and then start skipping through the second table to find matches. It's very fast, but it needs to be able to know whether the value it's looking for is to the left or right of where it's up to. This would involve an Index Scan on both sides, but it would be quicker than doing lots of Index Seeks. A single seek is faster than a single scan, but something quick done over and over can take longer than using a slow method one time.

Unfortunately, the Query Optimizer currently isn't smart enough to know that dateadd(week, 1, d.diarydate) doesn't change the order of dates in the diary, and that it can easily tell whether it should go left or right. That's the topic of my earlier blog post, and a Connect item to ask that Microsoft fix this.

So should you care about SARGability? After all, if the system is doing an Index Seek anyway, you've got SARGability in play, and it shouldn't matter.

You should try to arm the Query Optimizer with as many options as possible, so that it can use the best plan available, based on the statistics. We might be happy with looking up three diary dates in a Nested Loop, but we wouldn't be happy doing that thousands of times. Think about how YOU would solve the problem without a computer (like my diary analogy, or the phonebook). If your execution plan is similar to your paper-based solution, then you've done well. If it's not, then maybe you should look into it some more and work out if the query is okay or not.

In Josef's situation, the Index Seek was being done 1155 times. That might be okay, but it also might not be. But should the query be rewritten? That's a different question, and one that I couldn't really answer without knowing more. The most important thing about a query is not its performance, but its correctness. If you need to kill the SARGability to maintain correctness, then so be it. You can always handle it with an indexed computed column if you need to, or even an indexed view.
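
As a rough sketch of the computed-column option, reusing the hypothetical dbo.PhoneBook table from earlier:

-- Index the result of the expression via a computed column, and the predicate becomes a seek
ALTER TABLE dbo.PhoneBook ADD SurnameSuffix AS RIGHT(Surname, 5);
CREATE INDEX ix_PhoneBook_SurnameSuffix ON dbo.PhoneBook (SurnameSuffix);

SELECT Surname FROM dbo.PhoneBook WHERE SurnameSuffix = 'arley';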

Posted by Rob Farley | 3 comment(s)

SARGable is an adjective in SQL that means that an item can be found using an index (assuming one exists). Understanding SARGability can really impact your ability to have well-performing queries. Incidentally – SARGable is short for Search ARGument Able.

If you have an index on phone numbers using LastName, followed by FirstName, including the suburb and address fields, you have something akin to the phone book. Obviously it becomes very easy to find people with the surname “Farley”, with the first name “Rob”, but often you want to search for people with the surname “Farley” with the first name beginning in ‘R’. I might be listed as “R Farley”, “R J Farley”, “Rob Farley”, “Robert Farley”, “Robert J. Farley”, or a few other variations. It complicates things even more if you need to find someone with a name that shortens a different way, like John/Jack, or Elizabeth/Betty. This is where SARGability comes into play.

Let’s just think about the First names for a minute.

If you want to find all the names that start with R, that’s easy. They’re all together and you can get to them very quickly. This is comparable to a query in SQL Server like this (taking advantage of the index on the Name column in Production.Product):

select Name, ProductID
from Production.Product
where Name like 'R%' ;

Looking in the Execution Plan, we see an Index Seek to find the 52 rows, and the seek has a Seek Predicate like this (visible in either the ToolTip of the operator, the Properties window, or the XML itself):

Seek Keys[1]: Start: [AdventureWorks].[Production].[Product].Name >= Scalar Operator(N'R'), End: [AdventureWorks].[Production].[Product].Name < Scalar Operator(N'S')

This shows that the system looks at the LIKE call, and translates it into a greater-than and less-than query. (Interestingly, have a look at the End Seek Key if you tell it to find entries that start with Z)

So the LIKE operator seems to maintain SARGability.

If we want to consider Names that have R for the first letter, this is essentially the same question. Query-wise, it’s:

select Name, ProductID
from Production.Product
where LEFT(Name,1) = 'R' ;

Unfortunately the LEFT function kills the SARGability. The Execution Plan for this query shows an Index Scan (starting on page one and going to the end), with the Predicate (not, not Seek Predicate, just Predicate) “substring([AdventureWorks].[Production].[Product].[Name],(1),(1))=N'R'”. This is bad.

You see, a Predicate is checked for every row, whereas a Seek Predicate is used to seek through the index to find the rows of interest. If an Index Seek operator has both a Predicate and a Seek Predicate, then the Predicate is acting as an additional filter on the rows that the Seek (using the Seek Predicate) has returned. You can see this by using LIKE 'R%r'.
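
For example, this query (a small sketch against the same table) gives an Index Seek carrying both kinds of predicate:

select Name, ProductID
from Production.Product
where Name like 'R%r' ;
-- Seek Predicate (roughly): Name >= N'R' and Name < N'S' -- the 'R%' part drives the seek
-- Predicate:                Name like N'R%r'             -- checked against each row the seek returns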

Considering the first part of a string doesn’t change the order. SQL knows this because of the way it handles LIKE (if the left of the string is known), but it doesn’t seem to get this if LEFT is used. It also doesn’t get it if you manipulate a field in other ways that we understand don’t affect the order.

select ProductID
from Production.Product
where ProductID + 1 = 901;

This is doing a scan, checking every row, even though we can easily understand what we mean. The same would apply for this query (assuming there’s an index on OrderDate):

select OrderDate
from Sales.SalesOrderHeader
where dateadd(day,1,OrderDate) = '20040101'
;

And perhaps most significantly:

select OrderDate
from Sales.SalesOrderHeader
where dateadd(day,datediff(day,0,OrderDate),0) = '20040101'
;

…which is widely recognised as an effective method for date truncation (and is exactly why you should compare dates using >= and < instead).
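
For completeness, the SARGable way to express the same filter is a half-open range on the raw column, something like:

select OrderDate
from Sales.SalesOrderHeader
where OrderDate >= '20040101'
and OrderDate < '20040102'
;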

But more interestingly…

…this query is just fine. Perfectly SARGable.

select OrderDate
from Sales.SalesOrderHeader
where cast(OrderDate as date) = '20040101'
;

This query does a little work to figure out a couple of constants (presumably one of them being the date 20040101, and another being 20040102), and then does an Index Seek to get the data.

You see, the date and datetime fields are known to have a special relationship. The date type is essentially the left-most three bytes of a datetime type, and therefore the ordering is identical.

It doesn’t work if you want to do something like:

select OrderDate
from Sales.SalesOrderHeader
where convert(char(8), OrderDate, 112) = '20040101'
;

…but did you really think it would? There’s no relationship between strings and dates.

I wish it did though. I wish the SQL team would go through every function and think about how they work. I understand that CONVERT will often change the order, but convert using style 112 won’t.

Also, putting a constant string on the end of a constant-length string shouldn’t change the order. So really, this should be able to work:

select OrderDate
from Sales.SalesOrderHeader
where convert(char(6), OrderDate, 112) + '01' = '20040101'
;

But it doesn’t.

Interestingly (and a prompt for this post), the hierarchyid type isn’t too bad. It understands that some functions, such as getting the Ancestor, won’t change the order, and it keeps things SARGable. In the Stack Overflow question below, the asker had noticed that GetAncestor and IsDescendantOf are functions that don’t kill the SARGability – basically because the left-most bits of a hierarchyid are the parent nodes.

http://stackoverflow.com/questions/2042826/how-does-an-index-work-on-a-sql-user-defined-type-udt

Spatial types can show similar behaviour.

So I get the feeling that one day we might see the SQL Server team implement some changes with the optimizer, so that it can handle a lot more functions in a SARGable way. Imagine how much code would run so much better if order-preserving functions were more widely recognised. Suddenly, large amounts of code that wasn’t written with SARGability in mind would start running quicker, and we’d all be hailing the new version of SQL Server.

I’ve raised a Connect item about this, at https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=526431

You may have code that would run thousands of times faster with this change. That code may live in third party applications over which you have no control at all. If you think there’s a chance you fall into that bracket, why not go and vote this up?

Posted by Rob Farley | 6 comment(s)

Earlier today you may have seen a blog post of mine about a puzzle involving HAVING. You should read that post before this one. It was part of Adam Machanic’s T-SQL Tuesday meme.

The question was about the query:

SELECT 'No Rows'
WHERE 1=2
HAVING 1=1;

And here’s the explanation.

Start by making yourself a "dual table", like what you'd use in Oracle, and use this instead of having no FROM clause. Put a row in it.

CREATE TABLE dual (dummy bit);
INSERT dual VALUES (1);

--Now count the rows in it
SELECT COUNT(*)
FROM dual;

--Now count how many rows don't match 1=2 (of course, the answer is zero)
SELECT COUNT(*)
FROM dual
WHERE 1=2;

--Naturally we'd get nothing back if we weren't grouping
SELECT 'Something'
FROM dual
WHERE 1=2;

--But HAVING forces the grouping functionality as well (like using COUNT(*))
SELECT 'Something'
FROM dual
WHERE 1=2
HAVING 1=1;

--So in this query, we couldn't put any of our real columns in, only aggregate functions and constants
SELECT *
--Errors
FROM dual
WHERE 1=2
HAVING 1=1;

--And leaving out the FROM clause implies that we're asking all this of a secret internal table with a single row. All these queries work just the same without the FROM clause at all.

--Count the rows in our pretend table (one)
SELECT COUNT(*)

--Now count how many rows don't match 1=2 (zero)
SELECT COUNT(*)
WHERE 1=2;

--Naturally we'd get nothing back if we weren't grouping
SELECT 'Something'
WHERE 1=2;

--But HAVING forces the grouping functionality as well
SELECT 'Something'
WHERE 1=2
HAVING 1=1;

So the answer to the question posed is that you get a single row, containing the text provided. The fact that I used the text 'No Rows' was just a bit of fun.

Now, to remove the trivia a little…

When would you ever use HAVING without GROUP BY in a practical situation?

How about this:

Using sp_MSforeachdb, find the number of objects in non-system databases. It’s an undocumented system stored procedure which runs a query on each database, replacing a question mark in the query with the name of the database. It can be quite handy, just don’t look at how it’s implemented.

EXEC sp_MSforeachdb 'SELECT ''?'', COUNT(*) FROM ?.sys.objects WHERE ''?'' NOT IN (''master'',''tempdb'',''model'',''msdb'');';

But this won't do it. It will still return the entries for the system databases, but with zeroes (because none of the objects satisfied the WHERE clause). Replace WHERE with HAVING and it's just fine – the rows get eliminated from the resultset.

EXEC sp_MSforeachdb 'SELECT ''?'', COUNT(*) FROM ?.sys.objects HAVING ''?'' NOT IN (''master'',''tempdb'',''model'',''msdb'');';

Honestly, HAVING doesn't require a GROUP BY clause. It doesn't require anything. It filters based on groups, and if there are no groups yet, it makes some – like how using an aggregate will count the rows in an empty set and return one row representing that group.

It's generally taught as "HAVING is for filtering based on aggregates", and that's true, but only half the story. And I find that if I'm teaching people to write better queries, I want them to have a thorough understanding of what each construct is really doing.

Posted by Rob Farley | 4 comment(s)

Adam’s hosting another T-SQL Tuesday, for which this post is jumping in. He’s themed it around T-SQL Puzzles, which I found quite interesting, because the world is full of them.

Most of the questions that I answer on forums, help sites, and so on, are puzzles. I guess there’s a difference between “Problem” and “Puzzle”, but I prefer to think of things as puzzles.

For Adam’s meme though, I thought I’d share a Puzzle that I ask students who take my Advanced T-SQL course. The idea is to have them start thinking about what each component of T-SQL is actually doing, so that they can better address problems they face. If you have a rifle, it’s nice to actually know what the various components of it are for, so that you can use it more effectively.

I actually ask them a large number of things, but the one that I thought I’d pose for you all today is about the results of this. The answer will be in the next blog post, which hopefully you haven’t read yet. I will have them both published on Tuesday 12th, this one at the start of the day, and the answer towards the end of the day.

The question is simply a query. Can you predict the output, and explain why? Feel free to comment to your heart’s content, as I will moderate them and only publish them afterwards. In fact, I’ll probably take a few days to get to them (being holiday period), so I apologise if you’re wanting to read what other people thought too.

Naturally, you can check your answer by actually running the query, but please provide your thoughts before you do. The query is below. There is no FROM clause. There is no GROUP BY clause. Does it error, do you get an empty resultset, do you get a single row containing NULL, do you get a single row with data, do you get multiple rows, or something else I haven’t suggested? Enjoy.

SELECT 'No rows'
WHERE 1=2
HAVING 1=1;

Posted by Rob Farley | 4 comment(s)

Foreign Keys are great. For those of you who read some of my other posts, or heard me present on this material before, a foreign key can really help the Query Optimizer simplify out your query, because it knows that it must be able to find a match in an equality join (unless the field allows NULLs, which is a whole nother story).

I also blogged recently about the fact that a foreign key doesn’t have to reference the primary key of a table. It turns out it can reference any candidate key – that is, one that has a unique index on it. This presents all kinds of interesting design ideas, such as avoiding joins by storing a different field (such as the username, rather than userid, if the username is the one thing you keep looking up to display at the top of the webpage). Referential integrity can still be maintained happily using an index seek (just a slightly larger index, but quick nonetheless), and the benefit could be huge if many queries no longer need to do a join to get that info.
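
A rough sketch of what that design could look like (the table and column names here are invented for illustration):

CREATE TABLE dbo.Users (
    UserID int IDENTITY PRIMARY KEY,
    UserName nvarchar(50) NOT NULL UNIQUE  -- candidate key: the unique index the FK will reference
);

CREATE TABLE dbo.WebSessions (
    SessionID int IDENTITY PRIMARY KEY,
    UserName nvarchar(50) NOT NULL REFERENCES dbo.Users (UserName)  -- FK to the candidate key, not UserID
);

-- Displaying the username for a session now needs no join at all:
SELECT SessionID, UserName FROM dbo.WebSessions;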

But I found an interesting question on Stack Overflow, which was asking about whether a foreign key could reference a view. He was trying to do it for the purpose of data integrity – which would probably require an expensive trigger. But let’s talk about the Query Optimizer, which I think is another useful reason to have a foreign key hook into a view.

As a view is only a stored sub-query, it’s a strange request, but one that is definitely interesting. A view can be instantiated as an indexed view, so I thought there was definitely potential in his question.

Unfortunately I had no luck trying to help him.

My first thought was “Well, you’re going to need a unique index on the view – well, that’s doable – a Clustered Index on an Indexed View”, but you get an error saying that the foreign key needs to reference a user table. A user table? So it can’t even reference a system table. But I guess that’s fair enough these days, since we don’t really have system tables any more (they’re all system views, referencing underlying functions and the like).
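
The attempt looks something like this (hypothetical names throughout – the final statement is the one that gets rejected):

CREATE VIEW dbo.ActiveCustomers
WITH SCHEMABINDING
AS
SELECT CustomerID FROM dbo.Customers WHERE Status = 'Active';
GO
CREATE UNIQUE CLUSTERED INDEX ix_ActiveCustomers ON dbo.ActiveCustomers (CustomerID);
GO
-- This is the statement SQL Server rejects - a foreign key can only reference a user table:
ALTER TABLE dbo.Orders ADD CONSTRAINT fk_Orders_ActiveCustomers
    FOREIGN KEY (CustomerID) REFERENCES dbo.ActiveCustomers (CustomerID);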

My next thought was whether or not you could recreate something similar using a filtered index. After all, a filtered index satisfies many of the scenarios that indexed views often address. They can be unique for example, so you might use an indexed view to make sure that once a particular value is reached in a Status field, some other field must be unique. In SQL Server 2008 it’s no problem to create a unique index and provide a WHERE clause for it. But unfortunately when you create a foreign key, you indicate the table you’re referencing, not the index. I tried using a check constraint, to make sure that the foreign column could only contain values that were in the unique filtered index, but SQL just doesn’t consider that a filtered index is available for referencing by a foreign key.

If you know a way to help this guy, follow the link and provide an answer. In a couple of days’ time, the answer with the most votes will be automatically marked as the answer by the Stack Overflow system, which won’t help him much. He understands that using views doesn’t negate the FK constraints on the underlying tables, and I hope he can still work out a way of doing it without triggers, but at the end of the day, this other answer of mine, where I eventually say “Use triggers on the tables which contribute to your view” is probably the unfortunate truth.

Posted by Rob Farley | 6 comment(s)

Far too often I see SQL code which uses apostrophes around column aliases, like this:

SELECT ProductID, SUM(LineTotal) AS 'Total'
FROM Sales.SalesOrderDetail
GROUP BY ProductID;

This is fine, but the worrying thing is what happens if the user decides to use this alias in an outer query.

SELECT 'ID', 'Total'
FROM
(
SELECT ProductID AS 'ID', SUM(LineTotal) AS 'Total'
FROM Sales.SalesOrderDetail
GROUP BY ProductID
) AS p
;

Here, the outer query will assume that ‘ID’ and ‘Total’ are strings, not the names of columns in the sub-query. It’s really not pretty. The query runs, but doesn’t give the correct answers. Furthermore, if this had been done in a GROUP BY clause, or a JOIN, etc., then the problem may have been hidden even more. An error might have occurred, but only in certain circumstances.

What should have been done is to have used square brackets, like [Total], or even no brackets at all. Using the table alias in the outer query would have helped too.

SELECT p.ID, p.Total
FROM
(
SELECT ProductID AS ID, SUM(LineTotal) AS Total
FROM Sales.SalesOrderDetail
GROUP BY ProductID
) AS p
;

I’m not sure why SQL Server allows apostrophes to be used around column aliases, but unfortunately it does (and because it does now, it likely always will, if only to maintain backward-compatibility). So instead, any time you see code that uses apostrophes this way, please change it – just to help any developers that come after you who don’t understand where things can fall down.

Posted by Rob Farley | 1 comment(s)

There are so many places to ask a question these days. I get plenty of questions via MSN Msgr and email, and do my best to answer those of course. But there are many others too. I figured I’d list some of the ones that I frequent, and challenge some of the readers here to check some of them out.

The MSDN Forums are terrific. Lots of really good people hang out there, including many Microsoft staff. They’re effectively the new version of the public newsgroups. It’s definitely worth asking (and answering) questions here, and I should probably choose this option more for answering questions myself.

Experts-Exchange is a much-maligned site, largely because to ask questions you need to have points. You can get points through a paid subscription, but you can also get points by answering questions. If you answer just a few questions each month, you can become a recognised Expert on the site, which lets you ask as many questions as you like, and also gives you the option of an “ad-free” environment. Many people still joke about the way that the site reads if you ignore the hyphen, but if you are an expert, this site is definitely worth hanging out on. You can register for free (getting you no points to ask questions until you’ve started answering them) at http://www.experts-exchange.com/registerFree2.jsp, so why not go there and register, so that you can start answering questions. They have a facility so that Designated Experts can get emails for neglected questions, giving you a much better chance of an answer than many other sites around. (Note – if you are a SQL MVP, or a MS employee, and you want to be fast-tracked into receiving the Neglected Questions notices, drop me a line and I’ll see what I can do for you)

Stack Overflow is a current favourite amongst many, because of the number of people that seem to frequent the site. It’s clean (very few adverts hanging around), and people seem to rush to answer questions as soon as possible. From a purely SQL perspective, I find that there is too much weighting on the iterative languages there, so many of the SQL responses seem to be provided by people who aren’t really SQL specialists. But it doesn’t mean that you won’t pick up some good tips there. I got started there by answering a question that has even ended up in the source for the site – which I’m still hoping will reach the magical “100 up-votes”, and I’ve continued to keep my eye out for questions there that need answering.

Server Fault is the system administrator cousin to Stack Overflow. If you have DBA-style questions rather than developer-style, then this site is very useful.

Using the same interface as Stack Overflow and Server Fault, but purely focussed on SQL Server, is Ask SCC, run by the guys from sqlservercentral.com. It’s a new player on the scene, but I think it will turn into a very useful site. The Stack Overflow engine isn’t bad at all, and the quality of answers at Ask SCC is excellent. I would love to see more people hang out there, as it serves a useful market for SQL specialists. At the moment it doesn’t do much traffic, but many of the people there are good SQL experts, and I’m convinced that you’ll get an excellent answer if you ask a question there. At the moment it doesn’t seem to be collecting poor answers as much as many of the other sites, so the ratio of good answers to poor ones puts you in a good position as an asker. I’ve posted my Ask SCC and Stack Overflow ‘flairs’ here, so that you can compare the two. If the numbers on the Ask SCC one have reached as high as the Stack Overflow one, then you’ll have a good indication that the traffic on Ask SCC has increased nicely.


In many ways, I tend to find that my efforts are focussed more on the questions that aren’t getting answered, rather than trying to catch the newest questions. On many of these sites, I’d rather find the one that the asker has had trouble with, hoping to provide the elusive answer rather than the obvious one. That question that got me started on Stack Overflow was an exception because I didn’t feel like any of the previous answers had really solved the question properly, but on the whole, my approach to Stack Overflow doesn’t really fit with most of the answerers on the site. I like EE because there really seems to be a focus on getting those elusive answers for people, and I know that Microsoft really focuses on getting answered questions sorted on their forums.

My challenge to you is to give back to the community this Christmas. Make it a resolution for 2010 if you will. Why not try to answer a question every week? And better still, make it one that everyone else has had trouble answering. Go to the lists of unanswered questions, and help someone out. Next time it might be you asking, and you’ll hope that someone takes the time to find your elusive question.

Plus, you might learn something!

Posted by Rob Farley | 2 comment(s)

Grant Paisley of Angry Koala and Report Surfer put me onto this a while back, and I have to admit I’m a bit of a fan. The idea comes from the way that SQL Server Reporting Services (both 2005 and 2008) handles parameters with Analysis Services, letting you make a report that drills through into itself, deeper and deeper into a hierarchy. Today I did a talk at the Adelaide SQL Server User Group, and mentioned this was possible (but didn’t have the time to demonstrate it properly).

If you write a parameterized MDX query in SSRS, you use the STRTOMEMBER or STRTOSET function to handle the parameter. But the MDX has no other indication of what dimension, hierarchy or level is being passed in. If you grab the children of whatever you’ve passed in, you can easily put them on the Rows axis and get one level down. Pass the UniqueName of whatever you’ve just provided back in as the next parameter, and you have infinite drill-through.

Look at the following MDX query:

WITH
MEMBER [Measures].[NextLevel] as StrToMember(@SomeDate).Hierarchy.CurrentMember.UniqueName
MEMBER [Measures].[NextLevel Name] as StrToMember(@SomeDate).Hierarchy.CurrentMember.Member_Name

SELECT
NON EMPTY { [Measures].[Internet Sales Amount], [Measures].[NextLevel], [Measures].[NextLevel Name] } ON COLUMNS,
NON EMPTY { (StrToMember(@SomeDate).Children ) } ON ROWS

FROM [Adventure Works]

You see that I provide the UniqueName and Member_Name properties in known columns, so that I can easily reference them in my report. You’ll also notice that nowhere do I actually indicate which dimension I’m planning to drill down on, or to which hierarchy the @SomeDate parameter refers. I have suggested it’s a date, but only in name. At this point I also make sure that the Report Parameter is not restricted to values from a particular query, and I hide it from the user. I’m going to be passing in UniqueName values, which aren’t particularly user-friendly.

If I start with [Date].[Fiscal].[Fiscal Year].&[2003], then my NextLevels will be [Date].[Fiscal].[Fiscal Semester].&[2003]&[1] and [Date].[Fiscal].[Fiscal Semester].&[2003]&[2]. This then continues down as far as I want it to go. I could always put a condition on my Action to pick up when there are no more levels, and potentially start down a different hierarchy. After all, I can always use a bunch of other parameters in the WHERE clause to slice the cube in other ways first, for placeholders. It really just comes down to MDX creativity to investigate different ways of drilling through the data.

Please bear in mind that other people may well have achieved the same sort of thing using a different query – I’m just posting what has worked for me. Hopefully by doing this, you can avoid making five drill-through reports just because your hierarchy has five levels. This might just remove 80% of your reporting effort!

Posted by Rob Farley | with no comments

To persuade your boss to embrace an upgrade to SQL 2008, you need to know which features have high Return On Investment. They may have seen presentations talking about features like Spatial, or MERGE (and been quite impressed), but they may well have left those presentations thinking about the effort that would be involved in rewriting applications to take advantage of these features. It’s all well and good to see your customers on a map, but someone has to make that spatial data appear somewhere.

This post is a callout for features that will benefit you (and your boss) as soon as you do the upgrade (or soon after). And I welcome comments to list other items as well.

  • Block Computation (in SSAS – which reduces the effort in processing significantly, for no change in the application)
  • Transparent Data Encryption (in the Database Engine – which makes sure that data at rest is encrypted, with no change in the application)
  • Backup Compression (which reduces the size of backups, and can be set as the default so that existing backup scripts don’t need to change)
  • Data Compression (minimal change to turn on compression on tables which will compress nicely)
  • Filtered Indexes (because how far off is your next index creation, really?)
  • Auditing & Change Data Capture (because it's very easy to turn on and then review the data as you need it)
  • Export to Word in SSRS (because everyone’s wanted this for so long)
  • SSRS paging (because SSRS used to get _all_ the data for a report before rendering it – but not in 2008)
  • Resource Governor (easy to set up, nice to have in place for when you might want it)
  • Hot-add memory (so that you can just plug in more memory without having to do restarts)

I’m not suggesting that an upgrade should be done flippantly. You should still consider the effort of thoroughly testing your system under SQL 2008. But hopefully this list can highlight some of the things that I’ve found are good persuaders. A list of “What’s New in SQL 2008” can be found at http://www.microsoft.com/sqlserver/2008/en/us/whats-new.aspx

Like I said, you may have other items on your own list, and I invite you to comment on this. You may also have things in place to handle things like encryption, and you may be running Hyperbac or one of the other compression tools.

Quite a few people have asked me to blog about what I do for a date dimension table. I’m talking about a table that Analysis Services references for a Time dimension. It’s going to contain every date in a particular range, and be flexible enough to cater for public holidays and other custom details.

There are plenty of options for this, and I’ll mention some of them a bit later. What I use most of the time is an actual table in the Data Warehouse, which I populate with a row for each date in the range I want to consider. This range starts well before the earliest date I could want, and I don’t leave gaps either. Some people like to only use dates that have fact data, but I prefer to have the dates going back as far as I like.

Let’s talk about what the table looks like, and then how it can be created.

I have a primary key on an integer based on the date, in the format YYYYMMDD. So today would have the number 20091208. I haven’t tried using the date type that’s available in SQL Server 2008 for a date dimension yet – I generally try to use numbers for dimension keys, and haven’t tested the alternative yet. Using an integer like this for the key in a date dimension is generally considered best practice.

I also have a column which is the actual date itself. I will use this as the Value column for the dimension key in Analysis Services. I also have various representations of the date in string form, such as “Tuesday December 8th, 2009”, “08/12/2009”, “8-Dec-2009”. One of these will be the Name column, but I may have others available for other properties and translations. A “12/08/2009” option may be preferable for a US translation, for example.

Columns in my table should indicate which year it is, such as 2009. I’ll also throw in the start of the year (in a date format), and something which indicates which Financial Year it is. In Australia, this is most easily handled by adding six months onto the current date and considering the year of this adjusted date (our FY starts on July 1st). I can subtract the six months back again to work out what the start of the Financial Year is. I try to keep things in the code quite simple, as I leave this code with the client and hope they can maintain it as required. The trickiest I get is to use the DATEADD(month,DATEDIFF(month,0,ActualDate),0) technique for truncation, but I think this should be required knowledge when handling dates.

For months, quarters, semesters, weeks, and so on, I will also prefer to have an integer as the key. A Month Key would take the format 200912 for this month, or 201001 for next month. Quarters can be done using 20094 and 20101, and so on.

This may all seem quite complex, but it’s something you only need to do one time.

Let me explain…

My table only really contains one field. Yes, just one. More might be required for custom fields, but where possible, I will just populate one field and let all the rest be handled using computed columns.

Even the primary key will be a computed column.

CREATE TABLE dbo.Dates (
  DateKey AS CONVERT(int, CONVERT(char(8), ActualDate, 112)) PERSISTED NOT NULL
,ActualDate DATETIME NOT NULL
,CalendarYearKey AS YEAR(ActualDate) PERSISTED NOT NULL
,CalendarYearName AS CONVERT(char(4), YEAR(ActualDate)) PERSISTED NOT NULL
,CalendarYearStart AS DATEADD(year,DATEDIFF(year,0,ActualDate),0) PERSISTED NOT NULL
,FinancialYearKey AS YEAR(DATEADD(month,6,ActualDate)) PERSISTED NOT NULL
,FinancialYearName AS CONVERT(char(4),YEAR(DATEADD(month,6,ActualDate))-1) + '/' + RIGHT(CONVERT(char(4),YEAR(DATEADD(month,6,ActualDate))),2) PERSISTED NOT NULL
,FinancialYearStart AS DATEADD(month,-6,DATEADD(year,DATEDIFF(year,0,DATEADD(month,6,ActualDate)),0)) PERSISTED NOT NULL
,MonthKey AS CONVERT(int, CONVERT(char(6),ActualDate,112)) PERSISTED NOT NULL
,MonthName AS CASE MONTH(ActualDate)
                 WHEN 1 then 'Jan'
                 WHEN 2 then 'Feb'
                 WHEN 3 then 'Mar'
                 WHEN 4 then 'Apr'
                 WHEN 5 then 'May'
                 WHEN 6 then 'Jun'
                 WHEN 7 then 'Jul'
                 WHEN 8 then 'Aug'
                 WHEN 9 then 'Sep'
                 WHEN 10 then 'Oct'
                 WHEN 11 then 'Nov'
                 WHEN 12 then 'Dec'
               END
                  + ' ' + CONVERT(char(4), YEAR(ActualDate)) PERSISTED NOT NULL
,FrenchMonthName AS CASE MONTH(ActualDate)
                 WHEN 1 THEN 'janv'
                 WHEN 2 THEN 'févr'
                 WHEN 3 THEN 'mars'
                 WHEN 4 THEN 'avr'
                 WHEN 5 THEN 'mai'
                 WHEN 6 THEN 'juin'
                 WHEN 7 THEN 'juil'
                 WHEN 8 THEN 'août'
                 WHEN 9 THEN 'sept'
                 WHEN 10 THEN 'oct'
                 WHEN 11 THEN 'nov'
                 WHEN 12 THEN 'déc'
               END
                  + ' ' + CONVERT(char(4), YEAR(ActualDate)) PERSISTED NOT NULL
--Many more columns following
);
GO

You will notice that I have used ugly long CASE statements for the MonthName columns. I do the same for the names of the days of the week. The reason is betrayed in the second example. DATENAME (or any kind of conversion that relies upon the language setting, such as CONVERT(char(3),ActualDate,100)) is non-deterministic, and therefore can’t be used in a persisted computed column (I do wish that CONVERT could take a Language setting, so that I could tell it to convert in English, French, etc, and make it deterministic). Why do I want them to persist? Well… I’m just more comfortable with them being persisted. After all, I could use a view for the whole thing at this stage, but I’m really not that comfortable with the table being generated on the fly when it comes to processing. The table is essentially read-only, after all.

As well as many computed columns like this, I will also have some that are not computed, such as a column to indicate if it’s a public holiday. This could be computed, at a push, as public holidays generally follow a system. Even Easter follows a formula that could be applied. But if the company takes a special day, or if government declares an extra day for some reason, then problems can start popping up. I find it convenient to have columns that can be updated directly (but which have defaults, of course).

One great thing about this method is that it can be populated very easily: the only field you insert data into is the ActualDate column. Generating a list of dates is as easy as using DATEADD() with a nums table, as I’ve written many times before, including in this StackOverflow question; a rough version is sketched below. If you need more dates, just insert more.
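Here’s a minimal sketch of that population step. I’m assuming a start date and a row count, and standing in for a dedicated nums table by numbering the rows of a cross join over sys.all_objects; only ActualDate is inserted, and the computed columns take care of themselves.

WITH nums AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects AS o1
    CROSS JOIN sys.all_objects AS o2
)
INSERT dbo.Dates (ActualDate)
SELECT DATEADD(day, CONVERT(int, n), '20000101')   -- assumed start date
FROM nums
WHERE n < 7300;   -- roughly twenty years of dates; an assumed count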

As I mentioned before, a view could be used for this. It is very easy to generate a list of dates, and all the other calculations could then be done as extra columns in the view. You could perform an OUTER JOIN to a table which lists public holidays and other special days. Analysis Services will happily handle this in much the same way. I just prefer to have it exist as a table, which I feel gives me more control.

This post has been part of T-SQL Tuesday, hosted this month by Adam Machanic. You should be able to see many other posts related to datetime mentioned as Trackbacks to Adam’s post.

Posted by Rob Farley | 2 comment(s)

I recently needed to randomise some data to keep some information secret. The idea was that it would look roughly similar to the real data, but be sufficiently different to avoid any identifying features.

After discussing it with the client, it was agreed that…

1/ ID numbers would be mixed around the people in the list. They would all still be real numbers (and would therefore match the rules governing what makes up a legitimate number), but they would be reordered at random amongst the people.

2/ Dates would be set randomly between the minimum and maximum dates available.

3/ Strings would become a series of random letters, but the same length as the original.

4/ Genders would be assigned a random value of M, F or N (Not Specified).

5/ Numeric fields (such as salaries) would be multiplied by somewhere between 0.1 and 10, with 1 being the median value used.

Here’s how I did it.

1/ I used row_number() for this, twice: once ordered by the original ID field, and once ordered by newid() (which gives a good-enough random order). I could then perform a self-join and do the update.

with twonums as
(
    select row_number() over (order by id) as orig_rownum,      -- position by original ID
           row_number() over (order by newid()) as new_rownum,  -- position in a random order
           *
    from dbo.People
)
update t1
set id = t2.id
from twonums t1
     join twonums t2
        on t1.orig_rownum = t2.new_rownum
;

This mechanism takes advantage of the fact that you can update a CTE. The fact that row_number() assigns each number exactly once means that I update every row, and no row gets updated twice.

2/ To generate a random positive value less than some number N, I use abs(checksum(newid())) % N. Apparently this gives a good distribution of values. If N is the number of days between two dates (plus one, in case the two dates are identical), then the result can be added back onto the first date to get a random date between the two.

update dbo.theTable
set theDate = (
    select dateadd(day,
                   abs(checksum(newid())) %
                       (datediff(day, min(theDate), max(theDate)) + 1),
                   min(theDate)
                  )
    from dbo.theTable
);

If you prefer, you could populate variables @startDateRange and @endDateRange and then use them instead of having dbo.theTable in the sub-query like this. The Query Optimizer should be able to kick in and work out those values for you just once anyway (which it did when I checked the Execution Plan).
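As a hedged sketch of that variable-based alternative, using the variable names mentioned above:

declare @startDateRange datetime, @endDateRange datetime;

select @startDateRange = min(theDate),
       @endDateRange   = max(theDate)
from dbo.theTable;

update dbo.theTable
set theDate = dateadd(day,
                      abs(checksum(newid())) %
                          (datediff(day, @startDateRange, @endDateRange) + 1),
                      @startDateRange);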

3/ Without stepping through each character in a string, it doesn’t seem particularly trivial to change each one to something different. For this, I took advantage of SQL 2005’s ability to use expressions with the TOP clause, and the string concatenation feature available from FOR XML PATH('').

Using any table with sufficient rows in it in my FROM clause, I generated random letters by converting a number from 0 to 25 to a letter. Adding the number to ascii('A') and converting back to a character did the trick. Restricting the number of rows returned to the number of characters in the name gave me a set of characters, which I could easily concatenate using FOR XML PATH('').

select top (len(isnull(GivenNames,''))) char(abs(checksum(newid())) % 26 + ascii('A'))
from sys.all_objects
for xml path('')
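To use that within the update itself, the expression can sit as a correlated scalar sub-query in the SET clause. Here’s a rough sketch following the GivenNames example (treat the table and column names as illustrative):

update dbo.People
set GivenNames = isnull((
    -- correlated on the length of the original value
    select top (len(isnull(GivenNames,''))) char(abs(checksum(newid())) % 26 + ascii('A'))
    from sys.all_objects
    for xml path('')
), '');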

4/ Assigning a random gender to a row was very easy. I simply took a random value between 0 and 2 and used it with CASE.

case abs(checksum(newid())) % 3 when 0 then 'M' when 1 then 'F' else 'N' end

5/ Finally, multiplying by a value between 0.1 and 10. It’s easy to generate a value between 0 and 99, add one to it and divide by 10.0 to get values in this range, but this isn’t really what’s desired, as it would give a distribution centred around five. The distribution that I want is actually logarithmic, giving roughly as many values less than 1 as there are greater.

Really what I wanted was to get a number between –1 and 1, and use 10^N, as 10^(-1) is 0.1, 10^0 is 1, and 10^1 is 10. This seemed quite easy, except that the POWER() function in SQL gives integer results when it’s given an integer base. I could easily generate a value in the range –1 to 1: I simply used checksum(newid()) % 1001 (ignoring the ABS() function this time), and divided by 1000.0. But then, to raise 10 to the power of this value, I remembered the logarithm rules from school, which say that x^y is the same as the exponent of log(x) * y. Therefore, I used:

exp(log(10) * (checksum(newid()) % 1001 / 1000.))

…which did the trick nicely.

The client verified that the data was sufficiently random (as well as expressing some surprise at there being a practical use for log() and exp()), and I had an environment to which I could grant developers access.

Posted by Rob Farley | with no comments

I’ve presented this material at three conferences recently, so it’s about time I wrote a blog post on it...

As programmers, we love modularisation – even in the SQL space. We make stored procedures, views, and functions to encapsulate our code. This improves maintainability, simplifies the development experience, and is generally useful.

But there’s a time when it’s a bad thing for SQL Server.

There’s an amazing component of SQL Server called the Query Optimizer (I always want to write Optimiser, but I’m assuming it’s a proper noun and putting up with the US spelling). When we write queries in T-SQL, it’s the Query Optimizer that works out how to actually run the query. It works out what indexes can be used to improve performance, what order tables (well, indexes and heaps) should be accessed, how to perform the joins, and so on. I find that a rough appreciation of the power of the Query Optimizer can really help query writers.

For example, the Query Optimizer will translate a correlated sub-query in the SELECT clause into a LEFT OUTER JOIN, so that you don’t have to. It will also work out when joins can be rendered pointless and thereby removed from the plan altogether. If you let these principles help you in your query design, you can see significant benefits. It also helps you write queries that are easier to maintain, as there’s little point in trying to be clever by writing a query in a different way if the Query Optimizer will handle it in the same way as before.
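To make that concrete, here is a hedged example against hypothetical Customers and Orders tables; the Query Optimizer will typically treat the two forms the same way:

-- Written with a correlated sub-query in the SELECT clause...
SELECT c.CustomerID,
       (SELECT MAX(o.OrderDate)
        FROM dbo.Orders AS o
        WHERE o.CustomerID = c.CustomerID) AS LastOrderDate
FROM dbo.Customers AS c;

-- ...and the equivalent LEFT OUTER JOIN the Optimizer effectively arrives at
SELECT c.CustomerID, MAX(o.OrderDate) AS LastOrderDate
FROM dbo.Customers AS c
LEFT JOIN dbo.Orders AS o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID;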

If you use a view in another query, the definition of the view is used in the query as if you had written it with a sub-query. A view is simply that – a stored sub-query. They are sometimes referred to as ‘virtual tables’, but I disagree. They are stored sub-queries. Sure, the analogy falls down when you start considering indexed views, but on the whole, a view should be seen as a stored sub-query. The Query Optimizer takes the view definition, applies it in the second query, simplifies it where possible, and works out the best way of executing it. If you’re only interested in a couple of columns out of the view, the Query Optimizer has an opportunity to take that into consideration.

Stored procedures are different. You can’t use a stored procedure in an outer query. The closest you can get to this is to use OPENROWSET to consume the results of a stored procedure in an outer query, but still the whole procedure runs. After all, it’s a procedure. A set of T-SQL commands, not a set of queries. I see the clue to this as the BEGIN and END that stored procedures generally use. I like stored procedures, but I do get frustrated if they’re returning more information than I need, since I have no way of letting the system know that maybe it doesn’t need to do as much work.
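For what it’s worth, that OPENROWSET approach looks something like the sketch below. The connection string, procedure name and column name are placeholders, and the 'Ad Hoc Distributed Queries' option needs to be enabled; note that the whole procedure still runs regardless of which columns you then ask for.

-- Consuming a stored procedure's result set in an outer query (illustrative only)
SELECT r.SomeColumn
FROM OPENROWSET('SQLNCLI',
                'Server=(local);Trusted_Connection=yes;',
                'EXEC dbo.uspGetAllCustomers') AS r;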

Functions are in between, and come in two varieties. A function can be inline, or it can be procedural. I don’t think you find this differentiation in many places – and normally people talk about this particular drawback as being associated with Scalar Functions as compared to Table-Valued Functions, but the problem is actually one of simplification.

An inline function must be a table-valued function at this point in time. It takes the form:

CREATE FUNCTION dbo.fnFunctionName(<paramlist>) RETURNS TABLE AS
RETURN
( SELECT …. );

It is always this form, with a sub-query enclosed in a RETURN statement. It can return many columns and many rows, but the definition of the table is implied by the SELECT clause. This is essentially a view that can take parameters.

The other form is one that involves BEGIN and END. Scalar functions (unfortunately) require this (but hopefully one day will not).

CREATE FUNCTION dbo.fnFunctionName(<paramlist>) RETURNS int AS
BEGIN
RETURN ( ... )
END;

As the RETURN statement is enclosed between a BEGIN and END, it can be preceded by other statements, used in working out what value should be returned.

Table-valued functions can use BEGIN and END, when multiple lines are required to calculate the rows in the table being returned.

CREATE FUNCTION dbo.fnFunctionName(<paramlist>) RETURNS @table TABLE (<fields>) AS
BEGIN
...
RETURN
END;

In this kind of function, the table variable is populated with data, and returned to the outer query when the RETURN command is reached.

But when the Query Optimizer comes across a procedural function, it cannot simplify it out and executes the function in a different context.

The execution plan will report that the cost of running the function is zero. But it’s lying. The way to see the impact of the function is to look in SQL Profiler, where you’ll see potentially many calls to the function, as it needs to work out the result for each different set of parameters it’s passed. The pain can be quite great, and you would never notice it if you only looked at the Execution Plans.

The moral of the story is to make sure that your functions are able to be simplified out by the Query Optimizer. Use inline table-valued functions even in place of scalar functions. You can always hook into them using CROSS/OUTER APPLY in your FROM clause, or even use them in your SELECT clause (not “SELECT Claws” – that would make it related to my company LobsterPot Solutions, and “SELECT Claus” is just a bit Christmassy) using a construct like SELECT (SELECT field FROM dbo.fnMyTVF(someParam)) ...
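Here’s a small, hypothetical sketch of what that looks like in practice: an inline TVF standing in for a scalar “customer total” function, consumed via CROSS APPLY or in the SELECT clause (the function, table and column names are invented for illustration):

CREATE FUNCTION dbo.fnCustomerTotal(@CustomerID int) RETURNS TABLE AS
RETURN
(   SELECT SUM(o.TotalDue) AS CustomerTotal
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID
);
GO

-- Hooked in through the FROM clause, where the Query Optimizer can simplify it out
SELECT c.CustomerID, t.CustomerTotal
FROM dbo.Customers AS c
CROSS APPLY dbo.fnCustomerTotal(c.CustomerID) AS t;

-- ...or used in the SELECT clause, as described above
SELECT c.CustomerID,
       (SELECT CustomerTotal FROM dbo.fnCustomerTotal(c.CustomerID)) AS CustomerTotal
FROM dbo.Customers AS c;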

Consider the Query Optimizer your friend. Study Execution Plans well to look at how the Query Optimizer is simplifying your query. And stay away from BEGIN and END if possible.

Posted by Rob Farley | 1 comment(s)

Speaking at two SQL conferences in the last two months (SQL Down Under in New South Wales, and SQLBits V in Old South Wales), I’ve had some flights to do. This isn’t normally a big deal, but both times I managed to have some stress getting home.

Firstly, I should point out that both conferences were really good. Very different to each other – SQL Down Under was held at a university campus in a country town, SQLBits was in a 5-star hotel with conference centre – but both great events. There’s something about having a conference with a dedicated technology that makes it special. At TechEd you brush shoulders with people who have very different areas of expertise, but at a dedicated SQL conference, you end up having a lot in common with just about everyone.

At SQL Down Under I got to catch up with many people from around Australia that I see only a couple of times a year. Friends that I know from previous trips to Wagga, or from user groups I’ve visited, TechEd, even the occasional class I’ve taught. The content is always good, and it’s great to see people honing their skills in presenting. This year one of the highlights was seeing John Walker present for the first time.

At SQLBits, I got to meet many people for the first time (first time I’ve done a SQL conference in the UK). I got to see old friends like Simon, Jamie, Tony & Darren again, and meet people like Chris, Chris, Allan, James & Martin (of course there are many more names I could list). I had never heard any of these guys present before, so I tried to get around to as many sessions as I could. I was disappointed that the sessions I was giving clashed with Brent’s, but I was pleased that I could meet him for the first time.

Coming home from Wagga, I had to meet a flight taking me from Melbourne to Adelaide. I had allowed plenty of time to make the transfer, but when the flight out of Wagga was well over an hour and a half late, I knew I couldn’t make it. There was a fair crowd of SQL people at the airport, so we were joking about different tactics that could be used to help me make the connection. The flights were with different carriers, so apart from letting me check in for the second flight on their computer, there was nothing the Wagga staff could help with (they were very nice and helpful though, and let me use their printer and everything). When I got to Melbourne, it turned out that the flight I was booked on had been cancelled and my ticket transferred to a later flight, which I managed to catch. Home later than expected, but crisis avoided somehow…

Not so lucky on the way home from the UK. My flight to Australia stopped at Bangkok on the way, and as I got off, the crew were saying that we had an hour and a half. I got back to the gate in about an hour and a quarter, only to be told that I was too late. Apparently the 90 minutes was from the wheels touching down to the wheels taking off again, and we only had about 30 minutes in which to get back to the gate (bearing in mind that at Bangkok airport you need to wander down from the gate to a security area, and get re-admitted to the Departure area, before returning to the same place you got off the plane in the first place). 24 hours later I got on a flight to Australia, but not before a stressful night trying to work out how best to get a replacement ticket, considering that nowhere in Bangkok was open for the first 16 hours I was accidentally in Thailand.

It hasn’t put me off the idea of travelling to conferences. Everything that happens gives me a story to tell, and I guess these last couple of months have just given me more stories than I expected. If you’re into SQL, and there’s a SQL conference near you, you should really try to get to it. Just pray that you have a better time getting home than I did.

Posted by Rob Farley | with no comments

My talk on StreamInsight is up next. I’ll try to blog more about that later. For now, I want to mention more about SQLBits itself. This is by far the largest SQL-only conference I’ve attended (I haven’t been to SQL-PASS yet), and it’s great to be involved.

Yesterday I had an all-day seminar about the new items for Developers in SQL 2008. It was a good time – the delegates responded very positively, and many of them have caught up with me since.

But for me, the conference has been a great way of catching up with (and meeting for the first time) a bunch of SQL people that I rarely see. I’ve met people that lived only a few miles from where I grew up, and people that read my blog (Hi!), discovered people who have connections to Adelaide, and even found that my Adelaide friend Martin Cairney (who is also here) has a strange connection to Donald Farmer (of Microsoft), in that their parents shared a back fence or something… Now Trevor Dwyer tells me a colleague of his knows me from somewhere… the world is very small here.

My StreamInsight talk will be interesting I hope. I have some stuff to show off, and I plan to involve the audience a little as well. If you’re at SQLBits and feel like being involved in an interactive session, then definitely come along. I want to hear from people in the audience who have dabbled with StreamInsight and also other vendors’ Complex Event Processing offerings. This is a brand new technology from Microsoft, and there will be a large range of adoption levels in the room.

I recently gave a talk in New South Wales, so now I’m going to give one in Old South Wales. In Newport, to be precise.

As I’ve written before, I’ve been a big fan of the SQLBits conferences that are run by many UK-based friends of mine. Unfortunately for them, they had a presenter pull out recently, and unfortunately for them, I’m going to fill in.

Weather-wise, it’ll be a nice change from the scorching weather we’ve had in Adelaide recently. We’re setting new records for consecutive days over 30C here; the streak will be broken on Monday if the temperature drops to 28C, before the heat climbs again after that. I’ll be going to temperatures which are more like 40F than 40C. I think I’ll be the one wearing two jumpers and a coat.

I’ll be involved in all three days of the conference, doing a full day of SQL 2008 for Developers on the Thursday, and hour-long sessions on each of the following two days, on the topics of StreamInsight and Query Simplification respectively.

It’s a great opportunity to be involved, and I’m sure it’ll be a good time. There are several tracks, and the quality is bound to be high. I’m planning to attend sessions by friends, lots of whom I’ve never heard present before.

If you’re going to be in the UK on Nov 19-21, make sure you get along, and say hi! I’ll also be receiving delivery of my (signed) copies of the SQL Server MVP Deep Dives book, which is going to be good too.

Posted by Rob Farley | with no comments

This is really just a reminder blog post for me. Way too often I find that I have closed the Microsoft File Transfer Manager for one reason or another, and I want to start it up again to resume some download from my MSDN Subscription. Like today: I need to grab the latest version of SQL Server 2008 R2, which includes built-in Split, RegEx and Fuzzy matching features for T-SQL (something I’ve wanted for a long time, and that I’ll blog more about later, once I’ve had a chance to try it out). It’s a large download: not something I want going when I’m on a 3G connection, but something I’m happy to use a WiFi connection for. And there’s just the odd time when I’ve forgotten I’m downloading something and it’s sitting there, partly downloaded…

So I find myself looking for the application to run. When it installs by default because you click on a download, it doesn’t put a shortcut in the Start Menu anywhere. So I’ve been known to go and kick off a download again (and then tell it to Cancel rather than agreeing to it), just to get the process running. Then I can jump into PowerShell and run this command:

get-process -name transfermgr | select path

…which tells me that the path of the transfermgr process is at c:\windows\downloaded program files\TransferMgr.exe

So now I can just run it from there. The idea of this post is to remind me where it is, so that I don’t have to hunt for it every time.

Posted by Rob Farley | with no comments