Normalization really does matter

Published Thu, Sep 7 2006 15:30 | William

It seems like years ago, but I remember one of my professors giving us an assignment. It involved taking a database chock full of redundancies and writing the SQL Statements to make a few small updates.  We thought it was an assignment to test our SQL Skills.  However we were graded on how close to 'right' we got all of the changes.  It wasn't about SQL at all, he could care less about SQL at this point. It was about illustrating, with much painful monotony, why "Normalization matters."

After we turned the assignment in, he asked everyone how long it took.  Suffice to say it took everyone a long time.  But the interesting thing was that not one single person got all of the changes correct. Many got close, but no one got all of them.

Then he explained some more why normalization matters.  You see, not only did working with non-normalized data (as opposed to denormalized) take forever to do, it was very error prone.  No matter how careful you are, if you do the same thing over and over, you're likely to miss something.

Afterward, he proceeded to give us a story from the 'real world'.  He told us how some things in CS have great theoretical value but not much practical value.  But he said that in general, good theory makes for good practice and no where is this more true than in DB Design.  We were told that when we get out into the real world, we'll hear all sorts of 'Senior' folks trashing normalization as ivory tower stuff that works great in college but doesn't have value in the real world.  He said that *when* we hear this, to immediately ask the person "At which level doesn't normalization make sense in the 'real world'?  All of them or just certain forms"  This was because, in his opinion, the only people that will trash normalization are ones who don't understand it.  Granted, there were many caveats in the lectures.  Sometimes it's best to denormalize (which, is something you do AFTER you've normalized - for the uninitiated).  Many will design garbage and when the redundancies are pointed out will say "I totally do believe in normalization but I denormalized the design".  If it wasn't normalized in the first place it couldn't have been denormalized, such statements are mere excuse making.  He also pointed out that sometimes when time is of the essense and maintainability isn't a real issue, then normalization may not justify the costs.  There were definitely exceptions to his rule, but one thing that always stuck with me was that folks that trash normalization almost always have never studied it very far.

So why am I babbling about this?  It's simple.  Some things can be totally screwed up and then fixed quickly/easily/inexpensively.  I could get the color scheme of a CSS layout totally upside down and fix it pretty quickly in many cases. Other things are a little more difficult but still doable.  Let's say I didn't use inheritance and coded some total redundant classes.  Making the code modification won't be trivial but if my code is modular, it should be fairly straightforward considering the development tools in place these days.  Other things though are deal breakers.  Absolute and utter deal breakers.  Database design is the epitome of a deal breaker.  To those new to development, you may be asking why DB Design is so vastly different from other design issues.

Just imagine the following scenario.  You have a Dog and a Cat class and they are in a running production app that's working great. You decide that you should have had a Mammal base class and decided to implement it.  You guess it'll take you about 16 hours to find every instance and make the appropriate changes.  So the app runs along, users are happy, you take your 16 hours, make your changes , test them (or test them, make them for many) and redeploy.  No problems. Users are none the wiser and everyone's happy.

Now what if this was a production database?  Well, if you have downtime on the db then you can create a replica of the data, make your changes, verify they worked correctly and then implement them at some time when no one is using the system.  I've only worked in two places where such was the case.  If the design flaws go out the door on a live system, you're stuck.

In the Dog Cat example, if you totally screwed something up, you can redeploy the previous version and in many cases it won't be too big of a deal (in some it obviously will but that's related to installation more than anything else).  But what if you damaged a lot of data, the impact very well may (and probably will) be immediate.  Moreoever, you are probably dealing with a moving target b/c the data is constantly changing. Even if you think you coded everythign correctly, new data may be introduced that you weren't counting on. And even then, even assuming you got everything right, many changes of this sort are going to cause service interruptions.  Depending on what the db is used for, this could be a huge deal.

We all know that the further downstream it is that a bug is caught the more it costs to correct. We've all seen the upward trending curve.  The thing is that in most case, with DB changes in a production system, that curve is very short and very steep.  And if you do any of the following, you should be flogged:

  • Use Reserved words as object names, column names etc (they are called Reserved words for a reason)
  • Use inconsistent naming conventions
  • Use unintelligible naming conventions
  • Ignore data types.  NVARCHAR isn't the solution to everything. Yes, I know you can cast dates held in VARCHAR fields to DATETIME Values but that's irrelevant
  • Fail to use keys. At a minimum you're going to have a lot of trouble getting tools to update data but that's the least of your worries
  • Use multiple fields to represent similar data and then using inconsistent rules for maintaining the data that they are populated with
  • Fail to enable integrity constraints. I once heard a developer, with a straight face, say he didn't believe in integrity contsraints b/c he was a 'real developer' and preferred to write his own.  He contended that he wanted to make sure things were done right as opposed to being lazy and leaving it to Oracle to handle.  You have to have the initials CC or have forgotten to take your lithium if you think you're going to do a better job writing your own integrity constraints than Oracle/Microsoft etc are.  And for those of you that are going to get all pedantic and tell me of some obscure bug or situation where MS's or Oracle's constraints fail, I'd counter with this.  Use their AND use your own.  You can't guarantee that they only way the data will be accessed is through your killer app and it's uber integrity algorithms. So why leave it bare?  If you're constraints are really that good, then it shouldn't ever be an issue anyway right?
  • Freestyling!  Freestyling is cool as long as it's done in a Rap/BMX/Skateboard etc context.  It has no place in the development world other than in R & D scenarios.  RDBMS systems are well tested and rooted in very strong theoretical foundations.  They have stood the test of time and they work.  They aren't perfect but they are damn good and they work well when used correctly.  Unless you really are a super genius, chances are that you're not going to have some superior way of doing things that the rest of the world has missed. And even if you did  find one, the chances of you finding another one are slim to none.  And why do these folks never seem to freestyle with other things like addition, subtraction, multiplication, division, gravity, intertia etc?  What do you think would happen if you owed the IRS 10k and used your own math to compute that they actually owed you 10k?  Why would it be any different with anything else? So if you're some Maverick has a ton of 'unique' ways of building your dbs - all you're doing is shoving the burden of your sloppiness onto someone else.
  • Failing to at a minimum follow the first two normal forms.

It's a matter of "you can get with the program or you can pay a ton of money to have someone else fix it, or spend a ton of time down the road fixing it yourself."  And if this happens, you can't cry foul - you're the one that decided to break the rules. 

Comments

# It's Way Too Early For This said on September 7, 2006 5:57 PM:

Some good points in here about normalization. And he does make the point that sometimes it is best to

# Brian Madsen said on September 8, 2006 9:35 AM:

Bill,

that is probably one of the best posts you've made in a long time...as you may recall, my pet-hate/pet-love is developers and their sometimes absolute lack of knowledge on databases and the rules governing them.

<rant id="near_hysteria" runat="server">

At work, i have the pleasure of maintaining the database server (yes i'm not a DBA, but my department is a bunch of top notch guys) which contains probably one of the worst written database in histor.

It knows nothing about:

1) Data integrity, not one single foreign key in the entire database schema

2) normalisation, there are tables with an absurd amount of columns in them...

ex. a Customer table.

usually holds primary key(s), immediate information such as username and passwords, email addresses...but we go further here, it keeps absolutely everything about the client in that table...it then uses that tables foreign key in other tables to link them together (such as products and services the client has, invoices, logs etc) but in those tables there is not relationship set up to the "Customer" table. it just fully trusts that there will never be an error inserting data, or that some data will be corrupted...no, this pleasant application (a billing and provisioning system called Platypus for ISPs..owned by TuCows now) trusts it's data explicitly.

overall you could overcome this with modern technogies..say object oriented databases or by using strongly typed dataset ala ADO.Net..but HA..no, this application is written in Visual FoxPro..to make things worse, it allows you to extend the database by adding "custom tables" to the schema, via the application UI itself..again, no relationships or normalisation here..and to make matters worse, we have had more than 300 tables in this schema at some points.

whoever wrote this appliation should be shot..or even better, have a funnel down their throat and a corner office desk pushed through.

Anyways, now with more than 20.000 customers in the database performance is absolutely shocking...to the extent that corruptions are occuring on almost a weekly basis, which of course has to be fixed by C'est Moi..

Since taking over the department and getting a very depe insight into the applications and systems spread out through the company, this database backed application should never have seen the light of day.

i would give my right testicle for a solution i could replace this piece of crap with..it is the ultimate proof that there should be some divine governing body to regulate developers..

</rant>

# Brian Madsen said on September 8, 2006 9:38 AM:

"it then uses that tables foreign key in other tables to link..."

should have been...

"it then uses that table's PRIMARY KEY in other tables to link..."

# William said on September 8, 2006 9:50 AM:

Then it's good to hear I'm not alone.  I've worked with so many people who come up with utterly lame reasons not to normalize and in every case, they have never studied relational theory to make even a modest argument against it.

I think there's two big issues. One is normalization. The other is sloppy implementation.  using reserved words as field names isn't a normalization issue.  But it still sucks almost as much.  Slapping 50 indices on a 20 field table isn't a normalization issue but it does present overwhelming issues in most cases.  On and on.

So I ask you Brian - why is it that in spite of overwhelming evidence for and virtually no evidence against - good design is still the bain of many?  And as far as violators being shot, the first country to make it a crime to violate good db design practices will be my new home the day after the law passes.

# Brian Madsen said on September 8, 2006 10:16 AM:

I think the answer to your question is simple...

and here i'm most likely going to be shot down in flames by some utter brain-lacking moron (pardon the language here, but this is one aspect of development i'm VERY fanatic about)..

here we go:

The industry boomed, and brought a ton of "developers" out of the bush who's only really in the industry for a paycheck..they don't have the same passion for development that this industry really deserves, so they come into work every day, put in 8hrs in their chair, and thinks that's the end of the story..they don't take pride in having a working application implemented while knowing that the engine is going to keep on ticking and ticking and ticking..

I see it like this..developers are divide into two main groups..Aston Martin and Hyundai..they both do the same thing - bring you from point A to point B, yet you can tell you've just paid $5.000 for the car because you find nearly any excuse you can to catch the bus.

Real life story here:

my senior developer and myself, where told to take a look at two of our old legacy applications at work...one is a very very important MS Access form application, backed by an SQL Server..the second is another MS Access form application, also backed by SQL Server..

now, first we found that there was absolutely no documentation on either of the applications (both are developed in-house), so we took time to have a talk to the "manager of the DB department" who built parts of the first application..we asked him to send us the DB schemas of the DB first so we could see what type of data we were going to be dealing with..he did, and we both sat shocked for nearly 20 minutes at how badly this database was structure...first off, second pet-hate of mine is some idiot using MS_ID as the primary key for every single table..yes, you read that right..each and every primary key in that database had the same name...it did contain relationships so data integrity should be ok..well, to our shock and horror it wasn't..the application had been built years ago by a technician of sorts who bought an MS Access book..so can't blame that gentleman..but, the "manager" we spoke to is a VERY VERY VERY known figure in the SAP space..yet, he had continued to patch this application in the last year or so by using the exact same methodology as the technician initially did.

So who here is the culprit that i should burn at the stake?? the technician or the "manager"??

i think the answer here is simple..the "manager" should be quartered and drowned by pushing pages of a relational database book down his throat..

and that was when i realised that not everybody working in this industry is as passionate about what we do as i was..can we say reality check!!

# Brian Madsen said on September 8, 2006 10:31 AM:

Here in Australia we have a gentleman known as Greg Linwood..

now 5-6 years ago i had no idea who this guy was..but after having read his responses in an SQL Server newsgroup i've seriously gotten some respect for him...not only does he know what he's talkign about, but he also steps outside of the box when he deals with technical issues.

He's going to present in Perth in a week or so..once for my group, Perth SQL Server User Group and once for Perth .Net Community of Practice..both are going to deal with indexes and how to deal with them..one of the sessions are geared towards developers (durr) and the other towards database administrators - needless to say, i'm going to attend both sessions as he's going to touch on some of the things in the industry that i'm extremely keen to know more about.

He recently dispelled the myth about Clustered Indexes and i think i'll ask his permission to post some of his answers on my own blog as he's absolutely right..

Anyways, Greg is one of those guys that steps over the boundaries and takes a stance on technology which clearly indicates that he doesn't compromise when it comes to SQL Server...

seriously wish you could attend these two session Bill..i think you would have gotten a lot out of it.

# Brian Madsen said on September 8, 2006 10:35 AM:

Just going to post Greg's blog link..

http://blogs.sqlserver.org.au/blogs/Greg_Linwood/

if there's a guy out there that i have more respect for regarding SQL Server i'd be very very surprised....

# Brian Madsen said on September 13, 2006 11:00 AM:

Bill seems to have dissapeared off the face of the planet again......

# Brian Madsen said on September 21, 2006 12:17 AM:

Yep - he's officially gone underground... can't raise him on messenger or email :(

# Brian H. Madsen - .Net Powered by Caffeine said on September 28, 2006 12:27 AM:

Now, i&#39;m not going to mention any names here, because somebody may get offended. but - i&#39;ve been

Search

This Blog

Tags

Community

Archives

News

My other sites

Cool Stuff

Book Stuff

Security

ORM

Data Access

Funny Stuff

Compact Framework Stuff

Web Casts

My KnowledgeBase Articles

My MVP Profile

Design Patterns

Performance

Debugging

Remoting

My Fellow Authors

My Books

LINQ

Misc

Speech

Syndication

Email Notifications