IsAlive and LooksAlive, and “Physical Disk Resource”
Ok, so this is my first blog post, oh boy; to pick a topic, well here goes: let's talk about disk IsAlive and LooksAlive, and about what exactly the “Physical Disk Resource” is in Cluster. The reason is that I want to build up some general knowledge about terminology and operations before I start to bore you with troubleshooting stories.
IsAlive and LooksAlive:
This is my current understanding of how this works for clustered disks in Windows 2003. Many definitions are given to these terms. In order to troubleshoot clusters in general, we need to understand what tests MSCS executes on “Physical Disk Resources”. Only when we understand the different checks and the mechanism of the cluster itself can we understand the messages that may be logged in the cluster log or system event log. MSCS conducts two kinds of tests: the first is the “IsAlive” and “LooksAlive” mechanism, which is executed by the Resource Monitor / resource.dll; the second is the actual device checks, which are performed by the “clusdisk.sys” filter driver.
File system level checks: At the file system level, the Physical Disk resource type performs the following checks:
LooksAlive: By default, a brief check is performed every 5 seconds to verify that a disk is still available. The LooksAlive check determines whether a resource flag is set. This flag indicates that a device has failed. For example, a flag may indicate that periodic reservation has failed. The frequency of this check is user definable.
IsAlive: By default, a complete check is performed every 60 seconds to verify that the disk and the file system can be accessed. The IsAlive check effectively does the same thing as a “dir” command that you type at a command prompt. The actual call it uses is the “FindFirstFile()” API. The frequency of this check is also user definable.
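To make the difference between the two checks concrete, here is a minimal Python sketch. It is an illustration only, not how the resource DLL is implemented: the names DiskResourceState, looks_alive and is_alive are hypothetical, and os.scandir is used as a stand-in for the FindFirstFile() call described above.

```python
import os


class DiskResourceState:
    """Hypothetical stand-in for the state kept for one Physical Disk resource."""

    def __init__(self, mount_path):
        self.mount_path = mount_path  # drive letter or mount point
        self.failed_flag = False      # assumed to be set by the device-level checks


def looks_alive(state):
    # Brief check: only look at the failure flag, no I/O at all.
    return not state.failed_flag


def is_alive(state):
    # Complete check: try to enumerate the root directory, similar to "dir".
    # os.scandir is an assumption standing in for FindFirstFile().
    try:
        with os.scandir(state.mount_path) as entries:
            next(entries, None)  # reading the first entry is enough
        return True
    except OSError:
        return False
```

Note that in this sketch looks_alive never touches the file system, and is_alive never looks at the flag, so the two checks can disagree with each other.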
Device level checks: At the device level, the clusdisk.sys driver performs the following checks:
SCSI Reserve: Every 3 seconds, a SCSI Reserve command is sent to the LUN to make sure that only the owning node has ownership and can access that drive. If the test fails, a resource flag is set. The flag will be picked up by the LooksAlive mechanism. These timings cannot be changed by the user.
Private Sector: Every 3 seconds, the clusdisk.sys driver performs a read and a write operation on sector 12 of the LUN to make sure that the device can be written to. If this test fails, a resource flag is set, which will be picked up by the LooksAlive mechanism. These timings cannot be changed by the user.
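The interplay of the 3-second device checks and the 5-second LooksAlive check can be sketched with a little arithmetic. The Python below is a simplification under a loud assumption: both timers are taken to start together at t = 0 and to tick exactly on schedule, which a real cluster does not guarantee. The function names are hypothetical.

```python
import math

DEVICE_CHECK_INTERVAL = 3  # seconds, from the text; not user definable
LOOKS_ALIVE_INTERVAL = 5   # seconds, the default from the text


def next_tick(t, interval):
    """First multiple of `interval` at or after time t (timers assumed aligned at 0)."""
    return math.ceil(t / interval) * interval


def detection_time(failure_time):
    """Return (time the flag is set, time LooksAlive picks it up)."""
    flag_set_at = next_tick(failure_time, DEVICE_CHECK_INTERVAL)
    noticed_at = next_tick(flag_set_at, LOOKS_ALIVE_INTERVAL)
    return flag_set_at, noticed_at
```

Under these assumptions, a device that stops responding at t = 1.0 gets its flag set at t = 3 and is noticed by LooksAlive at t = 5, so detection_time(1.0) returns (3, 5).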
It is vital to know the differences between the checks. And it is especially important to know that the resource.dll is doing File System checks, not checks on the disk itself. As an example: your ‘disk’ can be 100% OK, but a File System corruption can cause the FindFirstFile() API to fail, and mark the ‘disk’ as failed.
Disks, Volumes, File Systems, Partitions and “Physical Disk Resource”
Working for a storage company, you quickly learn that:
“a Disk is not a Partition is not an NTFS-Volume”. These are different terms and, although tightly linked to each other, they each have their own ‘features’. You also learn that “you do not mount a disk”; in fact you “mount a Volume or File System”.
In Cluster terms the “Physical Disk Resource” is a combination of all of these including the mounting part, so a ‘failure’ in one of these different parts may offline or fail the full resource, so people call it “the disk has failed”. Sometimes that is true, but many other times it is not.
An example would be: if for some reason Cluster (OS/mountmgr) fails to mount the volume, cluster will fail the “physical disk resource”, although the disk is OK and the File System might be OK too. Still, in general people will call this “the disk has failed”.
So in short (hoping that I am not creating more confusion):
- A Disk contains 1 or more Partitions
- A Volume can be created on 1 or more partitions
- A Volume can be formatted with a File System, such as NTFS
- A Volume / File System is what you mount under a drive letter or mount point
A Cluster only supports Basic Disks, and with Best Practices in mind:
1 Disk contains 1 Partition, which contains 1 NTFS formatted Volume.
Although this is a 1-to-1-to-1 relationship, they are still three different terms, and each of them can cause its own problems.
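The layering described above can be written down as a small data model. This is only a sketch to keep the terms apart: the class names, the fields and the resource_state function are hypothetical and do not correspond to any cluster API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Partition:
    number: int


@dataclass
class Disk:
    partitions: List[Partition] = field(default_factory=list)  # 1 or more
    healthy: bool = True


@dataclass
class Volume:
    partitions: List[Partition]        # a volume sits on 1 or more partitions
    file_system: Optional[str] = None  # e.g. "NTFS" once formatted
    file_system_ok: bool = True
    mounted_at: Optional[str] = None   # drive letter or mount point


def resource_state(disk, volume):
    """Any failing layer fails the whole (hypothetical) Physical Disk resource."""
    if not disk.healthy:
        return "Failed: disk"
    if volume.file_system is None or not volume.file_system_ok:
        return "Failed: file system"
    if volume.mounted_at is None:
        return "Failed: mount"
    return "Online"
```

With a healthy disk and an intact NTFS volume that could not be mounted, resource_state reports "Failed: mount", which is exactly the case that usually gets reported as “the disk has failed”.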
In subsequent posts, I will use the terms Disk, Volume, File System and Partition, and I will keep explaining exactly what is meant by each of them, to ensure we all understand the complexity behind clustered Disks and their representation in Cluster as “Physical Disk Resources”.
I think that is it for today, back shortly with more about Disk Resources in Cluster.
I hope I didn’t bore you too much and if you made it this far with reading all this, there might be hope for me after all.
If anyone has any remarks, comments or corrections on the above, please let me know.
Another Cluster Blog ?
That is the question that was haunting me for the better part of last year. There are already good Cluster Blogs out there by MVPs, which pretty much cover all topics adequately. From newly introduced features to the inner workings of Geographically Dispersed Clustering, all is already covered.
Here they are
The blog from the cluster team in Microsoft:
Then the various MVPs (in no particular order)
Rodney R. Fournier (former MVP):
Nail Own (in German):
So why would I blog about clustering?
What can I write without repeating what has already been said?
Well, maybe there is a small subject, which I can publish, and that is the topic of troubleshooting, Cluster log analysis, and maybe “this is how it works” articles.
Saying that, and realizing that this is going to be dry, geeky and theoretical, I will probably lose 95% of the intended audience with my first blog, and most people are going to be “bored out of their brains”. However, if there is only 1 person out there whom I can help by blogging about troubleshooting, or 1 person out there who gets the “aha!” moment, then I have reached my goal with this blog.
So, cluster analysis and cluster log files: that is going to be the common thread in this blog.
Let me briefly introduce myself :
My name is Edwin van Mierlo, and for those who are wondering, that is a Dutch name.
I moved from Holland to the South West of Ireland about 11 years ago. Married, no children (yet), no pets other than my stack of PC’s in one of the spare rooms at home.
I am indeed a “geek”, as many who know me would undoubtedly agree; I love computers, technology and of course Failover Clustering!
I work for EMC (www.emc.com) in Customer Service and in my role I deal with Clusters on a daily basis. Hence the topic of Cluster troubleshooting and cluster log analysis. My specialty is Geographically Dispersed Clustering, same as another cluster MVP – John Toner, but as John is already blogging about this, and doing a great job, I will only blog about it if I get an interesting troubleshooting case.
Now that you know I am working for a storage company, it should be no surprise that I will blog about some disk problems. Having said that, I will not limit my topics to disk; I will take any real life example I am involved in. I will try to publish some “how does it work” articles as well, just to increase knowledge on internal cluster operations.
Just a little disclaimer. I publish information to the best of my knowledge, ability and experience; however, there is always the risk that the information I am posting is not accurate. When you find inaccuracies, please do notify me! For two reasons: first, so I can correct the post; second, so I can learn as well! Oh.. be gentle, this is after all my first blog.
If you have any questions, you can leave a response here, or you can find me (and all the other Cluster MVPs) in the clustering newsgroups.