SQL Strikes Again
I learned today that SQL 2008 will actually not install by default in many multi-site cluster solutions. Why? Well, during the SQL 2008 installation, it runs through a Configuration Checker where one of the tests checks the environment to ensure that the cluster is configured properly. In a geographically dispersed cluster (specifically in an SRDF/CE cluster) this check will fail with the error:
The cluster on this computer does not have a shared disk available. To continue, at least one shared disk must be available.
Doing a little bit of research on this might lead you to the MSKB article 955780 where it states the following:
When you install failover clustering in SQL Server 2008, the node on which failover clustering is installed must own the resource group and the shared disks in that group. If the disk resource is not owned by the local node, if the disk resource is a cluster quorum disk, or if the disk resource has dependencies, the failover clustering installation will fail.
What? If the disk has dependencies it will fail??? Why on earth would you have this limitation, Microsoft? This is certainly a valid cluster configuration so why would SQL care about the granular details of the cluster configuration? Other than affecting just about all multi-site clustered solutions, this would also affect any installations that would also want to use mount points.
Microsoft’s solution for mount points is to just not set up your cluster properly and let SQL handle setting up the disk dependencies for you. The same workaround applies if you want to set up SQL 2008 in a multi-site cluster, but it is mind boggling that they would force you to do something like this.
I’ve submitted feedback to Microsoft regarding this issue. Feel free to rate/comment about this issue here:
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=366673
It’s not surprising that someone has already submitted this feedback regarding mount points also…where they gave the lame workaround to remove dependencies:
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=331910
Multiple Subnets with Windows 2008 Clusters
I’ve been playing around with Microsoft Windows 2008 clusters for a while now trying to test out some of the new features and how they will affect multi-site clusters. One of the best new features (in my opinion) in 2008 clusters is the ability to have cluster nodes on different subnets. Microsoft does this by introducing a new OR dependency option for resources. In 2008 clusters, you have two options when making a resource depend on more than one resource:
- AND. This indicates that both the resource in this line and one or more previously listed resources must be online before the dependent resource is brought online.
- OR. This indicates that either the resource listed in this line or another previously listed resource must be online before the dependent resource is brought online.
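To make the difference between the two operators concrete, here is a minimal sketch of how a dependency check could be evaluated. This is only an illustration and not how the cluster service is actually implemented; the `Resource` class, the `can_come_online` function and the IP addresses are all hypothetical names I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    online: bool


def can_come_online(dependencies, operator):
    """Return True if a dependent resource may start, given its dependencies."""
    states = [dep.online for dep in dependencies]
    if operator == "AND":
        # Every listed dependency must be online.
        return all(states)
    if operator == "OR":
        # Any one of the listed dependencies is enough.
        return any(states)
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical IP Address resources, one per subnet. Only the address
# that is valid for the current node's subnet is online.
ip_site_a = Resource("IP Address 10.1.1.50", online=True)
ip_site_b = Resource("IP Address 10.2.2.50", online=False)

network_name_or = can_come_online([ip_site_a, ip_site_b], "OR")
network_name_and = can_come_online([ip_site_a, ip_site_b], "AND")
print(network_name_or, network_name_and)
```

With an AND dependency the Network Name could never start on a node that can only host one of the two addresses, which is why the OR option matters for multi-subnet clusters.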
So with the OR dependency, you can set up your cluster Network Name resources using an IP address from multiple subnets. Let’s take a look at this in action. Here’s my configuration:

I’m using a subnet mask of 255.255.255.0 for all subnets. When you create your cluster or an application group, you are prompted to provide an IP address for each subnet. For example:

Here, you would enter a valid IP address for each subnet. Once this is complete, cluster will automatically set up these dependencies properly for you. Looking in the cluster GUI, you’ll see that only one of the IP resources will be online at a time as the other IP address is not valid for the node:

Looking at the dependencies, cluster automatically sets up this OR relationship for you:

You might also notice that the other cluster IP address actually FAILS while coming online. This will generate an error in the system event log every time the group is moved.

This is mildly annoying, but well worth it.
One issue that you will likely run into with multiple subnets is DNS replication. Upon failover to the other node, DNS records will be updated to point to the new IP address. While this is occurring, clients may not be able to connect to the cluster workload even though it is online. I’ve heard reports that this replication has taken up to an hour at some sites (yuck) so the cluster is effectively offline to the clients during this time.
If you’re setting this up in your environment, please leave me a comment and let me know how long this replication takes in your environment.
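If you want a rough number to report, something like the following sketch could be run from a client machine right after a failover. It simply polls name resolution until the cluster name resolves to the new address. The function name `wait_for_dns` and its parameters are my own invention for this example; the only real API it relies on is Python's `socket.gethostbyname`. Keep in mind that this goes through the client's local resolver, so cached answers can affect what you measure.

```python
import socket
import time


def wait_for_dns(name, expected_ip, resolver=socket.gethostbyname,
                 interval=5.0, timeout=3600.0, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until `name` resolves to `expected_ip`; return elapsed seconds.

    Returns None if the timeout expires before the new address shows up.
    """
    start = clock()
    while True:
        try:
            if resolver(name) == expected_ip:
                return clock() - start
        except OSError:
            pass  # name not resolvable right now; keep polling
        if clock() - start >= timeout:
            return None
        sleep(interval)


# Example call (hypothetical cluster name and address):
# elapsed = wait_for_dns("sqlcluster.example.com", "10.2.2.50")
```

The resolver, sleep and clock arguments are only there so the loop can be exercised without touching a real DNS server.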
SQL 2008 Team - Please Add multi-subnet support!
I was informed that feedback for Microsoft product groups should be done through the http://connect.microsoft.com page. So I've submitted a request to add support for clusters running multiple subnets here:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=353894
Please feel free to join in and rate/add your comments to my submission. Let the SQL team know that this is something that they need to address.
SQL 2008
One of the great new features of Windows 2008 Failover Clustering is the removal of the "same subnet" restriction. This is a huge feature for multi-site clusters as I personally know of several customers that did not implement a multi-site cluster specifically because of this limitation. Most of the folks that I spoke with at TechEd are excited about this new feature and are considering multi-site clusters specifically because of this change.
Anyway, at TechEd this year the SQL folks made it publicly known that SQL 2008 will NOT support clusters with nodes running on different subnets. Ouch! This is a mighty blow to anyone considering a stretched 2008 cluster as SQL is certainly one of the top cluster applications.
Personally, I'm dumbfounded by this and I don't understand how Microsoft allows a product to ship that doesn't support this feature in clusters. This issue was revealed to us at the MVP conference in March and my response was to obnoxiously shout at the SQL team, "Boooooooo, hissssssss!"
Hopefully, this will change quickly as more customers demand support for this feature in SQL.
EMC MirrorView Cluster Enabler Released!
If you’ve read through some of my previous posts, you might have picked up on the fact that this product was coming soon. EMC has finally released an SRDF/CE equivalent for the Clariion arrays utilizing MirrorView. Sure, we’ve got other products for other clusters, but this is the first that fully integrates with Microsoft clustering and MirrorView.
This first release (named 3.0 to fool you) has some limitations as this is our first crack at Clariion support. Here are some of the requirements/limitations:
- Support for Windows 2003 and 2008 clusters only. No 2000 support (thankfully)
- Requires Clariion arrays to run Flare 26 or higher. Some arrays have a max Flare level of 19…RPQ might be needed for these arrays
- Initial release supports MirrorView/S only. Future releases will look at adding MirrorView/A support, but initially it will be synchronous only
- Only supports non-disk based quorum models. So your cluster has to be MNS (Node Majority) or MNS w/FSW (Node and File Share Majority). Clariion arrays do not have the same locking capabilities to perform their own geographical arbitration, so disk quorums are not supported
The software is currently available for download on EMC’s Powerlink web site. The software is listed under “Cluster Enabler for Microsoft Failover Clusters” and the product guide and release notes are currently under “MirrorView” (though this may change).
Also, if you’re attending EMC World this week, make sure you see the Cluster Enabler sessions and chat with the developers about these new releases.
Cluster Enabler 3.0 Released
EMC recently released the latest version of SRDF/CE for MSCS. This new release has the following changes:
- New Name - The overall product line has been renamed to "Cluster Enabler" and the first product released in this line is SRDF/CE for MSCS. In the near future, MV/CE will be added to this family of products.
- Redesigned GUI - The SRDF/CE GUI has been completely re-written for this release. The GUI in 3.0 has been simplified and the wizards have been streamlined, making it easier to configure. The SRDF/CE GUI is more tightly integrated with Cluster Administrator, and functions like changing the Quorum model in SRDF/CE will also make these changes in MSCS (previous versions forced you to manually make this change).


- Multiple CE Cluster Management - The CE GUI lets you manage remote clusters and multiple clusters, similar to Cluster Administrator
- Windows 2008 support added - Version 3.0 adds support for Windows 2008 clusters.
- Windows 2000 support dropped - Good riddance
- Support for 5773 code added - Minimum microcode for this release is 5x70.
- Support for multiple Symms per cluster - You can now have multiple Symmetrix pairs in a single SRDF/CE cluster. Concurrent SRDF is now tolerated per the product guide.
- Site mode changes - The default behavior is now called “Restrict Group Movement” and this is basically a combo of the old “No New Onlines” and "Local Override" settings from previous releases. Basically, groups are allowed to come online where a disk is RW while the RDF link is down, but if the disk is WD, the CE resource will fail to come online. The other option is “Automatic Failover” and this feature is essentially the same as the old “SRDF Override” setting. The “Failstop” option is no longer listed, but it is configurable via CLI. The "Forced Failover" and "Local Override" values have been removed.
- Site Mode now set at the GROUP level - Previously, this setting was a cluster wide setting. Now, this value can be adjusted on the individual group level. This gives you greater flexibility on which groups you might consider putting at risk by enabling automatic failover during a site outage.
- SRDF/CE resource/registry changes - Most settings for SRDF/CE are no longer stored in the SRDF/CE\Config hive in the registry. Instead, these settings are now stored as private properties of the SRDF/CE resources in the cluster.
- Installation changes - CE installation now requires that MSCS be installed first before attempting to configure the cluster for CE...this is similar to the old "Convert MSCS to SRDF/CE" wizard. MSCS must be installed on at least one of the cluster nodes prior to configuring CE.
For more information, you can get a copy of the software, product guide and release notes on EMC's Powerlink website.
Clustering Webcasts on Demand
Microsoft gave some great webcasts related to 2008 clustering last month. If you missed any of these sessions, you can view the recordings on-demand and/or snag the PPT files here:
Building High Availability Infrastructures with Windows Server 2008 Failover Clustering
Webcast - http://go.microsoft.com/?linkid=8131531
PPT - http://go.microsoft.com/?linkid=7933793
Failover Clustering 101
Webcast - http://go.microsoft.com/?linkid=8146346
PPT - http://go.microsoft.com/?linkid=7933794
Failover Cluster Validation and Troubleshooting with Windows Server 2008
Webcast - http://go.microsoft.com/?linkid=8173778
PPT - http://go.microsoft.com/?linkid=7933795
Geographically Dispersed Failover Clustering in Windows Server 2008 Enterprise
Webcast - http://go.microsoft.com/?linkid=8187007
PPT - http://go.microsoft.com/?linkid=7933796
Deep Dive on Failover Clustering in Windows Server 2008 Enterprise Storage and Understanding Quorum
Webcast - http://go.microsoft.com/?linkid=8213972
PPT - http://go.microsoft.com/?linkid=7933798
FYI, in case you are not aware, the Microsoft Cluster team has re-started their blog. You should visit http://blogs.msdn.com/clustering so you don’t miss any additional updates from the cluster team.
Future Releases of SRDF/CE
It's been far too long since I've written a blog post, mostly due to the holiday rush. There's much going on in my world of geo-clustering with 2008 RC1 being released, and there's also been some development activity for SRDF/CE. I’ve been asked questions many times in various forms about the future outlook of geo-clusters and specifically, the longevity of a product like SRDF/CE. Some ask if EMC has any plans to support 2008 cluster and others ask if the product is dying due to new application features such as Exchange CCR or SQL mirroring. I’ll address this second part first as it’s easier to answer and I’m not sure how much I am allowed to disclose about future EMC product releases.
Some have suggested that Exchange 2007 CCR eliminates the need for clusters and SRDF/CE, so EMC’s product must be of the dying breed. In reality, I see an exact opposite effect occurring. With Microsoft promoting the idea of disaster recovery and multi-site clusters in their products, I see more customers looking at their current infrastructure and realizing that they have a need to implement some form of multi-site clustering solution. As they look at the major applications like Exchange and SQL, many customers will then realize that they have many other business critical applications/services in their existing clusters that they would also like to protect from a total site outage. Believe it or not, some customers use applications other than SQL and Exchange in a MSFT cluster…I see lots of Oracle, file & print and other custom apps in clusters. This triggers customers to look for a more robust and total multi-site clustering solution. This is why I feel that a product like SRDF/CE will be around for a long time to come.
As for future releases of SRDF/CE, more versions are planned and being developed. The next major release of SRDF/CE (version 3.0) will have support for 2008 server…I think this is pretty safe to say and I’m not really giving away any big company secrets. 2008 cluster is going to be HUGE for multi-site clustering due to the new AND/OR dependencies which remove the “same subnet” requirement that is in place for clusters today. I personally know of several customers that have dropped their goal of implementing a geographically dispersed Microsoft cluster due to this specific single subnet limitation. If you haven’t tested 2008 cluster, I would recommend grabbing a copy of 2008 RC1 from the Microsoft website and start poking around with this version to get used to some of the differences.
There are several MAJOR changes coming in this next release of SRDF/CE and one will be support for 2008 clusters. Another change will be another slight name change for the product. Version 3.0 will be dropping the SRDF part of the product name and will be named simply “EMC Cluster Enabler” or CE…rolls off the tongue a little nicer than SRDF/CE. The main reason for the name change is the addition of a little feature that I had mentioned in a previous post. I’ll write up some more about CE 3.0 in the near future.
What applications are supported by SRDF/CE?
I often get asked if SRDF/CE supports various versions of Exchange or SQL, especially just after a new version is released. If you are one of those that have asked, you have likely received my sarcastic, yet completely accurate answer that SRDF/CE does NOT support ANY applications other than MSCS. Hence the name, SRDF/CE FOR MSCS. There is only one application that SRDF/CE supports in a Windows environment and that is Microsoft Cluster Service.
If this seems too harsh, I can also accurately claim that SRDF/CE supports ALL applications that are supported in a Microsoft cluster. So if a new version of XYZ application comes out or if you want to know if a specific application is supported in an SRDF/CE cluster, my alternate answer is that SRDF/CE will instantly support this application/version with no qualification needed. I guess this sounds better to folks as there usually isn’t a follow up question after this answer.
I guess my point would be that the SRDF/CE product works to enable cluster disks so that they can be used over a distance with SRDF. SRDF/CE only cares about the disks and its own cluster resources. Any application that you wish to install in the cluster is up to MSFT and the application vendors to support. Anything that you can do in a normal cluster should be doable in an SRDF/CE cluster. The only real difference as far as cluster applications are concerned is that the failover time between nodes will be delayed while the SRDF magic occurs under the covers.
Quorum Arbitration in a Geographically Dispersed Cluster
Thanks to Oliver Simpson for his comments and his question…how does quorum arbitration work in a geographically dispersed cluster and how is this different between SRDF/CE and MirrorView. This is a great question and I hope I don’t bore you with too many details here.
There are many storage vendors out there and I’m sure that they all have their own method of performing box to box replication and controlling access to these mirrors. With EMC SRDF, the source device is labeled an R1 device and the remote device is considered an R2 device. SRDF pairs can be manipulated by users with the Solutions Enabler software to make either (or both) mirrors read/write enabled or write disabled. MirrorView labels one mirror the Primary mirror and the remote mirror the secondary mirror. MirrorView volumes can be manipulated using the Navicli software to promote or fracture mirrors, controlling the read/write access to the volumes. So far, the two are pretty much equal in their capabilities.
In a geographically dispersed cluster, we’re going to have one node of the cluster connecting to the R1 or primary mirror, while the other node attempts to access the R2 or secondary mirror. Typically, the R2/secondary mirror is either write disabled or not ready to the host while the cluster owns the resources on the R1/primary mirror. When the cluster needs to fail over to the remote site, you will need to issue commands to make the R2/secondary mirror read/write enabled before MSCS attempts to bring the disk resource online, and also make the R1 mirror write disabled/not ready.
For all other Physical Disk resources in the cluster (other than the quorum), we can use cluster resource dependencies so that the disks do not attempt to come online until the storage has been read/write enabled to the host. With SRDF/CE, we’ve created our own resource type to control these actions for all non-quorum resources and as I’ve described in a previous blog entry, you could create a generic application or script resource to control this behavior with MirrorView using Navicli commands.
The big problem comes in when we need to deal with a shared quorum disk in Microsoft cluster. The quorum disk CANNOT be made to depend on any other resources. Microsoft has done this so customers cannot shoot themselves in the foot by accidentally making their quorum disk depend on a faulty or mis-configured resource. When you attempt to make the quorum depend on a resource, you will receive an error message:
Additionally, if you do find a way to overcome this issue, you will next need to contend with the issue of the dreaded “split brain” syndrome. What happens if you have a network failure between sites and MSCS attempts to arbitrate for the quorum disk in a geographically dispersed cluster? You are accessing the quorum disk over two separate SCSI busses using two separate physical disk drives, so when MSCS issues a SCSI bus/target reset against the local quorum disk, this has no effect on the remote device so we’ve basically broken the quorum arbitration process. If some sort of mechanism is not in place, MSCS would be able to bring both copies online and this will likely cause a split brain to occur.
Because of these challenges, it is difficult to find a work around and have a shared quorum disk in a geographically dispersed cluster. Microsoft has given customers a way around this by introducing the MNS quorum, which is certainly a doable option...though it does have its limitations. See my previous entry for more information on this topic.
With SRDF/CE, we accomplish this by adding a secondary arbitration process whenever a standard Microsoft quorum arbitration event occurs. SRDF/CE adds a filter driver into the stack to detect quorum arbitration events. When an arbitration event occurs, the SRDF/CE filter driver halts the standard quorum arbitration process and completes its own arbitration before allowing the standard quorum arbitration event to continue.
The SRDF/CE arbitration process uses Solutions Enabler APIs to acquire Symmetrix locks to simulate a persistent reservation across the SRDF link. We release/acquire these locks during quorum arbitration in a fashion similar to Microsoft’s challenge/defense arbitration process. When a bus reset occurs from the challenger, we release locks and wait to see if the defender will re-acquire its locks and then fail if the node successfully defends. Once a node has successfully acquired these Symmetrix locks, we issue the appropriate Solutions Enabler commands to make the quorum disk resource read/write enabled to the host. Once the quorum is read/write enabled, the cluster quorum arbitration process continues and MSCS decides whether or not to allow the quorum to come online on the node.
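To illustrate the challenge/defense idea in general terms, here is a toy simulation. It is a sketch only: it does not use the Solutions Enabler APIs, it ignores all timing, and the `LockStore` class and `arbitrate` function are hypothetical stand-ins I wrote for this example rather than anything in the product.

```python
class LockStore:
    """Stand-in for an array-side lock visible from both sites (hypothetical)."""

    def __init__(self):
        self.owner = None

    def acquire(self, node):
        # Succeeds if the lock is free or already held by this node.
        if self.owner is None:
            self.owner = node
            return True
        return self.owner == node

    def release(self):
        self.owner = None


def arbitrate(lock, challenger, defender, defender_alive):
    """Return the node that wins arbitration under a challenge/defense scheme."""
    # Challenge: break the current reservation (analogous to the bus reset).
    lock.release()
    # Defense window: a healthy defender re-acquires before the challenger retries.
    if defender_alive:
        lock.acquire(defender)
    # The challenger only wins if nobody defended the lock.
    if lock.acquire(challenger):
        return challenger
    return defender


lock = LockStore()
lock.acquire("NodeA")  # NodeA currently owns the quorum

winner_healthy = arbitrate(lock, "NodeB", "NodeA", defender_alive=True)
winner_failed = arbitrate(lock, "NodeB", "NodeA", defender_alive=False)
print(winner_healthy, winner_failed)
```

The point of the sketch is that only the node holding the lock at the end of the exchange would go on to have the quorum disk read/write enabled, so both sites cannot bring their copy online at the same time.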
In the Clariion environment, the CX arrays do not have these locking mechanisms and the API is not as robust so we do not have the same capabilities. Therefore, using a MNS quorum is the only available option in the MirrorView environment.
I hope this helps to give some insight about the way that SRDF/CE handles quorum arbitration. Feel free to leave a comment if anything is unclear.
SRDF/CE Training
In a past life, I used to travel the globe delivering training classes to EMC personnel and a few select customers. This training was not widely known about and was only given to customers that begged for more knowledge about the product.
I am happy to announce that EMC is now offering a training class for SRDF/CE that is open to customers. Originally, this training was only available to employees, but we have since seen the error in our ways so this is now open to the public. The class is named “SRDF/CE for MSCS Implementation and Management” and details are available on https://learning.emc.com/Saba/Web/Main/goto/64726739 (PowerLink account required for access). Classes are running about once or twice a quarter in the US and Europe. Here’s a basic overview of the objectives covered in this class:
- Describe the benefits of SRDF Cluster Enabler for MSCS
- Implement the SRDF/CE product into a typical geographical clustered environment
- Transform an existing Microsoft cluster into a SRDF/CE configuration
- Restore application availability after a disaster occurs within a SRDF/CE for MSCS environment
- Perform ongoing administrative and operational tasks for SRDF/CE clusters
- Perform SRDF/CE for MSCS troubleshooting
It’s a 5-day instructor led class with many hands-on labs. I helped to develop this class and I think we've created a pretty solid training experience. The class is based off my original training, though it has been updated for the more current releases of SRDF/CE. I would highly recommend attending this training if you’ve got SRDF/CE in your environment.
MirrorView/CE
Continuing my topic about geoclustering on a Clariion, we've recently discovered that my previous scenario that I tested can actually be FULLY SUPPORTED by Microsoft and EMC! Thanks to Edwin for pointing out the following entry in the Windows Catalog:
http://www.windowsservercatalog.com/item.aspx?idItem=4f28600a-1184-77a3-5768-7091f50227dd
You might notice that this entry lists a product called MirrorView/CE v1.0 with MNS (sounds familiar). I've found that this is not an actual EMC product, but instead is a set of scripts that control the MirrorView LUNs. EMC requires that customers submit an RPQ for support for this type of configuration, but they certainly will support it. The local TSG resources can be commissioned to write a custom script that performs the necessary activities to facilitate failover. This is currently only supported using MirrorView/S, though I'm sure that it could also be supported with MV/A.
This is pretty cool stuff that's been a long time coming.
Majority Node Set Clustering
When I was first introduced to MNS, I hated this little feature of Windows 2003. In my opinion, MNS has created some confusion in the marketplace as it has been positioned (incorrectly) by some as “the” solution for geographically dispersed clustering. I’ve seen many posts over the years in the newsgroups from folks that have set up their MNS clusters and now want to know how to make their cluster work without shared storage. News flash: MNS does not mean that you do not need shared storage for your cluster. MNS only means that the quorum no longer requires a physical disk resource; the same rules still apply for the rest of your clustered resources. Want to have a print spool or DTC resource? Sorry, you still will need a physical disk resource in your cluster. If you’ve got an application that does not require shared data, then maybe MNS is the solution for you but most cluster applications will have a shared disk requirement.
It is my belief that no one in their right mind would use MNS in a HA environment that is geographically dispersed…unless you plan to span 3 sites. Why would I say this? Well, if the goal of HA in your environment is to maintain uptime, why would you introduce a “feature” that will guarantee a total cluster outage if half of the cluster is suddenly unavailable? When you look at geographically dispersed clustering, you’re typically looking for a solution that can help you survive a total site outage…why else would you spend the time/money on a geo-cluster? With MNS, chances are high that your whole cluster is going down in all site disaster scenarios. Let’s take a look at some of these scenarios with MNS:
Scenario 1 - Primary site has 2 nodes and DR site has 2 nodes. If you lose either site, you will lose the entire cluster since no site can ever have a majority in this situation. This will only happen to you once and the lesson learned here is that we never want to have an even number of nodes with MNS.
Scenario 2 – Primary site has 3 nodes and DR site has 2 nodes. Your cluster can now survive the outage of the DR site, but cluster will not survive an outage of the primary site…which sort of defeats the whole purpose of having a DR site in the first place.
Scenario 3 - Primary site has 2 nodes and DR site has 3 nodes. Your cluster can now survive the outage of the Primary site, but now the cluster will not survive an outage of the DR site…which again seems to defeat the whole purpose of having a DR site when your DR site causes an outage of your production cluster applications.
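The arithmetic behind these three scenarios is just a strict majority check, which can be sketched in a few lines. The function and variable names here are mine, chosen for the example; the only assumption is the usual MNS rule that the surviving nodes must number more than half of the total.

```python
def has_majority(surviving_nodes, total_nodes):
    """A node set keeps quorum only with strictly more than half of all nodes."""
    return surviving_nodes > total_nodes // 2


# (primary site nodes, DR site nodes) for the three scenarios above.
scenarios = {
    "Scenario 1": (2, 2),
    "Scenario 2": (3, 2),
    "Scenario 3": (2, 3),
}

results = {}
for name, (primary, dr) in scenarios.items():
    total = primary + dr
    results[name] = {
        # If the primary site is lost, only the DR nodes remain, and vice versa.
        "survives_primary_loss": has_majority(dr, total),
        "survives_dr_loss": has_majority(primary, total),
    }

for name, outcome in results.items():
    print(name, outcome)
```

No two-site layout comes out with both values true, since whichever site holds the majority takes the cluster down with it when it fails.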
Some will argue that in each of these scenarios, you can MANUALLY get your cluster up and running if you use the FORCEQUORUM procedure…which I do not deny. At least you do have some capability to get a somewhat working cluster up in these DR scenarios, even if it is a manual solution. There’s another HUGE gotcha here that is not often talked about or documented well. After you’ve started your cluster using /forcequorum, when the other nodes come back online, these nodes CANNOT join back into the cluster. In order to get your cluster back up and running again, you need to TAKE DOWN THE ENTIRE CLUSTER and then start all of the cluster nodes normally with no flags. Of course you can plan this downtime but no one ever wants to hear that their whole cluster has to go offline. This seems to defeat the entire purpose of HA when you are requiring a cluster shutdown to recover from your recovery procedure.
Based on the above scenarios, you might start to see why I would make the claim that no sane cluster admin would use MNS in their environment. If you substitute a shared disk quorum in any of these scenarios, the cluster would survive the outage as long as any one node survives. Also, a total cluster outage is not required to get the other nodes back into the cluster.
I think I’ve done enough bashing of MNS so let’s start to look at some of the good points. One of the major distance limitations of geographically dispersed clustering is Microsoft’s requirement that the quorum disk be replicated synchronously. From KB 280743, “The quorum disk must be replicated in real-time, synchronous mode across all sites.” This limits the possible distance for your geo-cluster solution based on the replication technology you are using…with EMC’s SRDF, this limit is approximately 200km for SRDF/S. Well, if you use a MNS quorum instead of a disk quorum, you are no longer limited by the requirement of synchronous replication technology. With MNS, your only limit is the network latency requirement, and even this has some flexibility now with the introduction of hotfix 921181. So this is one key reason why you might consider using MNS over a shared disk quorum resource. If you are looking for an extended distance geo-cluster solution, MNS is the only way to go.
Another new feature introduced in hotfix 921181 is the ability to use a File Share Witness (FSW) node with 2-node MNS clusters. This allows you to have a file share anywhere in the network (doesn’t need to be on the same subnet!) and this node will be used as the decision maker when one of the nodes fails. You could even setup this FSW as a clustered file share resource in a separate cluster giving another level of protection to this decision maker. The downside to FSW is that it currently only works in 2-node clusters. In Longhorn, this will change but today the feature only works in 2-node clusters. If you add a third node to your cluster, the cluster will ignore the FSW settings. Another minor downside is that the FSW share does not contain a full copy of the CLUSDB so you could not restore a cluster registry hive using the data from the FSW. The FSW is only used to help make the decisions during quorum arbitration and does not contain the clusdb info.
Another place where MNS might be the better option would be the geographically dispersed cluster that spans three sites. A three site synchronous disk replication of the quorum disk will prove to be a challenge for any vendor. Based on the quorum disk’s synchronous requirement, you’re also going to need 3 sites that are all within 200km of each other which may also prove to be a challenge. With three sites, I typically would picture having two Primary sites replicating data to a single DR site. Each of the primary sites would have replication running between themselves and the DR site, but no replication between the primary sites. This sort of configuration would only be possible with a MNS quorum. In this three site scenario, you could survive the failure of any single site and keep the other two up and running.
So overall, I don’t hate MNS nearly as much as I used to. I can see that it has its place, and can see some benefits in specific scenarios.
Geographically Dispersed Clustering with Clariion
Thanks to Ira Pfeifer for jump starting this topic…I would’ve hit this subject eventually as this has been a rather frustrating experience that I’ve gone through over the years. One of the things that I’ve been pushing for many, many years at EMC is to develop a “GeoSpan” equivalent product for the Clariion environment. In my opinion, our main target audience for CX arrays is the Windows server environments so it seemed like a logical progression for the GeoSpan product to release a similar product for the Clariion line. Symmetrix has SRDF, Clariion has MirrorView. Seemed like a no-brainer to me.
The first sign that this was NEVER going to happen was the name change from GeoSpan to SRDF/CE…SRDF = SYMMETRIX Remote Data Facility so it would’ve been silly to name a product “SRDF anything” if we had plans to incorporate the CX arrays. Next sign was the level of resistance from product management. The PMs kept ignoring my pleas and finally, they asked that I provide a business case scenario showing the market for such a product…this was management’s way of politely telling me to go away. They were fully aware that I was a technical resource and it would be unlikely for someone like myself to provide the necessary logistical information to pursue this further.
Unfortunately for them, I actually do have the vision and desire to see such a thing through and could not be swatted down by some menial paperwork. I found a dozen different customers that had requested exactly this type of product and I submitted a pretty decent case to the different product groups. This got some attention and the developers from the MirrorView team were engaged by the SRDF/CE team to research the feasibility of such a software product. The outcome was that the MirrorView APIs were not capable of handling the types of commands necessary for a GeoSpan equivalent product and it would require a total rework of the MirrorView coding to make this functional. In the end, this idea was rejected even though this is really something we should be providing to our customers.
With that said, it is still possible to set up a geographically dispersed cluster using Clariion with MirrorView on Windows 2003. One of the main reasons we were told that MV/CE (my name for the SRDF/CE equivalent for MirrorView) couldn't be done was that MV couldn't handle the quorum arbitration needed to support a geocluster. If you use an MNS quorum, there's no need for MV to get involved with the quorum arbitration. You could build an MNS cluster and then create a generic script or application resource to control the MirrorView failover. You would create your physical disk resources and make them depend upon these generic app/script resources and voila…you now have a geographically dispersed MirrorView cluster. If you’re not familiar with NaviCLI commands, I’ve been told that you could contact EMC professional services to help you develop these scripts...some customers are actually doing this today in their environment.
I spent a little bit of time trying this last week and I successfully set up this type of configuration in my lab environment so I know that this is possible. Now whether or not such a thing would be supported…from an EMC perspective, you would need to submit an RPQ for support from EMC. For MSFT support, the configuration would need to pass HCT testing and then be submitted to MSFT for review. I do not think there are currently any MirrorView clusters in the Windows Catalog.
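The dependency idea can be sketched in a few lines. This is only an illustration of the ordering, not the actual cluster configuration or any NaviCLI scripting: the resource names are hypothetical, and in a real cluster MSCS itself walks the dependency tree. The point is simply that the script resource that fails over the mirror must come online before the disk that sits on it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical resource names. Each key depends on the resources in its set,
# so those must be online before the key can come online.
DEPENDENCIES = {
    "MirrorView Failover Script": set(),
    "Physical Disk": {"MirrorView Failover Script"},
    "Application": {"Physical Disk"},
}


def online_order(dependencies: dict) -> list:
    """Return the resources in an order that brings every dependency online first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, resource in enumerate(online_order(DEPENDENCIES), start=1):
        print(step, resource)
```

On failover to the other site, the script resource runs first (this is where the mirror would be promoted), and only then does the physical disk resource try to come online.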
Another possibility for the Clariion environment would be to use NSI’s GeoCluster or DoubleTake to create host-based mirrors over IP. This is something I plan on testing myself in the next couple of weeks. Stay tuned for those updates.
EMC SRDF/CE for MSCS
Let me first discuss my pet project, SRDF/CE for MSCS. What is SRDF/CE you ask? Well, it is EMC’s geographically dispersed clustering software. SRDF/CE is an EMC proprietary product that is used to integrate EMC’s SRDF technology with MSCS. Basically, it is the glue that combines HA from MSCS with the DR capabilities of SRDF remote mirroring to form a true geographic clustering solution. Upon failover, SRDF/CE issues the necessary SYMCLI commands to automatically RW enable the devices at the DR site, which then allows the disks to come online at the DR location. This provides a fully automated DR solution with high availability.
It’s entirely possible that you work for a company that owns EMC hardware and you’re not aware of this little product. I’d love to hear from you if this is true. For some unknown reason, the marketing folks at EMC have never given this product any respect so it wouldn’t surprise me at all that you haven’t heard of this. One clear example would be to take a look at the EMC web site and attempt to find some information about this product. There used to be a good amount of info available for GeoSpan, but since the product “evolved” into SRDF/CE, the marketing has shifted away from the product. It really is upsetting to have spent so much time working on a project and then see that your company does not support the work that you’ve done. Am I bitter? I suppose so. I honestly don’t know how this product has survived and grown in popularity over the years.
SRDF/CE was originally called GeoSpan for MSCS. Many moons ago, the company decided to change the name to align with other product naming conventions and this became SRDF/Cluster Enabler…rolls off the tongue, doesn’t it? Anyway, some good and some bad came with this name change. The good was that we added some wizards to help with the install. The GeoSpan install was a nightmare and practically could never be installed properly without the assistance of someone that had previously installed the product. Also, we got rid of some of the ugliness of GeoSpan which really wasn’t a GUI. We made the SRDF/CE GUI an MMC snap-in so it was now certainly a GUI product. The bad…well, we added some additional checking which caused big performance issues on the RDF links. This is one issue that still plagues us today and has left a bad impression at a few customer sites. Other bad news, we added some instability to a product that is supposed to be used in an HA environment. Early versions of SRDF/CE were infamous for causing database corruption which triggered unnecessary failures in a cluster environment.
Over the past couple of years, SRDF/CE has significantly improved its stability and performance. We still need to tweak the registry and some other software settings to get the RDF link performance just right, but we’re definitely releasing a better product these days.
A few years ago, someone (I think it was a customer) asked me if I would use SRDF/CE in my environment if I were an EMC customer. I had to think really hard about this one since I work in a support center and really only see the bad side of our products…believe it or not, customers never call in to tell us how well things are running. :) My answer at that time was something like since I knew the ins and outs of the product, I would’ve been comfortable implementing this in my own environment. I don’t know if that’s a very comforting endorsement, but it was the truth.
Today, I really do enjoy the SRDF/CE product and would easily recommend implementing this into any customer’s site…including my own were I an EMC customer. The newer versions (versions 2.1 or higher) have really made significant improvements over the first couple of releases and continue to evolve for the better.
Welcome to my blog!
I’ve finally decided that it’s time to start a blog. Why? Well, it wasn’t only a bandwagon thing…I thought that this might help to keep me motivated to do some of the things that I’ve been intending to do for many, many moons. I hope that if I put these plans in print, it might actually help force me to complete some if not all of these goals…and also I want to be cool like Rod and Russ <g>
About me
The main focus in my work life for the past seven years has been geographically dispersed clustering at EMC. So far, I have focused on the GeoSpan and SRDF/CE for MSCS products. I’m hoping to expand my horizons a bit, try some other products, and compare and contrast the technologies.
I’ve worked at EMC Corporation since 1999. My main focus at EMC has revolved around their software products for Microsoft operating systems. I started as a “New Product Analyst” where my job was basically to learn the new products and then police the release of the product and fill in the gaps as necessary. One of those new products was a little gem called GeoSpan for MSCS. Little did I know that this would be a life altering project that I would never truly escape. It was the only project I was assigned to at the time so I spent ALL of my time testing this product.
I had never worked with Microsoft clusters before so this was totally new ground for me. I worked with our SVT guy, George, for months installing and testing the code. We tried every possible scenario with the available hardware we had and found many different issues with the code. We delayed the release of the product for over six months because of the issues that George and I uncovered. By the time the product was released, I was truly an expert with this software and knew every little detail (minus the coding).
Since I was now one of two non-engineering GeoSpan experts in the company, I was called on frequently for assistance with implementations at customer sites across the globe. It was because of this field experience that I really became known as a God of GeoSpan. As popularity increased, I was also called upon to deliver training sessions around the world to internals as well as customers.
As new versions were released, I would again test, implement and teach the new versions. Today, I am still constantly asked to help teach and implement, though my job officially is to support the product. I suffered some burnout from all the traveling and I switched into a tech support role so I wouldn’t have to travel as much. The travel has slowed, though I am still occasionally teaching.
During my downtime, I spend a lot of time in the Usenet newsgroups related to clustering. I read the questions posted on the newsgroups and would do a little research to come up with answers. I always found it amazing that people would be wise enough to use newsgroups to seek out assistance, but didn’t have enough common sense to search Microsoft’s knowledgebase for the answers. I always made a point to include a link to the KB in my posts so they might take the hint. Anyway, I found it fun to help out people and I admit that I would also do some self-promotion of “my” products. I try to keep an open mind and not be too biased, but I can’t help myself at times.
Anyway, I spent a few years helping out in the newsgroups and one day I got an email announcing that I had been nominated as a Microsoft MVP. I was awarded the MVP for Windows Server – Clustering on Jan 1st, 2004. I’ve been re-awarded each year since because of my work in the newsgroups. I truly do not feel worthy of this award, but who am I to turn down a free gift from Bill Gates :)
To this day, I don’t know how I was nominated or who nominated me, but I do really appreciate it. Seeing the work done by my fellow MVPs is a great way to keep motivated and makes me want to do more for the community.
Goals
Some of the things I’d like to begin working on are the following:
- I’d like to start my own website dedicated to geographically dispersed clusters. I’m not sure how different this will be from this blog, so for starters I’ll just blog about it and see where this leads.
- I’d like to test and become more versed in other geo-cluster products. I’d eventually like to compare and contrast the strengths and weaknesses of each solution.
- I’m very interested in testing the new geographic capabilities of SQL and Exchange.
- I have every intention of testing Longhorn beta 3 (when released). I’m very excited about the new features coming for LH and of course I haven’t had the time to even consider looking at it yet.
- I’ve also been considering writing a book about geographic clusters, but figured I needed to broaden my horizons first. Also, after reading about the horrors that Russ has gone through, maybe I'll just stick to blogging.