Exchange Server 2007 SCC/CCR lessons learned
This past weekend I ran into a few issues with Exchange Server 2007 and wanted to share them, so anyone who hits the same problems won’t have to call Microsoft PSS and go through the fun (OK, not really fun…) that I went through.
Partition in Time with CCR
You have a partition in time, but what does that mean? You lost a node or the witness, and while it was down the remaining node/witness believed a change was made. When the down node/witness came back, it detected that a change had occurred and killed the entire cluster. This is by design. Now, how do you fix it?
From the ForceQuorum section of http://support.microsoft.com/kb/258078:
Function: When you use a Majority Node Set (MNS) quorum model on a Windows
Server 2003 cluster, in some cases a cluster must be allowed to continue to
run even if it does not have "quorum" (majority). Consider the case of a
geographically dispersed cluster with four nodes at the "primary" site and
three nodes at the "secondary" site. While there are no failures, the
cluster is a seven-node cluster where resources can be hosted on any node,
on any site. If there is a communications failure between the sites or if
the secondary site is taken offline (or fails), the primary site can
continue because it will still have quorum. All resources will be re-hosted
and brought online at the primary site.
In the event of a catastrophic failure of the primary site, however, the
secondary site will lose quorum, and, therefore, all resources will be
terminated at that site. One of the primary purposes for having a multi-site
cluster is to survive a disaster at the primary site; however, the cluster
software itself cannot make a determination about the state of the primary
site. The cluster software cannot differentiate between a communications
failure between the sites and a disaster at the primary site. That must be
done by manual intervention. In other words, the secondary site can be
forced to continue even though the Cluster service believes it does not have
quorum. This is known as forcing quorum.
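To make the arithmetic in the KB's seven-node example concrete, here is a minimal Python sketch of the majority rule. It is not part of the KB text; the function name `has_quorum` is made up for illustration, and the only assumption is that "quorum" means strictly more than half of the configured nodes.

```python
def has_quorum(nodes_up: int, total_nodes: int) -> bool:
    """Majority rule: strictly more than half of the configured nodes are up."""
    return 2 * nodes_up > total_nodes


TOTAL_NODES = 7        # four at the primary site, three at the secondary site
PRIMARY_NODES = 4
SECONDARY_NODES = 3

# Primary site alone (secondary offline or unreachable): still a majority.
print("primary alone:", has_quorum(PRIMARY_NODES, TOTAL_NODES))

# Secondary site alone (primary lost): no majority, so resources are terminated
# unless an administrator forces quorum.
print("secondary alone:", has_quorum(SECONDARY_NODES, TOTAL_NODES))
```

The same rule explains why an evenly split cluster cannot continue on either half: two of four nodes is not a majority.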
Because this mechanism is effectively breaking the semantics associated with
the quorum replica set, it must only be done under controlled conditions. In
the example above, if the secondary site and primary site lose communication
and an administrator forces quorum at the secondary site, resources will be
brought online at BOTH sites, thus allowing the potential for inconsistent
data or data corruption in the cluster.
Requirements: Forcing quorum is a manual process that requires that you stop
the Cluster service on ALL the remaining nodes. The Cluster service must be
told which nodes should be considered as having quorum.
Usage scenarios: Special care must be taken if and when the primary site
comes back because the nodes are configured as part of the cluster. While a
cluster is running in the force quorum state, it is fully functional. For
example, nodes can be added or removed from the cluster; new resources,
groups, and so forth can be defined.
Note The Cluster service on all nodes NOT in the force quorum node list must
remain stopped until the force quorum information is removed. Failure to do
so can lead to data inconsistencies OR data corruption.
Operation: Set up the Cluster service startup parameters on ALL remaining
nodes in the cluster. This is done by starting up the Services control
panel, selecting the Cluster service, and then entering the following in the
Start parameters option:
net start clussvc /forcequorum node_list
For example, if the secondary site contains Node5, Node6, and Node7, and you
wanted to start the Cluster service and have those be the only nodes in the
cluster, use the following command:
net start clussvc /forcequorum node5,node6,node7
Note There should be no spaces in the key (except where there are spaces in
the node names themselves).
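As a small illustration of the formatting rule in the note above (comma-separated node names with no spaces around them), the following Python sketch builds the start-parameter string. The helper name `force_quorum_args` is hypothetical; it only assembles text and does not start or touch the Cluster service.

```python
def force_quorum_args(nodes):
    """Build the '/forcequorum node_list' start parameter described in the KB text."""
    # Strip stray whitespace around each name so no spaces surround the commas.
    cleaned = [name.strip() for name in nodes]
    if not cleaned or any(not name for name in cleaned):
        raise ValueError("need at least one non-empty node name")
    return "/forcequorum " + ",".join(cleaned)


# The secondary-site nodes from the KB example.
print(force_quorum_args(["node5", " node6", "node7 "]))
```

Building the list this way simply avoids the easy mistake of typing "node5, node6, node7" with spaces after the commas.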
The only problem: I could not get the above commands to work on a 64-bit Windows Server 2003 R2, Enterprise Edition SP2 machine. I mostly got invalid-syntax errors. Here is what PSS told me to do:
1. We shut down one of the nodes, a true power off. We will call this the passive node.
2. We added the following value to this registry key on the surviving node (active node):
HKLM/System/CurrentControlSet/Services/Clussvc/Parameters
3. Replace nodenamea with the machine's name, such as exch2007nodea, where this is the node that is currently running.
4. We attempted to start the cluster service on the active (surviving) node and it started.
5. We then stopped the cluster service on the active (surviving) node and added nodenameb to the ForceQuorum data value on the surviving node.
6. We restarted the powered off (passive) machine.
7. We then started the cluster service on the active node and it started, with the ForceQuorum registry value containing both node names.
8. We attempted to start the cluster service on passive (with no parameters or registry changes) and it started.
9. We verified that the Cluster group resources were online.
10. We undid the registry changes by deleting the ForceQuorum value from the active node.
Exchange Server 2007 System Attendant fails to come online within a CCR/SCC cluster
After the cluster was up and running, the Exchange SA was not. The Application event log showed the following errors related to the Exchange SA failing to start:
Event ID 1011, 1030, 1003, and 1019 errors.
We found that a bug exists where the Exchange SA times out after 40 seconds when the default of 180 seconds is used for the resource.
We changed the value to 179 and the Exchange SA resource came online. This is scheduled to be fixed in SP1. This bug was confirmed for SCC & CCR Exchange Server 2007 Clusters.
Update from PSS: a link covering the first issue is here: http://technet2.microsoft.com/WindowsServer/f/?en/library/e70333db-5048-4a56-b5a9-8353756de10b1033.mspx. We are still waiting on the KB to be updated, though.