I’d like to share some of the things I look at while doing a health check on a server. It’s funny how few resources there are out there on the Internet. I believe people keep this kind of stuff to themselves because they are scared they are going to miss something and will never live it down. My response to that is, so what! Heck, I don’t claim to know it all, but why not share what I do know, and maybe others can share via the comments!
When I’m troubleshooting I like to compartmentalize what I'm looking for, and my health checks are set up the same way. I also believe health checks are quick snapshots of the health of a server. Sure, there are tools that you can use to analyze systems further, but in this case we are doing a quick health check. Not all of these checks need to be done every time, but some should; you get to decide.
Occasional high CPU spikes are OK as long as you are aware of the process causing them. A server should not maintain 80% CPU utilization for an extended period of time. If it does, it may be time to upgrade. It's a good idea to keep Task Manager open for the duration of your troubleshooting to see trends.
Check CPU Usage
Open Task Manager
Check the Processes tab, ensure there are no processes consuming excessive CPU
Check the Performance tab, ensure no single CPU has excessive usage
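To make the difference between a spike and a sustained load concrete, here is a minimal Python sketch. It assumes you have already collected CPU utilization samples (percent values taken at a regular interval) by whatever means you like; the function name and the `min_consecutive` parameter are my own illustration, not part of any tool mentioned here.

```python
from typing import Sequence


def sustained_high_cpu(
    samples: Sequence[float],
    threshold: float = 80.0,
    min_consecutive: int = 5,
) -> bool:
    """Return True if `min_consecutive` back-to-back samples are at or above `threshold`.

    A short spike resets the counter as soon as usage drops again, so only
    an extended period of high utilization is reported.
    """
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

With samples taken once a minute, `min_consecutive=5` means "at or above 80% for five minutes straight". Pick whatever window counts as an extended period in your environment.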
Check CPU HW
Open Device Manager (right click computer –> Manage)
Ensure that no CPUs have a red X or yellow ! under Processors
This is one area you may not want to cover in a quick health check, but it is something you should be familiar with. Task Manager only gives you basic info on processes, and you will find that you may need to dig a bit deeper. For that I recommend Process Monitor from the great Sysinternals tools. Process Explorer can also be used. In fact, download and play with all these tools; they will save your bacon, I guarantee it.
Copy Process Monitor locally, then launch it.
- Analyze each process and watch which operations it performs on registry keys, files, etc.
Copy Process Explorer locally, then launch it.
- Analyze each process based upon the number of threads, handles, loaded DLLs, etc.
Two great webcasts can be viewed here to see these types of tools in action.
A general rule of thumb is to make sure overall memory utilization does not exceed 80% within a given period of time.
Check Memory Availability
- Open Task Manager
Select the Performance tab
Look at the Physical Memory box, and multiply the total memory by 0.2
If the available memory is less than this number, the box is currently using more than 80 percent of its memory.
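The arithmetic above is simple enough to sketch in a few lines of Python. This assumes you read the Total and Available figures off the Performance tab yourself; the function name is only for illustration.

```python
def memory_over_80_percent(total_mb: float, available_mb: float) -> bool:
    """Apply the 20 percent rule: less than 20% available means more than 80% in use."""
    floor_mb = total_mb * 0.2  # the minimum amount that should still be available
    return available_mb < floor_mb
```

For example, on a box with 4096 MB total the floor is 819.2 MB, so 700 MB available means the box is over the 80 percent mark, while 1024 MB available does not.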
Current utilization by process
Select the Processes tab
Check the ‘show processes from all users’ box in the bottom left corner
Click the column header ‘Mem Usage’ to sort the processes by memory utilization, highest to lowest. This will help you determine what processes are currently utilizing the memory on the box and can help you narrow your search for memory intensive processes.
Check NIC HW
Verify both ends of the network cable are securely seated in the port
On the back of the server verify you have a green blinking link light on the NIC port
Verify NIC HW is working properly by using Device Manager and ensure the active NICs are showing green
Verify gateway, IP, subnet mask, DNS, DNS suffixes, etc. are properly configured.
If everything is properly configured and HW is working, you should be able to get a ping response from the gateway.
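One configuration mistake that stops the gateway ping cold is a gateway that is not on the same subnet as the server's IP address. The sketch below checks that with Python's standard `ipaddress` module. The function name is hypothetical, and it only validates the addressing math, not whether the gateway actually answers.

```python
import ipaddress


def gateway_on_local_subnet(ip: str, subnet_mask: str, gateway: str) -> bool:
    """Return True if `gateway` falls inside the network defined by `ip` and `subnet_mask`."""
    # strict=False lets us pass a host address rather than the network address.
    network = ipaddress.ip_network(f"{ip}/{subnet_mask}", strict=False)
    return ipaddress.ip_address(gateway) in network
```

Feed it the values you see in the TCP/IP properties; a result of False is a good hint to recheck the mask or the gateway before blaming the hardware.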
Check Network Connections
Here are some other checks you should perform to ensure proper network connectivity:
ipconfig /all will display all your TCP/IP settings, including your MAC address
ipconfig /flushdns will flush your DNS resolver cache
ipconfig /displaydns will display what is in your DNS resolver cache
netstat -an will show all the connections and listening ports on a machine
nbtstat will show NetBIOS over TCP/IP connection statistics
tracert <IP or DNS Name> will show you the path the packet takes, the routers along the way, and the response time for each hop
pathping <IP or DNS Name> combines ping and tracert. It pings each hop 100 times and is great for testing WAN connectivity
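The output of netstat -an can get long on a busy server, so it can help to summarize it. Below is a rough Python sketch that counts TCP connections by state. It assumes the plain four-column layout that netstat -an prints on Windows (no -o or -b switches), and the sample text is made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP rows by their State column in `netstat -an` style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have four fields: protocol, local address, foreign address, state.
        # Header rows and UDP rows (which have no state) are skipped.
        if len(fields) == 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


# Hypothetical sample output, not captured from a real server.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    0.0.0.0:445            0.0.0.0:0              LISTENING
  TCP    10.0.0.5:3389          10.0.0.20:51234        ESTABLISHED
  TCP    10.0.0.5:80            10.0.0.31:50211        TIME_WAIT
  UDP    0.0.0.0:123            *:*
"""
```

A pile of connections stuck in one state is worth a closer look, though what counts as normal depends entirely on the workload.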
All kinds of bad stuff can happen when your disk space is filling up. The best way to alleviate this is to write a script to notify you when you reach a certain threshold. In a future post I'll share a method for you to do just that. However, if there is a problem and you need to perform a health check, here is how you check the space the old-fashioned way.
To check disk space manually:
Right click on My Computer and select Manage
Select Disk Management
Validate that each disk has more than 10 percent free space
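The same 10 percent check can be sketched with Python's standard library. To be clear, this is not the notification script mentioned above, just one possible way to express the threshold test using `shutil.disk_usage`; the function names and the default threshold parameter are my own.

```python
import shutil


def free_percent(path: str) -> float:
    """Return the percentage of free space on the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against odd pseudo-filesystems
        return 0.0
    return usage.free / usage.total * 100


def low_on_space(path: str, min_free_percent: float = 10.0) -> bool:
    """Return True if the volume has less than `min_free_percent` free."""
    return free_percent(path) < min_free_percent
```

On a Windows server you would call it once per drive, for example `low_on_space("C:\\")`, and raise whatever alert suits you when it returns True.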
Event logs can reveal a more historical perspective on what is going on with the system and applications. When troubleshooting with event logs, query the System or Application log and look for events that have a timestamp near the time of the issue you are troubleshooting.
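Events have 3 categories in the Event Viewer:
Information: Noted with a blue ‘i’ icon. These record normal operations, such as a service starting successfully, and are mostly useful for building a timeline.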
Warning: Noted with a yellow icon and exclamation point. These are usually worth looking up because they serve as predictive indicators of future failures, such as disk space running low, DHCP IP address lease renewal failures, etc.
Error: Noted with a red circle icon and ‘x’. These are indications that something has failed outright and are a good starting point for troubleshooting.
When looking at event logs, use the information to determine the following:
Also make sure you take a look at EventCombMT from Microsoft. This tool allows you to search the logs of multiple machines. The benefit is that you can see whether a specific error or warning message is also occurring on other systems, which can help rule out issues.
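The idea of looking for warnings and errors with a timestamp near the time of the issue can be sketched in Python as below. It assumes you have already exported events into some simple records; the `Event` type, its field names, and the 15 minute default window are all made up for this example and do not match any particular export format.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Sequence


class Event(NamedTuple):
    """Hypothetical, simplified view of an event log entry."""
    timestamp: datetime
    level: str      # "Information", "Warning" or "Error"
    source: str
    message: str


def events_near(
    events: Iterable[Event],
    incident_time: datetime,
    window: timedelta = timedelta(minutes=15),
    levels: Sequence[str] = ("Warning", "Error"),
) -> List[Event]:
    """Return events of the given levels logged within `window` of `incident_time`."""
    return [
        e for e in events
        if e.level in levels and abs(e.timestamp - incident_time) <= window
    ]
```

Running the same filter against exports from several servers is a cheap way to see whether the problem is isolated to one box.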
Troubleshooting services should be limited to the specific service that is affected by the problem being troubleshot. Each server will have specific services depending on the types of applications running. You should document how your servers' services are configured and compare that to the server in question to see if anything is not configured correctly.
Servers that host applications and services that require high availability should be clustered so that if one node fails the other can pick up the workload. Clustered servers need the same type of health checks as stand-alone systems except you will want to check on the health of the cluster.
Check Cluster Resource Status
Open Cluster Administrator: Log onto server, select Start –> Run –> cluadmin
Check the Resources and ensure all are Online
If Cluster Administrator does not open, ensure that the Cluster Service is running on the node.
Cluster resource status can also be checked from a remote server. From a command prompt, just type: cluster /cluster:<cluster name> res
Client Side Health
Right click on My Computer, select Manage
Open Device Manager
Drill down to SCSI and RAID Controllers, verify that the HBA HW is visible and does not show any errors
If it does not show up in Device Manager, you may need to re-scan for the HW, re-seat the fiber card, or re-install the driver.
If the HBA is showing healthy in Device Manager, open the tool that you use to view configuration and settings for the fiber card and verify there aren’t any transmit/receive errors on link statistics or counters
Make sure fiber is properly connected to each switch
Make sure switch has no errors
If you’re using zoning verify it is properly configured
Check Fiber and SAN Connectivity
Log onto the SAN appliance and verify that the SAN is in generally good health and that no major errors are present for the controllers, loops, switches, or ports.
Ensure that the LUNs are presented to the servers in the cluster
Some applications will require you to spread the load across multiple servers. Web servers are a very popular choice to network load balance. As with clusters we will need to check the status of the load balancing.
Check NLBS Status CMD Line
From a command prompt on the local system, run ‘wlbs query’. This will give you the convergence status of the local node with the nlbs cluster.
Other useful NLBS commands: wlbs stop (stops nlbs), wlbs start (starts nlbs), wlbs drainstop (drains node)
Check NLBS Configurations
Open up the network properties –> Network Load Balancing, right click & select Properties
On the Cluster Parameters tab, verify that the IP address is configured for the shared NLBS IP and that the subnet mask, domain, and operation mode are configured correctly.
On the Host Parameters tab, make sure each node of the cluster has a unique host identifier. Also verify the IP and subnet mask are configured for the local values.
Also make sure that your switch has a static ARP entry if you are using multicast NLBS. The entry should be the virtual MAC of the cluster. To get the virtual MAC of the cluster, you can run the following command: WLBS IP2MAC <virtual IP address>
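If you want to sanity check the value before asking the network team for that ARP entry, the virtual MAC can also be worked out by hand. The Python sketch below assumes the commonly documented NLB convention: a 02-BF prefix for unicast mode and 03-BF for multicast mode, followed by the four octets of the virtual IP in hex. IGMP multicast mode uses a different scheme and is not covered. Treat the output of WLBS IP2MAC as the authoritative answer; the function name here is my own.

```python
import ipaddress


def nlb_virtual_mac(virtual_ip: str, multicast: bool = False) -> str:
    """Derive the NLB cluster MAC from the virtual IP (assumed 02-BF / 03-BF convention)."""
    octets = ipaddress.IPv4Address(virtual_ip).packed  # four raw bytes of the address
    prefix = "03-BF" if multicast else "02-BF"
    return prefix + "".join(f"-{octet:02X}" for octet in octets)
```

For a virtual IP of 192.168.1.100 this gives 02-BF-C0-A8-01-64 in unicast mode and 03-BF-C0-A8-01-64 in multicast mode.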
To healthcheck name resolution, open a command prompt and enter the following
Verify that the servername is correctly entered in DNS
If a record does not show up in the DNS query, or maps to a different name, perform a reverse lookup by IP address to see what name is associated with the IP address: nslookup <IP address>
If no name shows up associated with the IP address, log into the domain controller and check the DNS records for this particular name/ip address
From a Domain Controller go to start–>run–>dnsmgmt.msc
Expand the Forward Lookup Zones
Expand the primary zone that holds the records for the system(s) you are troubleshooting
Validate that the record exists. If it does not exist, manually enter the record name and IP address by right clicking on this same zone
Select New Host (A)
Enter the name and IP address
Check the box next to Create associated pointer (PTR) record
Click Add Host
Additionally, log back into the node that you manually entered the record for and ensure that it is set to register itself in DNS
Right click on the My Network Places icon on the desktop and select Properties
Double click on the primary adapter
Highlight internet protocol (TCP/IP) and select properties
Validate the IP addresses of the DNS servers are correct
Select DNS tab
Make sure the box is checked next to Register this connection’s address in DNS
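The forward and reverse lookup check from the start of this section can also be scripted. Here is a hedged Python sketch using the standard `socket` module. It uses whatever resolver the machine is configured with rather than nslookup itself, compares only the short host name, and the function name and result keys are invented for this example. The lookup functions are parameters so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable, Dict, Optional


def _reverse_lookup(ip: str) -> str:
    """Return the primary host name that the resolver associates with `ip`."""
    return socket.gethostbyaddr(ip)[0]


def check_name_resolution(
    hostname: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = _reverse_lookup,
) -> Dict[str, Optional[object]]:
    """Resolve name -> IP -> name and report whether the two names agree."""
    result: Dict[str, Optional[object]] = {
        "hostname": hostname, "ip": None, "reverse_name": None, "consistent": False,
    }
    try:
        result["ip"] = forward(hostname)
    except OSError:  # socket.gaierror is a subclass: no forward record found
        return result
    try:
        result["reverse_name"] = reverse(result["ip"])
    except OSError:  # socket.herror is a subclass: no PTR record found
        return result
    # Compare short names only, so "web01" matches "web01.example.local".
    short = lambda name: name.split(".")[0].lower()
    result["consistent"] = short(result["reverse_name"]) == short(hostname)
    return result
```

An `ip` of None points at a missing host (A) record, a `reverse_name` of None points at a missing PTR record, and `consistent` being False means the IP maps back to a different name.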
As I wrap this up I realize there is so much more that can be done. Each type of application server needs its own set of health checks, for example web servers, terminal servers, and database servers. Remember this is just the baseline for each server, and other components can and should be layered on top of it. Again, I would love to hear from others, so please feel free to add your comments below.
If you have been playing with the AD PowerShell cmdlets, you know that they require a few things to run: first, Windows Server 2008 R2 or Windows 7; second, the .NET Framework 3.5.1; and of course, if you want to manage an AD domain, you need Active Directory Web Services (ADWS) installed on at least one domain controller.
By the way ADWS requires TCP port 9389
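If the cmdlets cannot reach a DC, a quick way to rule out a firewall is to test that port directly. The Python sketch below simply attempts a TCP connection; the function name is my own, and a successful connection only shows that something is listening on the port, not that ADWS itself is healthy.

```python
import socket

ADWS_PORT = 9389  # TCP port used by ADWS, as noted above


def tcp_port_open(host: str, port: int = ADWS_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

Call it with the name of a domain controller, for example `tcp_port_open("dc01.example.local")`, where the host name is of course a placeholder.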
So how in the world does a Windows 7 system know how to find a DC running ADWS? Well, your client running PowerShell will use the normal DC locator process. First the client determines which site it is in (nltest /dsgetsite), and then it determines the closest DC (nltest /dsgetdc:<FQDN Domain>). It is looking at the DC for the following flag:
More info on that flag can be found here.
Now what if you don’t have Server 2008 R2 DCs? With Server 2003 and Server 2008 a problem occurs because the Net Logon service of those domain controllers does not recognize the DS_WEB_SERVICE_REQUIRED flag. There are two hotfixes (one for each version, depending on which you are running) available to fix that in those environments: Server 2003 and Server 2008
After you install this hotfix the AD PowerShell module and Active Directory Administrative Center will be able to locate DCs that have Active Directory Management Gateway Service installed, similar to Active Directory Web Services (ADWS) on a Windows Server 2008 R2-based computer.
UPDATE - Microsoft appears to have taken this download down. No word why or when it will be back up.
Looks like Microsoft just made the Windows 7 LDS (Lightweight Directory Services) client available. You can find both 32 and 64 bit clients here.
For those that aren't familiar with LDS, it is the Server 2008 replacement for ADAM, otherwise known as Active Directory Application Mode. While I'm no developer, LDS is a good platform for applications that require directory storage and access. It has most of the components of Active Directory without the complete infrastructure needed for Active Directory.
For the last several years I've worked in a team that is spread all across the world. The following ramblings are the items I've picked up from working in a virtual team as well as from books that I've read on the subject. One thing is key: leadership is leadership. It doesn’t matter if you are local or remote. Enjoy.
Trust is an important aspect at all levels of leadership. The degree to which trust is relied on across virtual teams is usually much deeper than with a local team. Trust is the key to getting performance from a team that is distributed geographically. Trust must be gained:
- In you as a virtual leader
- In the virtual project or virtual organization
- In all virtual team members across distance
Building Relationships and Trust
Since virtual teams have limited interaction and limited knowledge of each other in their isolation, the virtual team must establish ways to help team members learn about each other quickly and frequently.
- Establish ways for the team to learn more about each other professionally and personally so they will collaborate even when distant
- Establish a short meeting for the team to talk with one another to troubleshoot and discuss current issues
- Pair off people to work together on parts of the project
- Acknowledge all types of milestones, including birthdays, academic success, and other personal achievements
Virtual Team Alignment
People who work across distance tend to lose focus after any single meeting. Therefore, it is critical that the virtual team create:
- A clear vision so every team member knows exactly where the team is headed
- A clear emotional link so each remote team member stays motivated when distant
- A published roadmap that is used as each person does work remotely to align work and efforts
Virtual Team Equality
Be extremely fair in treating all team members, near and far, equally. Even appearances or suggestions of favoritism break trust.
- Avoid the temptation to rely more on those on-site with you than those at a distance
- Take culture differences into consideration
- Give every team member an equal opportunity to excel and contribute to the result
- Confront nonperformance in a constructive manner
- Be consistent and fair in holding everyone accountable for every factor needed to ensure team success
Miscommunication and unequal access to information are trust-breakers.
Keep communications flowing to counteract the out of sight out of mind phenomenon on distributed teams.
- Be extremely clear when making decisions
- Communicate more frequently than you would with a team that is only local
- Understand that members will have different communication preferences
- E-mail, forum, phone, face-to-face, instant messaging, etc
- What isn’t said matters too
- Check for understanding or ask for clarification
Again, these are items I've picked up over the years and through books. Please feel free to share your thoughts if you have anything good to add to the conversation.