Field Notes: The case of the failed SQL Server Failover Cluster Instance – Binaries Disks Added to Cluster

I paid a customer a visit a while ago and was asked to assist with a SQL Server Failover Cluster issue they were experiencing.  They had internally transferred the case from the SQL team to the folks who look after the Windows Server platform, as initial troubleshooting had not picked up anything relating to SQL.

My aim in this post is to:

  • explain what the issue was (adding disks meant to be local storage to the cluster)
  • provide a little bit of context on cluster disks and asymmetric storage configuration
  • discuss how the issue was resolved by removing the disks from the cluster

Issue definition and scope

An attempt to move the SQL Server role/group from one node to another in a 2-node Failover Cluster failed.  This is what they observed:

Failed SQL Server Group

From the image above, it can be seen that all disk resources are online.  Would you suspect that storage is involved at this stage?  In cluster events, there was the standard Event ID 1069 confirming that the cluster resource ‘SQL Server’ of type ‘SQL Server’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed.  Additionally, this is what was in the cluster log – “failed to start service with error 2”:

Cluster Log
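If you want to dig into this step yourself, the cluster log can be generated and searched from PowerShell.  A minimal sketch, assuming the FailoverClusters module is available and that C:\Temp exists (the output folder is just an example):

  # Generate the cluster log from every node into C:\Temp
  Get-ClusterLog -Destination "C:\Temp"

  # Search the generated logs for the failure reported above
  Select-String -Path "C:\Temp\*.log" -Pattern "failed to start service with error 2"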

Error code 2 means that the system cannot find the file specified:

Net HelpMsg
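For reference, the error code can be decoded from a command prompt or a PowerShell session:

  net helpmsg 2
  # The system cannot find the file specified.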

A little bit of digging around reveals the image path that the service could not get to:

Registry value
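If you want to check this on an instance of your own, the image path can be read straight from the registry.  A small sketch, assuming a default instance (service name MSSQLSERVER, as in this case):

  # Path the Service Control Manager uses to start the SQL Server service
  (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\MSSQLSERVER" -Name ImagePath).ImagePath
  # If this points to a volume that is offline on the node, service start fails with error 2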

Now that we have all this information, let’s look at how this specific issue was resolved.  Before that, however, I would like to provide a bit of context relating to cluster disks, especially the Asymmetric Storage Configuration.

Context

Consider a 2-node SQL Server Failover Cluster Instance running on a Windows Server 2012 R2 Failover Cluster with the following disk configuration:

  • C drive for the Operating System – each of the nodes has a direct attached disk
  • D drive for SQL binaries – each of the nodes has a dedicated “local” drive, presented from a Storage Area Network (SAN)
  • All the other drives required for SQL are shared drives presented from the SAN

Disks in Server Manager

Note: The 20 GB drive is presented from the SAN and is not added to the cluster at this stage.

I used Hyper-V Virtual Machines to reproduce this issue in a lab environment.  For the SAN part, I used the iSCSI target that is built into Windows Server.
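For those who would like to build a similar lab, the iSCSI Target Server side can be scripted.  A rough sketch only; the paths, sizes, target name and IQNs below are illustrative assumptions, not the exact lab configuration:

  # On the storage VM: install the iSCSI Target Server role service
  Install-WindowsFeature FS-iSCSITarget-Server

  # Create a virtual disk to present to the cluster nodes (path and size are examples)
  New-IscsiVirtualDisk -Path "C:\iSCSIVirtualDisks\SQLData.vhdx" -SizeBytes 50GB

  # Create a target and allow the two cluster nodes to connect (IQNs are placeholders)
  New-IscsiServerTarget -TargetName "SQLCluster" -InitiatorIds "IQN:iqn.1991-05.com.microsoft:pta-sql11","IQN:iqn.1991-05.com.microsoft:pta-sql12"

  # Map the virtual disk to the target
  Add-IscsiVirtualDiskTargetMapping -TargetName "SQLCluster" -Path "C:\iSCSIVirtualDisks\SQLData.vhdx"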


Asymmetric Storage Configuration

A feature enhancement in Failover Clustering for Windows Server 2012 and Windows Server 2012 R2 is support for an Asymmetric Storage Configuration.  In Windows Server 2012, a disk is considered clusterable if it is presented to one or more nodes, is not the boot/system disk, and does not contain a page file.  See https://support.microsoft.com/en-us/help/2813005/local-sas-disks-getting-added-in-windows-server-2012-failover-cluster


What happens when you Add Disks to Cluster?

Let us first take a look at the disks node in Failover Cluster Manager (FCM) before adding the disks.

Disks in Failover Cluster Manager

Here’s what we have (ordered by the disk number column):

  • The Failover Cluster Witness disk (1 GB)
  • SQL Data (50 GB)
  • SQL Logs (10 GB)
  • Other Stuff (5 GB)
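As a side note, the same view is available from PowerShell, together with the list of disks the cluster considers eligible to be added.  A quick sketch:

  # Physical Disk resources that are already part of the cluster
  Get-ClusterResource | Where-Object { $_.ResourceType -like "Physical Disk" } | Format-Table Name, State, OwnerGroup

  # Disks the cluster considers clusterable but that have not been added yet;
  # piping this to Add-ClusterDisk is the scripted equivalent of the wizard shown below
  Get-ClusterAvailableDisk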

The following window is presented when you perform the Add Disk operation in FCM:

Add Disks to a Cluster

Both disks are added as cluster disks when you click OK at this stage.  After adding the disks (each of which is presented to only one node), we see the following:

Disks in Failover Cluster Manager

Nothing has changed for the 4 disks we had already seen in FCM, and the two “local” disks are now included:

  • Cluster Disk 1 is online on node PTA-SQL11
  • Cluster Disk 2 is offline on node PTA-SQL11 as it is not physically connected to the node

At this stage, everything still works fine as the SQL binaries volume is still available on this node.  Note that the “Available Storage” group is running on PTA-SQL11.
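The same picture can be confirmed from PowerShell.  A quick sketch, run from either node:

  # Which node currently owns the Available Storage group?
  Get-ClusterGroup -Name "Available Storage"

  # State and owner of the disk resources sitting in that group
  Get-ClusterResource | Where-Object { $_.OwnerGroup -like "Available Storage" } | Format-Table Name, State, OwnerNode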


What happens when you move the Available Storage group?

Move Available Storage
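The move shown above can also be performed from PowerShell, a one-liner sketch using the node names from this lab:

  # Move the Available Storage group to the second node
  Move-ClusterGroup -Name "Available Storage" -Node PTA-SQL12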

Let’s take a look at FCM again:

Disks in Failover Cluster Manager

Now we see that:

  • Cluster Disk 1 is now offline
  • Cluster Disk 2 is now online
  • The owner of the “Available Storage” group is now PTA-SQL12

This means that PTA-SQL12 can now see the SQL binaries volume and PTA-SQL11 cannot, which causes downtime for a SQL instance running on PTA-SQL11.  Moving the SQL group to PTA-SQL12 works just fine, as the SQL binaries drive is online on that node.  You may also want to ensure that the resources are configured to automatically recover from failures.  Below is an example of the default configuration on a resource:

Resource Properties
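If you prefer to check these policies without clicking through each resource, the restart settings are exposed as properties on the cluster resource objects.  A small sketch (the resource name matches the one from the Event ID 1069 message earlier):

  # Review the restart policy of the SQL Server resource
  Get-ClusterResource -Name "SQL Server" | Format-List Name, RestartAction, RestartThreshold, RestartPeriod, RetryPeriodOnFailure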


Process, People and Technology

It may appear that the technology is at fault here, but the Failover Cluster service does its bit to protect us from shooting ourselves in the foot.  Here are some examples:

Validation

The Failover Cluster validation report does a good job of letting you know that disks are only visible from one node.  By the way, there is also good information here on what it takes for a disk to be considered clusterable.

Validation Report

A warning in a validation report is more of a “proceed with caution”.  Failures/errors mean that the solution does not meet the requirements for Microsoft support.  Also be careful when validating storage, as services may be taken offline.
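Validation can also be run from PowerShell.  A minimal sketch that skips the storage tests so that clustered disks are not taken offline (the assumption being that a full validation, including storage, is still run in a maintenance window):

  # Run cluster validation but leave the storage tests out
  Test-Cluster -Node PTA-SQL11, PTA-SQL12 -Ignore "Storage"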


Logic

In the following snippet from the cluster log, we see an example of the Failover Cluster Resource Control Manager (RCM) preventing a move of the “Available Storage” group in order to avoid downtime.

Cluster Log

Back online and way forward

To get the service up and running again, we had to remove both Cluster Disk 1 and Cluster Disk 2 from the cluster and make them “local” drives again.  The root cause was that an administrator had added disks that were never meant to be part of the cluster as clustered disks.

After removal, the disks need to be brought online again from a tool such as the Disk Management console; they are automatically placed in an offline state to avoid the issues that could arise from having a non-clustered disk online on two or more nodes in a shared-disk scenario.
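For reference, here is roughly what that clean-up looks like in PowerShell.  A sketch that assumes the resource names created earlier and uses an example disk number (confirm your own disk numbers with Get-Disk or Disk Management first):

  # Remove the two disk resources from the cluster (this does not delete any data)
  Remove-ClusterResource -Name "Cluster Disk 1" -Force
  Remove-ClusterResource -Name "Cluster Disk 2" -Force

  # On each node, bring the local binaries disk back online (disk number 1 is an example)
  Set-Disk -Number 1 -IsOffline $false
  Set-Disk -Number 1 -IsReadOnly $false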

I got curious after this and reached out to folks who specialize in SQL Server to get their views on whether the SQL binaries drive should or should not be shared.  One of the strong views is to keep the binaries on a non-shared (non-clustered) drive, especially with SQL patching in mind.  What happens, for example, if SQL patching fails in a shared-drive scenario?

Anyway, it would be great to hear from you in the comments.

Till next time…