if you gaze long into the IT abyss: Case of ClusterStorage.000

Recently I worked on an issue: after one of the cluster nodes was rebooted, virtual machines could no longer migrate back to that node. The cluster event log contained errors like these:

Cluster resource 'SCVMM pxe Configuration' of type 'Virtual Machine Configuration' in clustered role 'pxe' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

The Cluster service failed to bring clustered service or application 'pxe' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

I checked the ClusterStorage folder on that node, and it turned out there were three of them: ClusterStorage itself plus copies with the suffixes .000 and .001.

That looked like a good reason to dig into the cluster log:

00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] Cluster Shared Volume Root is C:\ClusterStorage

00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] UpdateClusDiskMembership(enter): nodeSet (1 2 3)

00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] CsvFs Listener already started...

00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] CsvFlt Listener already started...

00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] NFlt Listener already started...

00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] DeleteCsvShare: remove csv blockstream C:\ClusterStorage:{db19d832-b034-46ed-a6c5-61e0ebe370d1}

00000cfc.000017c8::2013/11/01-20:00:36.081 WARN [DCM] Failed to delete csv share CSV$ status 2310

00000cfc.000017c8::2013/11/01-20:00:36.097 WARN [DCM] rename attempt C:\ClusterStorage => C:\ClusterStorage.000, status 183

00000cfc.000017c8::2013/11/01-20:00:36.113 WARN [DCM] Renamed existing C:\ClusterStorage to C:\ClusterStorage.001

00000cfc.000017c8::2013/11/01-20:00:36.128 INFO [DCM] CreateRootDirectory: keeping open handle HDL( bb4 ) to CSV root

00000cfc.000017c8::2013/11/01-20:00:36.128 INFO [DCM] create CSV stream file C:\ClusterStorage:{db19d832-b034-46ed-a6c5-61e0ebe370d1}
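In a longer cluster log these warnings are easy to miss, so a small script can pull out the [DCM] WARN lines and put a name on the status codes. The following is only a minimal sketch in Python, assuming the line layout seen in the excerpt above; the function name `dcm_warnings` and the two-entry status table are my own illustration, not part of any cluster tooling. Status 183 is the Win32 code ERROR_ALREADY_EXISTS, and 2310 is the network management code NERR_NetNameNotFound (the share does not exist).

```python
import re

# Names for the two status codes that appear in the excerpt above.
STATUS_NAMES = {
    183: "ERROR_ALREADY_EXISTS",
    2310: "NERR_NetNameNotFound",
}

# Layout assumed from the excerpt:
#   <pid>.<tid>::<date>-<time> <LEVEL> [DCM] <message>
LINE_RE = re.compile(
    r"^(?P<ids>[0-9a-f]+\.[0-9a-f]+)::(?P<stamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+\[DCM\]\s+(?P<msg>.*)$"
)
STATUS_RE = re.compile(r"status (\d+)")


def dcm_warnings(lines):
    """Return (timestamp, message, status_name_or_None) for each [DCM] WARN line."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "WARN":
            continue
        name = None
        status = STATUS_RE.search(m.group("msg"))
        if status:
            code = int(status.group(1))
            name = STATUS_NAMES.get(code, f"unknown ({code})")
        found.append((m.group("stamp"), m.group("msg"), name))
    return found


# Three lines copied from the log excerpt in this post.
sample = [
    r"00000cfc.000017c8::2013/11/01-20:00:36.081 INFO [DCM] Cluster Shared Volume Root is C:\ClusterStorage",
    r"00000cfc.000017c8::2013/11/01-20:00:36.081 WARN [DCM] Failed to delete csv share CSV$ status 2310",
    r"00000cfc.000017c8::2013/11/01-20:00:36.097 WARN [DCM] rename attempt C:\ClusterStorage => C:\ClusterStorage.000, status 183",
]

for stamp, msg, name in dcm_warnings(sample):
    print(stamp, "|", msg, "|", name or "")
```

Read this way, the rename to ClusterStorage.000 failed because that folder already existed, which is why the service fell back to ClusterStorage.001.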

Then I checked EMC PowerPath, and it still listed some dead paths to our old SAN array. I removed them, stopped the Cluster service on the node, and deleted the ClusterStorage.000 and ClusterStorage.001 folders. Then I started the Cluster service again. Issue resolved!

A very similar issue once happened with our file cluster. Again, the culprit was an old CSV record that had not been deleted correctly.

So, if you face a similar issue, all you need to do is stop the Cluster service on the affected node, delete the unnecessary ClusterStorage folders while the service is stopped, and remove the obsolete paths to the old array from your multipath software so that the folders are not accidentally recreated.
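Before deleting anything, it can help to list which numbered copies actually exist on the node. Here is a minimal sketch in Python that only reports folders named ClusterStorage.NNN under a given root and deletes nothing. The helper name `leftover_csv_roots` is hypothetical, and the three-digit suffix pattern is an assumption based on the .000 and .001 names seen in this case. The demo runs against a temporary directory so it works on any machine.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: numbered copies such as ClusterStorage.000, ClusterStorage.001
LEFTOVER_RE = re.compile(r"^ClusterStorage\.\d{3}$", re.IGNORECASE)


def leftover_csv_roots(drive_root):
    """Return the names of directories directly under drive_root
    that look like numbered ClusterStorage copies. Nothing is deleted."""
    root = Path(drive_root)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and LEFTOVER_RE.match(p.name)
    )


# Demo against a throwaway directory that mimics the situation in this post.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ClusterStorage", "ClusterStorage.000", "ClusterStorage.001", "Windows"):
        (Path(tmp) / name).mkdir()
    print(leftover_csv_roots(tmp))
```

On a real node you would point it at the system drive root instead of the temporary directory, and remove the reported folders by hand only after the Cluster service has been stopped.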

I hope this helps.

4 comments:

thomas t., September 9, 2015 at 7:38 PM
nice post. thank you.

Ryan Shepherd, October 7, 2015 at 5:03 PM
Thank you thank you thank you.

Chris, December 14, 2015 at 12:24 PM
Thank you for your post really helpful

Unknown, September 3, 2017 at 4:39 AM
This process worked for me.

In summary:
(1) Migrate Roles and Storage off the Node with errors.
(2) Stop the Cluster Service
(3) Delete or rename the C:\ClusterStorage folder on the Node with errors.
(4) Restart the Cluster Service
(5) Migrate Roles and Storage back to the Node with errors.

Tuesday, November 5, 2013
