vSAN Availability during Maintenances

When performing vSphere maintenances while using vSAN, VMware recommends almost always using the 'Ensure Accessibility' option. This option ensures that all vSAN objects are consistent and minimizes both the amount of data migrated and the time it takes the host to enter maintenance mode. The drawback is that vSAN objects with FTT=1 stop being highly available during the maintenance.

The fact that a VM's storage availability is affected by a vSphere maintenance differs from what VMware admins are used to with standard SAN storage. Many admins I've spoken to have expressed concerns about the reduced availability and proposed either using 'Full Data Migration' for all maintenances or configuring VMs with a storage policy of FTT=2. Both of these options are highly impactful. 'Full Data Migration' migrates all vSAN data from the host, which not only takes a long time but can also be disruptive, as the migration uses IOPS on the capacity disks (although vSAN 6.6 reduces this impact by throttling the migration traffic). Using FTT=2 consumes considerably more storage capacity and write IOPS, which increases the price of the vSAN cluster.

This post looks at the real impact of 'Ensure Accessibility', how this compares to standard SAN storage and how it affects the overall availability of the system.

Disk Failure Impact

When considering how 'Ensure Accessibility' impacts the availability of a VM, the first thing to take into account is the effect of a physical disk failure during a maintenance. In this scenario, the vdisk of the VM becomes inaccessible, causing the VM to go down. But remember that the remaining vSAN disk component still exists on the host in maintenance mode, i.e., when the host exits maintenance mode, the vSAN object becomes available again. Any changes made while the storage object had reduced availability would be lost, but the VM itself wouldn't be lost, only unavailable for a period of time.

Availability SLA Definition

If it is not acceptable for VMs to have reduced availability during a vSphere maintenance, a way around it is to define the availability SLA so that it does not apply during planned maintenance windows, and to ensure hosts only enter maintenance mode during scheduled maintenance. For various reasons, this might not be possible, though.

SAN Comparison

Using standard SAN storage, there would be no impact on the storage availability during a vSphere maintenance, but there are other events that do have an impact. The most common RAID group configuration in SAN storage is RAID5 or RAID10. When a disk failure occurs, the whole RAID group has reduced availability until the RAID is rebuilt using a hot spare disk. Any additional disk failure occurring in the same RAID group during the rebuild would bring down the whole RAID group and any VMs depending on it. Most vSphere admins consider this reduced availability to be acceptable and don’t require configuring all RAID groups as RAID6. Besides, vSAN is an object store with components distributed across the disks in a vSAN cluster. The rebuild operation is therefore spread across disks and arguably faster in vSAN than in standard SAN storage, where all data must be written to an individual disk.

Availability Calculation Example

How much is the availability reduced by using 'Ensure Accessibility' and how does this impact the total availability SLA of a solution? The availability can quite easily be calculated, and as I show below the impact on the total availability of the system is minimal. To prove the point, I'm using excessive numbers, and in reality, the impact is most likely even lower.

The total availability of a system is calculated by multiplying the availability of all the components (where A is the total availability and Ax, Ay and Az are the availability of the individual components):

A = Ax * Ay * Az

A redundant component would increase its availability according to the following formula (where A is the combined availability of two redundant components and Ax the availability of an individual component):

A = 1 – (1 – Ax)^2
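The two formulas can be written as small helper functions. This is a minimal Python sketch; the function names and the 99% example values are illustrative assumptions, not something taken from VMware tooling.

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of a system in which every component must be up."""
    return prod(components)


def redundant_availability(a: float, copies: int = 2) -> float:
    """Availability of `copies` independent redundant components,
    where a single surviving copy is enough (copies=2 matches the formula above)."""
    return 1 - (1 - a) ** copies


# Hypothetical example: three components in series, each 99% available.
print(f"{series_availability(0.99, 0.99, 0.99):.4%}")  # 97.0299%

# Two redundant components, each 95% available.
print(f"{redundant_availability(0.95):.4%}")  # 99.7500%
```

Both helpers assume that component failures are independent, which is the same simplification the formulas make.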

Most enterprise disks have a stated Mean Time Between Failures (MTBF) of 2 million hours. As there are 8760 hours in a year, this corresponds to more than 200 years, i.e., the likelihood of a specific disk failing in a year is less than 0.5%. For the sake of this argument, I will be much more conservative and assume that the likelihood is 5%. The availability of an individual disk over a year is therefore 95%.
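As a rough check of the MTBF figures, the yearly failure likelihood can be estimated in two ways. The sketch below is only illustrative: the simple ratio of hours per year to MTBF is the reasoning used in the text, while the exponential estimate additionally assumes a constant failure rate, which is my assumption and not something a disk vendor guarantees.

```python
import math

HOURS_PER_YEAR = 8760
MTBF_HOURS = 2_000_000  # stated MTBF from the text

# MTBF expressed in years (roughly 228 years).
mtbf_years = MTBF_HOURS / HOURS_PER_YEAR

# Simple estimate: fraction of the MTBF that one year represents.
linear_estimate = HOURS_PER_YEAR / MTBF_HOURS

# Assumption: constant failure rate (exponential lifetime model).
exponential_estimate = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)

print(f"MTBF in years: {mtbf_years:.0f}")
print(f"Yearly failure likelihood (linear): {linear_estimate:.3%}")
print(f"Yearly failure likelihood (exponential): {exponential_estimate:.3%}")
```

Both estimates land below 0.5%, so the 5% used in the rest of the calculation is more than ten times as pessimistic.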

Let's also assume that a VM is striped over ten disks, i.e., if any of these fails during the period where the vSAN object has reduced availability the VM would fail. The total yearly availability of the system would, therefore, be 97.5% if all vSAN objects were highly available and 60% if the objects were not highly available.

Total storage availability (highly available) = (1 – (1 – Ax)^2)^10 = 97.5%

Total storage availability (not highly available) = Ax^10 = 60%
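These two figures are easy to reproduce. The following Python sketch uses the 95% disk availability and the ten-disk stripe from the text; the variable names are my own.

```python
DISK_AVAILABILITY = 0.95  # conservative yearly availability of one disk
DISKS_PER_VM = 10         # the VM is assumed to be striped over ten disks

# Each of the ten stripe members is mirrored on a second disk (FTT=1).
mirrored_pair = 1 - (1 - DISK_AVAILABILITY) ** 2
highly_available = mirrored_pair ** DISKS_PER_VM

# Without the mirror copy, all ten disks must survive.
not_highly_available = DISK_AVAILABILITY ** DISKS_PER_VM

print(f"Highly available:     {highly_available:.2%}")      # 97.53%
print(f"Not highly available: {not_highly_available:.2%}")  # 59.87%
```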

The difference in total availability of the vSAN object is large but assumes the reduced availability applies to the entire year. In fact, the reduced availability only refers to a tiny fraction of the year, and this significantly minimizes the impact.

The time that objects have reduced availability is the vSAN wait time, which is one hour by default, plus the time it takes to rebuild storage objects if the wait time expires. To be conservative and simplify things, let's assume that a maintenance takes 5 hours to complete and that the wait time is extended to match. If a maintenance is performed every week, i.e., about 50 times a year, the total amount of time the vSAN objects have reduced availability would be 250 hours a year, which corresponds to about 2.85% of the time.

Portion of time with reduced availability = (5 * 50)/8760 = 2.85%

As indicated above, the total availability of the system is much lower if none of the components are highly available, but this would only apply 2.85% of the time. Performing all maintenance using 'Ensure Accessibility' instead of 'Full Data Migration' would reduce the overall storage system availability from 97.5% to 96.43%.

Total failure probability = 2.5% * 0.9715 + 40% * 0.0285 = 3.57%

Total storage availability (using ensure accessibility during maintenances) = 96.43%
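The weighted calculation above can be sketched in Python as follows. The function name is hypothetical, and the inputs are the rounded 2.5% and 40% failure probabilities from the text.

```python
HOURS_PER_YEAR = 8760


def blended_failure_probability(p_fail_ha: float,
                                p_fail_reduced: float,
                                reduced_fraction: float) -> float:
    """Weight the two failure probabilities by the share of the year
    spent in each state (highly available vs. reduced availability)."""
    return (p_fail_ha * (1 - reduced_fraction)
            + p_fail_reduced * reduced_fraction)


# 5 hours per maintenance, about 50 maintenances a year.
reduced_fraction = (5 * 50) / HOURS_PER_YEAR

p_fail = blended_failure_probability(0.025, 0.40, reduced_fraction)

print(f"Time with reduced availability: {reduced_fraction:.2%}")  # 2.85%
print(f"Total failure probability:      {p_fail:.2%}")            # 3.57%
print(f"Total storage availability:     {1 - p_fail:.2%}")        # 96.43%
```

Changing the maintenance duration or frequency in `reduced_fraction` shows how quickly the impact shrinks with shorter or less frequent maintenances.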

I believe the likelihood that this would affect your availability SLA isn't very high. If the mean time to recover (MTTR) is 5 hours (remember that the failed VM would become available when the maintenance completes), this represents an increase in the average yearly downtime from 7.5 min to about 10.7 min, i.e., an increase of roughly 3.2 minutes. Note that this is calculated from very conservative numbers and would most likely be even less impactful in reality.
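The downtime figures follow from multiplying the yearly failure probability by the MTTR. The sketch below reflects my reading of that calculation: it treats the expected yearly downtime as failure probability times recovery time, which implicitly assumes at most one failure per year.

```python
MTTR_HOURS = 5  # mean time to recover, as assumed in the text


def expected_downtime_minutes(p_fail: float, mttr_hours: float) -> float:
    """Expected yearly downtime in minutes for a given failure probability."""
    return p_fail * mttr_hours * 60


baseline = expected_downtime_minutes(0.025, MTTR_HOURS)    # always highly available
with_ea = expected_downtime_minutes(0.0357, MTTR_HOURS)    # using 'Ensure Accessibility'

print(f"Baseline downtime:           {baseline:.2f} min")            # 7.50 min
print(f"With 'Ensure Accessibility': {with_ea:.2f} min")             # 10.71 min
print(f"Increase:                    {with_ea - baseline:.2f} min")  # 3.21 min
```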

Summary

Although it might sound scary to leave storage objects in a state that is not highly available during maintenances, the actual impact on the system availability is minimal. Temporarily reduced availability is also something that occasionally occurs in all highly available systems.
