vSAN Operations: Disk Failure Impact

Using vSAN as vSphere storage has many advantages: it's cheap, performs well and is easy to configure. What hasn't been widely discussed are the operational impacts of running vSAN (or any HCI, for that matter). This is especially relevant for large organisations with separate storage, virtualization and datacenter server teams.

I will cover this in a series of posts, with this first one relating to the impact of server disk failures.

Local Disk Failure Impact
In a standard (non-vSAN) vSphere environment, VMs are often stored on external storage. Any local disks on the ESXi servers are normally used only for the ESXi OS. In these circumstances, server disk failures (even multiple) have only limited impact on the VMs. The worst case is that the host goes down, causing HA to restart the VMs on other hosts in the cluster.

The impact of local disk failure in a vSAN host is clearly much higher. As the VMs are stored on local server disks, any of these failing or being unexpectedly unplugged can cause VMs to go down or even be destroyed.

Individual Disk Failure Changes
In addition to the more severe impact of local disk failure, using vSAN changes which disk failures could have an impact on VMs.

Most (non-vSAN) servers have their disks configured in some form of RAID, which means that a single disk failure on each of several hosts has no impact on whatever is running on those servers. This applies whether VMs are located on external storage or not, and is a type of availability most IT professionals are familiar with.

Standard Storage Disk Failure

Because the components of VM objects are distributed across the hosts in a vSAN cluster, using vSAN introduces the possibility of VMs being impacted when single disks fail on separate hosts. This situation would be unexpected for anyone without experience of HCI technologies, and could result in lost VMs.

Assuming that vSAN is used with an FTT (failures to tolerate) policy of 1 or higher, all VMs can tolerate at least the same number of component failures as they can on traditional storage. If operational procedures aren't updated, the overall risk of VM downtime could still be higher, because teams may not understand the changes in the technology. The SAN has basically moved into the hosts, and operational procedures need to reflect this.
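
To make the difference concrete, here is a minimal Python sketch of the scenario above. It is a simplified model and not how vSAN is implemented: the host and disk names are made up, the traditional hosts are assumed to use a two-disk local mirror, and the vSAN object is assumed to use an FTT=1 mirroring layout with two replicas and a witness, staying accessible only while more than half of its component votes survive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One component of a VM object (hypothetical model, not a vSAN API)."""
    host: str
    disk: str
    votes: int = 1


def raid1_host_ok(host, disks, failed):
    """A host with a local mirror survives while at least one disk is healthy."""
    return any((host, d) not in failed for d in disks)


def vsan_object_accessible(components, failed):
    """Simplified rule: accessible while more than half of the votes survive."""
    total = sum(c.votes for c in components)
    alive = sum(c.votes for c in components if (c.host, c.disk) not in failed)
    return alive * 2 > total


# Assumed FTT=1 layout: two replicas and a witness on three different hosts.
vm_object = [
    Component("esx01", "disk0"),  # replica A
    Component("esx02", "disk0"),  # replica B
    Component("esx03", "disk0"),  # witness
]

one_failure = {("esx01", "disk0")}
two_failures = {("esx01", "disk0"), ("esx02", "disk0")}

# Traditional hosts: one failed disk per host, every local mirror still works.
print(all(raid1_host_ok(h, ["disk0", "disk1"], two_failures)
          for h in ("esx01", "esx02")))               # True

# vSAN object: one disk failure is tolerated, as FTT=1 promises ...
print(vsan_object_accessible(vm_object, one_failure))   # True

# ... but single disk failures on two separate hosts make it inaccessible.
print(vsan_object_accessible(vm_object, two_failures))  # False
```

The same two disk failures that a server team would treat as routine on RAID-protected hosts are enough to take the vSAN object offline in this model, which is why a failed disk needs to be handled with storage-level urgency.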
