The fundamentally different architecture of vSAN compared to traditional SAN storage brings some operational challenges. One of these is what happens when capacity is running low, and what value the alert threshold should be set to so that workloads don’t experience problems.
One problem is that there are significant differences between what happens when storage capacity is running low in a standard VMFS/NFS datastore and in a vSAN datastore. VMs on a standard VMFS/NFS datastore work until the datastore runs out of space, and then a lot of problems suddenly occur. With vSAN this is not as clear. Instead of having a problem at a specific point (i.e. 100% utilization), the probability of having problems starts at a threshold and increases as more capacity is utilized. The problems that can occur include:
- VM storage performance problems due to increased vSAN background migration traffic.
- Objects staying degraded for longer after a component failure, due to extra background migration traffic.
- Inability for components to be rebuilt after failure, due to lack of raw storage capacity.
- Objects failing due to disk components not having capacity to grow.
Most vSphere admins are familiar with managing VMFS/NFS datastore capacity in vSphere and normally have alarms set up to alert when datastore capacity is running low. A buffer of free space (such as 20%) is generally kept on each datastore to minimize the probability that the datastore runs out of space, and alerts are set up to ensure this buffer stays unused. As is indicated by the graphs below, this approach doesn’t work particularly well for a vSAN datastore, and you might see issues in the environment long before you reach this limit. My suggestion is to set the alert threshold at the point where the probability of problems starts to rise, which is generally at a much lower utilization than on a VMFS/NFS datastore.
A simplified view of this is indicated by the following graphs, where the X-axis represents the amount of storage space utilized and the Y-axis the probability of having problems (yes, I know this is simplified, but you get the point).
What value should the alert threshold on a vSAN datastore be set to, then? Here I present a calculation that takes into account the migration threshold of individual disks and the HA level of the cluster.
To minimize the risk that an individual capacity disk contributing to a vSAN datastore runs out of space, vSAN starts migrating components off a disk when it reaches 80% utilized capacity. This data migration uses bandwidth and consumes IOPS on the disks, which can reduce the performance of the vSAN datastore and should therefore be avoided. Because vSAN is object storage, utilized capacity is not necessarily distributed uniformly across all disks in a vSAN cluster, and some disks can be fuller than others. To minimize the likelihood of any disk reaching 80% capacity, a maximum of 70% utilization of the vSAN datastore is generally recommended.
Something else to consider when calculating the max utilized capacity of a vSAN datastore is the fact that the vSAN storage capacity is part of the ESXi hosts in the vSphere cluster. When it comes to compute and memory, Admission Control is in place to ensure that HA levels can be satisfied, but this doesn’t apply to vSAN storage capacity. If a vSphere cluster is configured as N+1, you must take one host’s worth of vSAN storage capacity into consideration when calculating how much storage capacity should be left unused.
An example would be a cluster of 5 ESXi hosts, each with 2TB of raw vSAN storage capacity, where the cluster should be able to handle 1 host failure. To minimize the risk that any vSAN capacity disk uses more than 80%, and to ensure there is enough capacity in the cluster even if one host goes down, the vSAN datastore should never utilize more than 56% of total capacity:
0.7 * 4/5 = 0.56
As the total raw capacity of the vSAN datastore in the example is 10TB, the max utilized vSAN datastore space would be 5.6TB. The alert threshold could then be set to 56%, with a warning at a lower value such as 50%. This could be set using vCenter Alarms, vROps, or any third-party monitoring solution.
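As a rough sketch, the calculation above can be wrapped in a small helper. The function name and parameters are my own for illustration, not part of any VMware tooling, and the sketch assumes every host contributes the same amount of raw capacity.

```python
def max_utilization_fraction(hosts: int, host_failures: int = 1,
                             disk_headroom: float = 0.70) -> float:
    """Fraction of total raw vSAN capacity that should not be exceeded.

    disk_headroom is the 70% datastore limit from the text; host_failures
    is the number of hosts the cluster should tolerate losing (1 for N+1).
    Assumes all hosts contribute equal raw capacity.
    """
    if hosts <= host_failures:
        raise ValueError("cluster needs more hosts than tolerated host failures")
    surviving_share = (hosts - host_failures) / hosts
    return disk_headroom * surviving_share


# The example from the text: 5 hosts with 2 TB raw capacity each, N+1.
hosts = 5
raw_tb_per_host = 2.0
fraction = max_utilization_fraction(hosts)
total_raw_tb = hosts * raw_tb_per_host
print(f"Alert threshold: {fraction:.0%} of {total_raw_tb:.0f} TB "
      f"= {fraction * total_raw_tb:.1f} TB")
```

The resulting percentage is what you would enter as the alert threshold in whatever monitoring tool is in use.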
One problem with this is that the percentage depends on how many hosts are in the cluster and changes when hosts are added. A vSphere admin would need to change the alert threshold every time a host is added. An alternative approach would be to keep the most conservative value (such as a calculation based on 4 hosts). This would waste storage capacity but would simplify the operational procedure of adding hosts to the cluster.
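To get a feel for this trade-off, here is a minimal sketch that compares the exact threshold for each cluster size with a fixed threshold based on the smallest cluster size. The constants reuse the numbers from the example (70% limit, N+1, 2 TB per host); the function names are hypothetical, and equal capacity per host is assumed.

```python
DISK_HEADROOM = 0.70    # 70% datastore limit from the text
HOST_FAILURES = 1       # N+1
RAW_TB_PER_HOST = 2.0   # per-host raw capacity from the example


def threshold(hosts: int) -> float:
    """Exact max-utilization fraction for a cluster of this size."""
    return DISK_HEADROOM * (hosts - HOST_FAILURES) / hosts


def threshold_table(min_hosts: int, max_hosts: int):
    """Rows of (hosts, exact, fixed, unused_tb).

    fixed is the conservative threshold calculated for min_hosts;
    unused_tb is the capacity the fixed threshold leaves unused
    compared to recalculating the threshold for each cluster size.
    """
    fixed = threshold(min_hosts)
    rows = []
    for hosts in range(min_hosts, max_hosts + 1):
        exact = threshold(hosts)
        raw_tb = hosts * RAW_TB_PER_HOST
        rows.append((hosts, exact, fixed, (exact - fixed) * raw_tb))
    return rows


for hosts, exact, fixed, unused_tb in threshold_table(4, 8):
    print(f"{hosts} hosts: exact {exact:.1%}, fixed {fixed:.1%}, "
          f"left unused by fixed threshold: {unused_tb:.2f} TB")
```

With these numbers, a threshold fixed at the 4-host value of 52.5% leaves roughly 0.35 TB unused in the 5-host example, and the gap grows as more hosts are added.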