1 - Storage Stack Overview
Understanding the storage stack is crucial for understanding which technologies are involved and where they sit (where Storage Replica lives, where the ReFS Multi-Resilient Volume is, …). Understanding how the layers are stacked also helps when troubleshooting I/O flow - for example when reviewing performance counters or troubleshooting core functionality.
The traditional stack compared to the Storage Spaces stack (note that MPIO is missing - it is not needed for Storage Spaces Direct, as there is only one path to each physical device, so it was omitted).
You may notice four "new" layers, but in reality only the Spaces layer (Spaceport) and the Storage Bus Layer are new.
To better understand what is in the stack, you can also explore some of its parts with PowerShell.
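For example, a minimal sketch (assuming the node name axnode1 used later in this guide, and that the Storage module pipeline bindings behave as documented) that walks the Storage Spaces objects from the pool down to the volumes might look like this:

```powershell
# Walk the stack top-down: pool -> virtual disk -> disk -> partition -> volume
# (-IsPrimordial $false skips the primordial pool; axnode1 is an example node name)
$Server = "axnode1"
Get-StoragePool -CimSession $Server -IsPrimordial $false |
    Get-VirtualDisk |
    Get-Disk |
    Get-Partition |
    Get-Volume
```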
Anyway, let's explore the layers a bit. The following info is based on a storage stack description that someone once created and published on the internet. The only version found was on the Web Archive and can be accessed here.
Layers below S2D Stack
Port & Miniport driver
storport.sys & stornvme.sys
Port drivers implement the processing of an I/O request specific to a type of I/O port, such as SATA, and are implemented as kernel-mode libraries of functions rather than actual device drivers. The port driver is written by Microsoft (storport.sys). If a third party wants to write its own device driver (for example for an HBA), it provides a miniport driver (unless the device is NVMe - in that case the miniport driver is Microsoft's stornvme.sys).
Miniport drivers usually take advantage of storport's performance enhancements, such as support for the parallel execution of I/O.
- storage port drivers
- storage miniport drivers
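As a quick sketch, you can check which of these driver binaries are present on a node and which version ships with it (the standard %SystemRoot%\System32\drivers locations are assumed):

```powershell
# Port, miniport and class driver binaries discussed above
Get-Item -Path "$env:SystemRoot\System32\drivers\storport.sys",
               "$env:SystemRoot\System32\drivers\stornvme.sys",
               "$env:SystemRoot\System32\drivers\disk.sys" |
    Select-Object -Property Name, @{Name = 'FileVersion'; Expression = { $_.VersionInfo.FileVersion }}
```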
Class Driver
disk.sys
A storage class driver (typically disk.sys) uses the well-established SCSI class/port interface to control a mass storage device of its type on any bus for which the system supplies a storage port driver (currently SCSI, IDE, USB and IEEE 1394). The particular bus to which a storage device is connected is transparent to the storage class driver.
The storage class driver is responsible for claiming devices, interpreting system I/O requests, and more.
In the Storage Spaces stack (Virtual Disk), disk.sys is responsible for claiming the virtual disk exposed by spaceport (Storage Spaces).
Partition Manager
partmgr.sys
Partitions are handled by partmgr.sys. The partition table is usually GPT or MBR (GPT is preferred, as MBR has many limitations, such as the 2 TB size limit).
As you can see in the stack, there are two partition manager instances. One partition layout is on the physical disk, and it is then consumed by Storage Spaces (spaceport).
In the picture below you can see an individual physical disk exposed from Spaces and its partitions - a metadata partition and a partition containing the pool data (normally not visible, as it is hidden by partmgr.sys when it detects Spaces).
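A minimal sketch (again assuming the example node axnode1) to list partitions on all disks of a node, including hidden ones, could look like this:

```powershell
# GptType and IsHidden help to recognize the Spaces metadata/pool partitions
$Server = "axnode1"
Get-Disk -CimSession $Server |
    Get-Partition |
    Format-Table -Property DiskNumber, PartitionNumber, Type, GptType, Size, IsHidden
```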
S2D Stack
Storage Bus Layer
clusport.sys and clusbflt.sys
These two drivers (client/server) expose all physical disks to each cluster node, so it looks like all physical disks from every cluster node are connected to every server. SMB is used as the interconnect, therefore high-speed RDMA can be used (recommended).
This layer also implements the SBL cache.
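A hedged sketch to see the effect of the Storage Bus Layer - physical disks from all nodes showing up in the clustered storage subsystem (the subsystem friendly name is assumed to start with "Clustered Windows Storage"):

```powershell
# All physical disks from every node, as seen through the clustered subsystem
# Usage also shows which disks act as cache (Journal) devices
$Server = "axnode1"
Get-StorageSubSystem -CimSession $Server -FriendlyName "Clustered*" |
    Get-PhysicalDisk |
    Format-Table -Property FriendlyName, SerialNumber, MediaType, BusType, Usage
```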
Spaceport
spaceport.sys
Claims disks and adds them to the storage pool. It creates partitions where internal data structures and metadata are kept (see the screenshot in the Partition Manager section).
It defines resiliency when a volume (virtual disk) is created (it creates and distributes extents across physical disks).
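A minimal sketch showing the resiliency settings applied to each virtual disk (example node name again assumed):

```powershell
# Resiliency and footprint of the virtual disks created on the pool
$Server = "axnode1"
Get-VirtualDisk -CimSession $Server |
    Format-Table -Property FriendlyName, ResiliencySettingName, NumberOfDataCopies, PhysicalDiskRedundancy, FootprintOnPool, Size
```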
Virtual Disk
disk.sys is now used by Storage Spaces and exposes the virtual disk that was provisioned using spaceport.sys.
Layers above S2D Stack
Volume Manager
dmio.sys, volmgr.sys
Volumes are created on top of partitions; on volumes you can then create filesystems and expose them to the components higher in the stack.
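A short sketch to list the volumes exposed above the partition layer (example node name assumed):

```powershell
# Volumes with their filesystem and health, as exposed higher in the stack
$Server = "axnode1"
Get-Volume -CimSession $Server |
    Format-Table -Property DriveLetter, FileSystemLabel, FileSystem, HealthStatus, Size, SizeRemaining
```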
Volume Snapshot
volsnap.sys
Volsnap is the component that implements the system provider for the Volume Shadow Copy Service (VSS). This service is controlled by vssadmin.exe.
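A minimal sketch (run from an elevated prompt) to inspect what volsnap/VSS exposes:

```powershell
# List registered VSS providers (volsnap backs the default system provider)
# and any existing shadow copies
vssadmin list providers
vssadmin list shadows
```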
BitLocker
fvevol.sys
BitLocker is a well-known disk encryption feature that has been on the market since Windows Vista. In PowerShell you can expose the volume encryption status with the Get-BitLockerVolume command.
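For example (requires the BitLocker feature/module to be installed; run locally or wrap in Invoke-Command for a remote node):

```powershell
# Encryption status of all volumes on the local node
Get-BitLockerVolume |
    Format-Table -Property MountPoint, VolumeType, EncryptionMethod, ProtectionStatus, EncryptionPercentage
```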
Filter Drivers
An interesting fact about filter drivers is that all file system drivers are actually filter drivers - special ones, File System Drivers - like REFS.sys, NTFS.sys and Exfat.sys.
You can learn more about a filesystem using fsutil.
There are also many first-party and third-party filter drivers. You can expose those with the fltmc command.
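A minimal sketch combining both tools (run elevated; C: is just an example volume):

```powershell
# File system details for a volume
fsutil fsinfo volumeinfo C:

# Minifilters currently loaded, with their altitudes, and where they are attached
fltmc filters
fltmc instances
```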
As you can see in the example above, there are many filters, such as Cluster Shared Volumes (CsvNSFlt, CsvFlt), Deduplication (Dedup), Shared VHDX (svhdxflt), Storage QoS (storqosflt) and many more. Each filter driver has a defined altitude, and 3rd parties can reserve their own.
2 - Layers Below S2D Stack
2.1 - Storage Devices
Resources
Microsoft Documentation
- https://learn.microsoft.com/en-us/azure-stack/hci/concepts/choose-drives
- https://learn.microsoft.com/en-us/azure-stack/hci/concepts/drive-symmetry-considerations
NVMe vs SATA
- https://sata-io.org/sites/default/files/documents/NVMe%20and%20AHCI%20as%20SATA%20Express%20Interface%20Options%20-%20Whitepaper_.pdf
- http://www.nvmexpress.org/wp-content/uploads/2013/04/IDF-2012-NVM-Express-and-the-PCI-Express-SSD-Revolution.pdf
- https://nvmexpress.org/wp-content/uploads/NVMe-101-1-Part-2-Hardware-Designs_Final.pdf
- https://nvmexpress.org/wp-content/uploads/NVMe_Infrastructure_final1.pdf
- https://www.storagereview.com/review/dell-emc-poweredge-r750-hands-on
- https://dl.dell.com/manuals/common/dellemc-nvme-io-topologies-poweredge.pdf
- https://www.servethehome.com/dell-emc-poweredge-r7525-review-flagship-dell-dual-socket-server-amd-epyc/
Interfaces
While SATA still performs well for most customers (see the performance results), NVMe offers the benefit of higher capacity and a more efficient protocol (NVMe vs. AHCI), developed specifically for SSDs (unlike AHCI, which was developed for spinning media). SATA/SAS, however, does not scale well with larger disks.
There is also another aspect limiting the performance of SATA/SAS devices - the controller. All SATA/SAS devices are connected to a single SAS controller (non-RAID) that has limited bandwidth (only one PCIe connection).
The drive connector is universal (U.2, also known as SFF-8639).
NVMe drives are mapped directly to CPU
NVMe backplane connection - example AX7525 - 16 PCIe Gen4 lanes in each connection (8 are used), 12 connections in the backplane; in this case no PCIe switches are needed.
Storage Protocol
SSDs were originally created to replace conventional rotating media. As such they were designed to connect to the same bus types as HDDs, both SATA and SAS (Serial ATA and Serial Attached SCSI).
However, this imposed speed limitations on the SSDs. Now a new type of SSD exists that attaches to PCI-e. Known as NVMe SSDs or simply NVMe.
At 1M IOPS, NVMe has more than 50% lower latency while using less than 50% of the CPU cycles. This is due to the improved protocol (NVMe vs. AHCI).
Storage Configurations
(slowest to fastest)
- Hybrid (HDD+SSD)
- All Flash (All SSD)
- NVMe+HDD
- All-NVMe
When combining multiple media types, the faster media is used for caching. While it is recommended to use roughly 10% of the capacity for cache, what really matters is that the production workload does not spill out of the cache, as that will dramatically reduce performance. Therefore the entire production working set should fit into the Storage Bus Layer cache (the cache devices). The sweet spot (price vs. performance) is a combination of fast NVMe (mixed use or write intensive) with HDDs. For performance-intensive workloads it is recommended to use an all-flash solution, as caching introduces roughly 20% overhead plus less predictable behavior (data may have already been destaged, …); therefore All-Flash is recommended for SQL workloads.
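A hedged sketch to check the current cache configuration - run it on one of the cluster nodes (or wrap it in Invoke-Command):

```powershell
# Cache state and cache behavior per media type on an S2D cluster
Get-ClusterStorageSpacesDirect |
    Format-List -Property CacheState, CacheModeHDD, CacheModeSSD, CachePageSizeKBytes
```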
Performance drop when spilling cache devices:
OS Disks
In Dell servers, BOSS (Boot Optimized Storage Solution) cards are used. In essence, it is a card with 2x M.2 2280 disks (SATA or NVMe, depending on the BOSS generation) connected to PCIe, configurable as non-RAID or RAID 1.
Consumer-Grade SSDs
You should avoid consumer-grade SSDs, as they might contain NAND with higher latency (so there can be a performance drop after spilling the FTL buffer) and they are not power-loss protected (PLP). You can learn more about why consumer-grade SSDs are not a good idea in a blog post. Consumer-grade SSDs also have a lower DWPD (Drive Writes Per Day) rating. You can learn about DWPD in this blog post.
Exploring Stack with PowerShell
Get-PhysicalDisk
$Server = "axnode1"
Get-PhysicalDisk -CimSession $Server | Format-Table -Property FriendlyName, Manufacturer, Model, SerialNumber, MediaType, BusType, SpindleSpeed, LogicalSectorSize, PhysicalSectorSize
From the screenshot you can see that the AX640 BOSS card reports as a SATA device with Unspecified MediaType, while the SAS disks are reported as SSDs with SAS BusType. Let's dive into BusType/MediaType a little bit (see the table below).
Storage Spaces requires BusType SATA/SAS/NVMe or SCM. BusType RAID is unsupported.
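A minimal sketch (example node name assumed) to group disks by bus/media type and check which ones are still eligible for pooling:

```powershell
# Disk counts per BusType/MediaType and pooling eligibility
$Server = "axnode1"
Get-PhysicalDisk -CimSession $Server |
    Group-Object -Property BusType, MediaType -NoElement

Get-PhysicalDisk -CimSession $Server -CanPool $true |
    Format-Table -Property FriendlyName, BusType, MediaType, CanPool, CannotPoolReason
```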
You can also see the Logical Sector Size and Physical Sector Size. These refer to the drive type (4K native vs. 512E vs. 512-byte native).
“LogicalSectorSize” value | “PhysicalSectorSize” value | Drive type |
---|---|---|
4096 | 4096 | 4K native |
512 | 4096 | Advanced Format (also known as 512E) |
512 | 512 | 512-byte native |
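To check what a volume actually reports, a minimal sketch (run elevated, C: as an example volume):

```powershell
# Reports logical/physical sector sizes as seen by the file system
fsutil fsinfo sectorinfo C:
```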
Reference
- https://learn.microsoft.com/en-US/troubleshoot/windows-server/backup-and-storage/support-policy-4k-sector-hard-drives
- https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/hh147334(v=ws.10)?redirectedfrom=MSDN
Storage Reliability Counter
Once a disk is added to Storage Spaces, S.M.A.R.T. attributes can be filtered out. To read the disk status (such as wear level, temperature, …), the Get-StorageReliabilityCounter cmdlet can be used.
$Server = "axnode1"
Get-PhysicalDisk -CimSession $Server | Get-StorageReliabilityCounter -CimSession $Server | Format-Table -Property DeviceID, Wear, Temperature*, PowerOnHours, ManufactureDate, ReadLatencyMax, WriteLatencyMax, PSComputerName
Performance results
From the results below you can see that SATA vs SAS vs NVMe is 590092 vs 738507 vs 1496373 IOPS (4k, 100% read). All measurements were done with VMFleet 2.0: https://github.com/DellGEOS/AzureStackHOLs/tree/main/lab-guides/02-TestPerformanceWithVMFleet
Keep in mind that the SAS and SATA configurations also differ in disk count (8 vs 4 disks in each node). The difference between SAS and NVMe is more than double.
- AX6515 - 2nodes, 16 cores and 4xSATA SSDs each
- AX6515 - 2nodes, 16 cores and 4xSATA SSDs each, secured core and deduplication enabled
- AX6515 - 2nodes, 16 cores and 4xSATA SSDs each, secured core enabled
- AX6515 - 2nodes, 16 cores and 4xSATA SSDs each, secured core & BitLocker enabled
- AX6515 - 2nodes, 16 cores and 8xSAS SSDs each
- R640 - 2nodes, 32 cores and 8xNVMe SSDs each