Disk Sub-System & Storage Architecture

Design of storage layers that balance throughput, IOPS and durability for HPC workloads.

Service description

This service looks at the layers below the parallel file system or shared storage: RAID, HBAs, NVMe and caching tiers. We evaluate RAID levels, disk group sizing and rebuild characteristics so that the system survives real-world failures without unacceptable performance drops.
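As a rough illustration of the capacity and rebuild arithmetic involved, the sketch below estimates the usable capacity of one parity RAID group and a lower bound on rebuild time. The function names, the 10-disk RAID 6 group, the 16 TB disk size and the 100 MB/s sustained rebuild rate are all hypothetical assumptions chosen for the example; real rebuild rates depend on the controller, the workload running during the rebuild and any throttling policy.

```python
def usable_capacity_tb(disks: int, parity: int, disk_tb: float) -> float:
    """Usable capacity of one parity RAID group, ignoring file system overhead."""
    if not 0 <= parity < disks:
        raise ValueError("parity must be >= 0 and smaller than the disk count")
    return (disks - parity) * disk_tb


def rebuild_hours(disk_tb: float, rebuild_mb_s: float) -> float:
    """Lower bound on rebuild time: the replacement disk is rewritten in full
    at a sustained rate. Decimal units are assumed (1 TB = 1,000,000 MB)."""
    if rebuild_mb_s <= 0:
        raise ValueError("rebuild rate must be positive")
    return disk_tb * 1_000_000 / rebuild_mb_s / 3600


if __name__ == "__main__":
    # Hypothetical group: 10 x 16 TB disks, two of them parity (RAID 6).
    capacity = usable_capacity_tb(disks=10, parity=2, disk_tb=16.0)
    hours = rebuild_hours(disk_tb=16.0, rebuild_mb_s=100.0)
    print(f"usable capacity: {capacity:.0f} TB")
    print(f"rebuild time (lower bound): {hours:.1f} h")
```

The rebuild figure is a best case. Under production load the effective rate is usually lower, which is why the estimate is treated as a floor rather than a prediction.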

Where appropriate, we use ZFS to provide checksums, compression and flexible datasets for scratch and project spaces. For NVMe-heavy deployments we examine wear patterns, over-provisioning and the impact of different I/O schedulers.
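To make the wear question concrete, the sketch below converts a drive's DWPD (drive writes per day) rating into total terabytes written and estimates how long that budget lasts at a given daily write volume. The function names and the example figures (a 3.84 TB drive rated at 1 DWPD over 5 years, written at 2 TB per day) are hypothetical. The calculation assumes host writes are measured the same way as the vendor's rating and ignores any extra write amplification from the layers above the drive.

```python
def rated_tbw(dwpd: float, capacity_tb: float, warranty_years: float) -> float:
    """Total terabytes written implied by a DWPD rating over the warranty period."""
    return dwpd * capacity_tb * 365 * warranty_years


def years_of_endurance(tbw: float, host_writes_tb_per_day: float) -> float:
    """Years until the rated write budget is used up at a constant daily write volume."""
    if host_writes_tb_per_day <= 0:
        raise ValueError("daily write volume must be positive")
    return tbw / (host_writes_tb_per_day * 365)


if __name__ == "__main__":
    # Hypothetical drive: 3.84 TB, 1 DWPD, 5-year warranty.
    budget = rated_tbw(dwpd=1.0, capacity_tb=3.84, warranty_years=5)
    years = years_of_endurance(budget, host_writes_tb_per_day=2.0)
    print(f"rated write budget: {budget:.0f} TB")
    print(f"endurance at 2 TB/day: {years:.1f} years")
```

Comparing this estimate with the wear indicators the drives actually report shows whether a scratch tier is consuming endurance faster than planned.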

The deliverables include capacity and rebuild-time estimates, a recommended layout for new purchases and practical guidance on monitoring disk health beyond simple SMART status.

Diagram & case study


Case study – Avoiding a rebuild storm

An organisation had experienced several painful RAID rebuild events that slowed down the entire cluster for days. We analysed their disk group sizes, RAID levels and background scrub policies.

By redesigning the disk layout into smaller, more manageable groups and adjusting scrub schedules, we reduced worst-case rebuild time while keeping effective usable capacity almost unchanged. Future failures no longer caused cluster-wide slowdowns.
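One way to see why smaller groups limit the impact of a failure is to look at how many disks take part in a single rebuild. The sketch below is a simplified model with hypothetical numbers (120 disks of 16 TB, split into groups of 20 or 10), not the figures from this case study. It assumes conventional parity RAID in which every surviving member of the affected group is read in full, which is an upper bound.

```python
def degraded_fraction(total_disks: int, group_size: int) -> float:
    """Fraction of all disks that sit in the group being rebuilt."""
    if group_size <= 0 or total_disks % group_size != 0:
        raise ValueError("total_disks must be a positive multiple of group_size")
    return group_size / total_disks


def rebuild_read_tb(group_size: int, disk_tb: float) -> float:
    """Upper bound on data read during a rebuild: every surviving member in full."""
    if group_size < 2:
        raise ValueError("a parity group needs at least two disks")
    return (group_size - 1) * disk_tb


if __name__ == "__main__":
    # Hypothetical system: 120 x 16 TB disks, two candidate group sizes.
    for size in (20, 10):
        share = degraded_fraction(total_disks=120, group_size=size)
        read = rebuild_read_tb(group_size=size, disk_tb=16.0)
        print(f"group of {size}: {share:.1%} of disks degraded, "
              f"up to {read:.0f} TB read per rebuild")
```

The trade-off is parity overhead: at a fixed parity level, smaller groups spend a larger share of raw capacity on parity, so group size has to be weighed against usable capacity.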
