Large disk array crash course
This article is meant to be a crash course guide to the problems that may arise when building good, fast, cheap and large JBOD storage.
To give an idea of what is meant by "large", the setup looks like this:
- two 24-slot Supermicro chassis (JBOD only)
- 44 NL-SAS 2TB disks (7200RPM)
- two 4x SAS connectors on each chassis
- two controllers connected to PCIe8x
- 2x 8-core AMD Opteron 6140 processors (2.6GHz)
- 16GB DDR3 ECC RAM
- Debian squeeze 6.0.3 with SELinux
S.M.A.R.T.
To check whether all the disks were OK, we dumped the S.M.A.R.T. data of every disk (smartctl -a /dev/sdX) and ran dd to wipe the disks with zeroes a few times.
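For illustration, the per-disk check and wipe could look roughly like this (sdX is a placeholder for each of the 44 disks; the dd run is destructive, so double-check the device name first):
#> smartctl -a /dev/sdX
#> dd if=/dev/zero of=/dev/sdX bs=1M oflag=direct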
Good advice for newbies (such as me at that time) -
NL-SAS disks (wiki) are often
dual-port capable, which means you can access the same disk via two different paths.
In our case, using two controllers on one host should show 88 disk devices for the 44 physical disks.
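A quick way to confirm both paths are visible is simply to count the SCSI disk devices the kernel sees (a rough sketch; lsscsi comes from the lsscsi package):
#> lsscsi | grep -c disk
With dual-ported disks behind two controllers this should report 88, not 44.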
Side-note: smartmontools took more than a minute to start with 44 disks,
but watching them being scanned one by one looks nice :)
We identified two disks that looked suspicious (strangely high numbers in their SMART stats). Another lesson for those handling SAS disks for the first time: while IDE/SATA SMART attributes look self-explanatory when it comes to detecting "bad sectors", SAS disks report the same things under a different naming scheme. The most important attributes on SAS are:
- Non-medium error count - counts communication errors and errors in the disk electronics. Communication errors can be caused by a faulty slot in the chassis or by faulty electronics in the disk. To check which of the two applies, just re-plug the disk into a different slot; if the errors disappear, the problem is the slot or the way the disk was connected to it. In our case we had two disks reporting numbers in the thousands, increasing by hundreds every second. We tried re-plugging one disk into the same slot (to verify it was seated properly) and the other one into a different slot (to verify whether the whole slot was faulty). In neither case did this help, so the problem was in the disks themselves.
- Elements in grown defect list - the SAS counterpart of the 'Reallocated Sector Count' attribute known from IDE/SATA drives, but more detailed: the locations of the "bad sectors" can be displayed with the command sginfo -G /dev/sdX (see the quick check below).
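To keep an eye on just these two counters across many disks, a quick check like the following may help (the grep patterns match the attribute names as smartctl prints them for SAS disks):
#> smartctl -a /dev/sdX | grep -E 'Non-medium error count|grown defect list'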
Software raid creation (mdadm)
Another problem arose when we wanted to create a large software array over all the disks. The mdadm command looked like this:
#> mdadm --create /dev/md4 -l10 -pn2 -n44 /dev/disk1 /dev/disk2 /dev/disk3 ...
mdadm: invalid number of raid devices: 44
Hmm, recounting the arguments didn't help; there were 44 disks, so why the error? Later investigation by a colleague explained the behavior -
mdadm by default uses the old metadata format 0.9, which can only create an array of up to 27 disks. Giving it more makes it ignore the disks beyond that limit (the 28th, 29th, ...). Solution: tell mdadm to use a newer metadata format.
#> mdadm --create /dev/md4 -l10 -pn2 -e 1.2 -n44 /dev/disk1 /dev/disk2 /dev/disk3 ...
The only difference compared to the first command is the -e 1.2 part, which specifies that we want to use the metadata 1.2 format, and it leads to a successful array creation.
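A quick sanity check after creation (not from the original run, just the usual way to confirm the array came up with all members and the expected metadata version):
#> cat /proc/mdstat
#> mdadm --detail /dev/md4
mdadm --detail should report metadata version 1.2 and 44 raid devices.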
Software raid re-sync speed issues
OK, we have a big software array, so let the re-sync begin. We adjusted the maximum re-sync speed to 10x the default (so roughly 2 GB/s) and watched how it syncs. After some time came the disappointment: the re-sync speed of the 44-disk array was only ~620 MB/s. Splitting the big array into two 22-disk ones and running re-sync on both of them showed a total re-sync speed of ~1210 MB/s, which is significantly better. So the problem is hidden somewhere else, and it is probably the CPU: running re-sync on the 44-disk array made the md_raid10 process fully utilize one processor core (yes, it looks like md_raid10 is not multi-core ready). This started a discussion about slow re-sync with a large number of drives, which can be found here.
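For reference, the knob for the maximum re-sync speed lives in /proc (values are in KiB/s; 2000000 corresponds to the roughly 2 GB/s ceiling mentioned above):
#> cat /proc/sys/dev/raid/speed_limit_max
#> echo 2000000 > /proc/sys/dev/raid/speed_limit_max
#> watch cat /proc/mdstat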
Another piece of good advice from the hardware corner: even if an HBA controller fits into a PCIe4x slot, it doesn't mean that PCIe4x is enough for it. In the euphoria of migrating the controllers from the testing machine to the production servers we didn't realize that, and with the good intention of sparing the last free PCIe8x slot we plugged them into PCIe4x slots. The result? A re-sync speed of ~360 MB/s and useless blaming of the operating system and its drivers (don't do that, the error in such cases is usually somewhere else). And yes, moving the controllers to PCIe8x slots solved the problem.
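One way to catch this kind of mismatch early is to compare the link width the card is capable of with the width it actually negotiated (a generic check; replace the address placeholder with your controller's PCI address from lspci):
#> lspci -vv -s <bus:device.function> | grep -E 'LnkCap|LnkSta'
LnkCap shows the maximum width the device supports (e.g. Width x8), while LnkSta shows what it is actually running at (e.g. Width x4 when it ends up in a narrower slot).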
More to come later ... (filesystem testing, MD chunk size, LVM, RAID 010)