Large disk array crash course
This article is meant to be a crash course guide to the problems that may arise when building good, fast, cheap and large JBOD storage.
To give an idea of what is meant by "large", the setup looks like this:
- two 24-slot Supermicro chassis (JBOD only)
- 44 NL-SAS 2TB disks (7200RPM)
- two 4x SAS connectors on each chassis
- two controllers connected to PCIe8x
- 2x 8-core AMD Opteron 6140 processors (2.6GHz)
- 16GB DDR3 ECC RAM
- Debian squeeze 6.0.3 with SELinux
S.M.A.R.T.
To check whether all the disks were OK, we dumped the S.M.A.R.T. data of every disk (smartctl -a /dev/sdX) and ran dd to wipe the disks with zeroes a few times.
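For illustration, the per-disk check and wipe could look roughly like this (sdX is a placeholder for each of the 44 disks; the dd run is destructive, so double-check the device name first):
#> smartctl -a /dev/sdX
#> dd if=/dev/zero of=/dev/sdX bs=1M oflag=direct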
Good advice for newbies (such as me at that time) -
NL-SAS disks (wiki) are often
dual-port capable, which means you can access the same disk via two different paths.
In our case, using two controllers on one host should show 88 disk devices for the 44 physical disks.
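A quick way to confirm both paths are visible is simply to count the SCSI disk devices the kernel sees (a rough sketch; lsscsi comes from the lsscsi package):
#> lsscsi | grep -c disk
With dual-ported disks behind two controllers this should report 88, not 44.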
Side-note: smartmontools took more than a minute to start with 44 disks,
but watching them being scanned one by one looks nice :)
We identified two disks that looked suspicious (strangely high numbers in their SMART stats). Another lesson for those handling SAS disks for the first time: while IDE/SATA SMART attributes look self-explanatory when it comes to detecting "bad sectors", SAS disks report the same things under a different naming scheme. The most important attributes on SAS are:
- Non-medium error count - counts communication errors and errors in the disk electronics. Communication errors can be caused by a faulty slot in the chassis or by faulty electronics in the disk. To check which of the two applies, just re-plug the disk into a different slot; if the errors disappear, the problem is the slot or the way the disk was connected to it. In our case we had two disks reporting numbers in the thousands, increasing by hundreds every second. We tried re-plugging one disk into the same slot (to verify it was seated properly) and the other one into a different slot (to verify whether the whole slot was faulty). In neither case did this help, so the problem was in the disks themselves.
- Elements in grown defect list - the SAS counterpart of the 'Reallocated Sector Count' attribute known from IDE/SATA drives, but more detailed: the locations of the "bad sectors" can be displayed with the command sginfo -G /dev/sdX (see the quick check below).
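To keep an eye on just these two counters across many disks, a quick check like the following may help (the grep patterns match the attribute names as smartctl prints them for SAS disks):
#> smartctl -a /dev/sdX | grep -E 'Non-medium error count|grown defect list'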
Software raid creation (mdadm)
Another problem arose when we wanted to create a large software array over all the disks. The mdadm command looked like this:
#> mdadm --create /dev/md4 -l10 -pn2 -n44 /dev/disk1 /dev/disk2 /dev/disk3 ...
mdadm: invalid number of raid devices: 44
Hmm, recounting the arguments didn't help; there were 44 disks, so why the error? Later investigation by a colleague explained the behavior -
mdadm by default uses the old metadata format 0.9, which can only create an array of up to 27 disks. Giving it more makes it ignore the disks beyond that limit (the 28th, 29th, ...). Solution: tell mdadm to use a newer metadata format.
#> mdadm --create /dev/md4 -l10 -pn2 -e 1.2 -n44 /dev/disk1 /dev/disk2 /dev/disk3 ...
The only difference compared to the first command is the -e 1.2 part, which specifies that we want to use the metadata 1.2 format, and it leads to a successful array creation.
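A quick sanity check after creation (not from the original run, just the usual way to confirm the array came up with all members and the expected metadata version):
#> cat /proc/mdstat
#> mdadm --detail /dev/md4
mdadm --detail should report metadata version 1.2 and 44 raid devices.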
Software raid re-sync speed issues
OK, we have a big software array, so let the re-sync begin. We adjusted the maximum re-sync speed to 10x the default (so roughly 2 GB/s) and watched how it syncs. After some time came the disappointment: the re-sync speed of the 44-disk array was only ~620 MB/s. Splitting the big array into two 22-disk ones and running re-sync on both of them showed a total re-sync speed of ~1210 MB/s, which is significantly better. So the problem is hidden somewhere else, and it is probably the CPU: running re-sync on the 44-disk array made the md_raid10 process fully utilize one processor core (yes, it looks like md_raid10 is not multi-core ready). This started a discussion about slow re-sync with a large number of drives, which can be found here.
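For reference, the knob for the maximum re-sync speed lives in /proc (values are in KiB/s; 2000000 corresponds to the roughly 2 GB/s ceiling mentioned above):
#> cat /proc/sys/dev/raid/speed_limit_max
#> echo 2000000 > /proc/sys/dev/raid/speed_limit_max
#> watch cat /proc/mdstat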
Another piece of good advice from the hardware corner: even if an HBA controller fits into a PCIe4x slot, it doesn't mean that PCIe4x is enough for it. In the euphoria of migrating the controllers from the testing machine to the production servers we didn't realize that, and with the good intention of sparing the last free PCIe8x slot we plugged them into PCIe4x slots. The result? A re-sync speed of ~360 MB/s and useless blaming of the operating system and its drivers (don't do that, the error in such cases is usually somewhere else). And yes, moving the controllers to PCIe8x slots solved the problem.
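One way to catch this kind of mismatch early is to compare the link width the card is capable of with the width it actually negotiated (a generic check; replace the address placeholder with your controller's PCI address from lspci):
#> lspci -vv -s <bus:device.function> | grep -E 'LnkCap|LnkSta'
LnkCap shows the maximum width the device supports (e.g. Width x8), while LnkSta shows what it is actually running at (e.g. Width x4 when it ends up in a narrower slot).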
More to come later ... (filesystem testing, MD chunk size, LVM, RAID 010)