How reliable is a RAID system? What is the possibility of RAID system failure?

aTun@lemm.ee · 2 days ago

How reliable is a RAID system? What is the possibility of RAID system failure?

Moonrise2473@feddit.it · 2 days ago

RAID wasn’t designed for data safety but to minimize downtime: just swap the failed drive and continue operating the server seamlessly. Full backups are still required, as the chance of complete failure isn’t zero.

computergeek125@lemmy.world · 2 days ago

For recovering hardware RAID, the most reliable route is a compatible controller with a similar enough firmware version. You might be able to find software that can stitch images back together, but that’s a long shot and requires a ton of disk space (which you might not have if it’s your biggest server).

I’ve used dozens of LSI-based RAID controllers in Dell servers (both PERC-branded and LSI-branded) for work and homelab. They usually recover the old array onto the new controller pretty well, and they generally have a much lower failure rate than the drives themselves (I find myself replacing the cache battery more often than the controller itself).

Only twice, out of the handful of times I’ve had to do this, did I move to a RAID controller from a different generation:

First time, from an R815 with a failed motherboard (PERC H700), physically moving the disks to an R820 (PERC H710, might’ve been an H710P); the disks foreign-imported easily.
Second time, on homelab, I went from an H710 Mini Mono to a full-size H730P in the same chassis (don’t do that, it was a bad idea), but aside from iDRAC being very pissed off, the card ran for years with the same RAID-1 array imported.

As others have pointed out, this is where backups come into play. If you have to replace the server with one from a different generation, you run the risk that the drives won’t import. At that point, you’d have to sanitize the superblock of the array and re-initialize it as a new array, then restore from backup. Now, the array might be just fine and you never notice a difference (like my users who had to replace a failed R815 with an R820), but the outcome tends to be all-or-nothing: it either works or it faults, with no in between.

Standalone RAID controllers are usually pretty resilient and fail less often than disks, but they are very much NOT infallible, as you are correct to assess. The advantage of software systems like mdadm, ZFS, and Ceph is that they remove the precise hardware compatibility requirements, but by no means do they remove the software compatibility requirements - you’ll still have to do your research and make sure the new version is compatible with the old format, or make sure it’s the same version.

All that said, I don’t trust embedded motherboard RAIDs to the same degree that I trust standalone controllers. About 8-10 years ago, a friend of mine ran a RAID-0 on a laptop that got its superblock borked when we tried to firmware update the SSDs - it stopped detecting the array at all. We did manage to recover the data, but it needed multiple times the raw amount of storage to do so.

We made byte images of both disks with ddrescue to a server that had enough spare disk space.
We found a software package that could stitch together images with broken superblocks if we knew the order the disks were in (we did), which wrote a new byte image back to the server.
We copied the result again and turned it into a KVM VM to network attach and copy the data off (we could have loop mounted the disk to an SMB share and been done, but it was more fun and rewarding to boot the recovered OS afterwards as kind of a TAKE THAT LENOVO…we were younger).
In total it took a bit over 3TB to recover the 2x500GB disks to a usable state, and about a week of combined machine and human time to engineer and cook. During that time my friend opted to rebuild his laptop clean once we had the images captured - one disk Windows, one disk Linux, not RAID-0 this time :P
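For anyone curious what "stitching" RAID-0 images involves, here is a minimal Python sketch of the idea, not the package used above. It assumes the member images contain only striped data (no controller metadata or offsets), that the chunk size and disk order are known, and all names (`stripe_raid0`, `destripe_raid0`, the 4-byte chunk size) are made up for illustration. Real arrays use chunk sizes like 64 KiB or more and usually reserve space for metadata.

```python
import io


def stripe_raid0(data, n_disks, chunk_size):
    """Split data into chunks and deal them round-robin onto n_disks (demo helper)."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk_size):
        disks[(i // chunk_size) % n_disks] += data[i:i + chunk_size]
    return [bytes(d) for d in disks]


def destripe_raid0(members, out, chunk_size):
    """Rebuild the logical volume by reading one chunk from each member in disk order."""
    while True:
        wrote_any = False
        for member in members:
            chunk = member.read(chunk_size)
            if chunk:
                out.write(chunk)
                wrote_any = True
        if not wrote_any:  # every member image is exhausted
            break


# Tiny in-memory stand-in for two disk images (hypothetical data).
chunk_size = 4
original = b"AAAABBBBCCCCDDDDEEEEFFFF"
disk0, disk1 = stripe_raid0(original, 2, chunk_size)

out = io.BytesIO()
destripe_raid0([io.BytesIO(disk0), io.BytesIO(disk1)], out, chunk_size)
recovered = out.getvalue()
print(recovered == original)
```

With real images you would open the files in binary mode instead of using `io.BytesIO`. Each stage keeps a full-size copy, which is presumably how 2x500GB grew to about 3TB: the raw images, the stitched image, and the copy that became the VM disk.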

SayCyberOnceMore@feddit.uk · 2 days ago

I can confirm that moving the disks to a very similar device will work.

We recovered “enough” data from what disks remained of a Dell server that was dropped (PSU side down) from a crane. The server was destroyed, most of the disks had moved further inside the disk caddy which protected them a little more.

It was fun to struggle with that one for ~1 week

And the noise from the drives…

anamethatisnt@lemmy.world · 2 days ago

RAID is never a replacement for backups.
Never work directly with a surviving disk; clone it and work with the clone.
Are you sure you can’t rebuild the RAID? That really is the best solution in many cases.
If a RAID failure is within tolerance (1 drive in a RAID5 array) then it should still be operational. Make a backup before rebuilding if you don’t have one already.
If more disks are gone than the array tolerates, don’t count on recovering all the data, even with data recovery tools.

B0rax@feddit.org · 2 days ago

Lots of people have moved away from RAID entirely because of some of these issues. There are alternatives these days, for example mergerfs or the ZFS file system.

Theoriginalthon@lemmy.world · 2 days ago

Can confirm that moving a ZFS array to a new system after a failure is as simple as connecting the disks and running zpool import -f <pool_name>

Every RAID card I use now is put in HBA mode; it’s just simpler to deal with.

NeoNachtwaechter@lemmy.world · 2 days ago

recover data from unfunctioned remaining RAID disks due to RAID controller failure

In this case, you need a new RAID controller of similar type.

Can I even simply attach one of the RAID 1 disk to the desktop system

No. One disk out of a RAID array is different from a normal disk.

Recovery becomes easy if you do not use a hardware RAID controller, but a ZFS software RAID instead. It does nearly everything automatically, but you need to read a few more tutorials for the first setup.

tburkhol@lemmy.world · 2 days ago

RAID is more likely to fail than a single disk. You have the chance of single-disk failure, multiplied by the number of disks, plus the chance of controller failure.

RAID 1 and RAID 5 protect against that by storing redundant data across multiple disks, so you can re-create a failed drive, but failure of the controller may be unrecoverable, depending on the availability of a new, identical controller. With failure of 1 disk in RAID 1, you should be able to use the array ‘degraded,’ as long as your controller still works. Depending on how the controller works, that disk may or may not be recognizable to another system without the controller.

RAID 1 disks are not just 2 copies of normal disks. Example: I use software RAID 1, and if I take one of the drives to another system, that system recognizes it as a RAID disk and creates a single-disk, degraded RAID array with it. I can mount the array, but if I try to mount the single disk directly, I get filesystem errors.
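On the earlier point about re-creating a failed drive: RAID 5 does this with XOR parity. The Python sketch below is a toy illustration of that principle for a single stripe, not how any particular controller or mdadm lays data out on disk. The block contents and the helper name `xor_blocks` are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data disks (hypothetical contents, 8 bytes each).
data_blocks = [b"alpha---", b"bravo---", b"charlie-"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Pretend the second disk died: XOR the survivors with the parity block
# and the missing block falls out.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

This only works while exactly one block per stripe is missing, which is why a second disk failure during a rebuild is fatal for RAID 5.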

catloaf@lemm.ee · 2 days ago

RAID is more likely to fail than a single disk. You have the chance of single-disk failure, multiplied by the number of disks, plus the chance of controller failure.

This is poorly phrased. A RAID with a bad disk is not failed; it is degraded. The entire array is not more likely to fail than a single disk.

Yes, you are more likely to experience a disk failure, but like you said, only because you have more disks in the first place. (However, there is also the phenomenon where, after replacing a failed disk, the additional load during the rebuild might cause a second disk to fail, which is why you should replace failed disks as soon as possible. And have backups.)
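To put rough numbers on the difference between "a disk fails" and "the array fails", here is a small Python sketch. It assumes disk failures are independent, uses a made-up per-disk failure probability of 3% over some period, assumes no disk is replaced during that period, and ignores controller failure and rebuild stress entirely. The function names are hypothetical, and the results are illustrative rather than a reliability model.

```python
def p_any_disk_fails(p, n):
    """Chance that at least one of n disks fails (this is also RAID 0 data loss)."""
    return 1 - (1 - p) ** n


def p_mirror_loses_data(p):
    """Two-disk RAID 1 loses data only if both disks fail."""
    return p ** 2


def p_raid5_loses_data(p, n):
    """RAID 5 loses data if two or more of its n disks fail."""
    p_none = (1 - p) ** n
    p_exactly_one = n * p * (1 - p) ** (n - 1)
    return 1 - p_none - p_exactly_one


p = 0.03  # assumed per-disk failure probability, not a measured figure
n = 4     # assumed number of disks in the array

print(f"single disk fails:          {p:.4f}")
print(f"some disk in the set fails: {p_any_disk_fails(p, n):.4f}")
print(f"2-disk RAID 1 loses data:   {p_mirror_loses_data(p):.4f}")
print(f"{n}-disk RAID 5 loses data:   {p_raid5_loses_data(p, n):.4f}")
```

Under these assumptions you are more likely to see some disk fail as the array grows, while the redundant arrays are less likely to lose data than a single disk. Correlated failures (same batch, rebuild load) make the independence assumption optimistic, which is another reason to have backups.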

Nicht BurningTurtle@feddit.org · 2 days ago

Are there differences in the context of failure, when using a controller vs software raid with mdadm?

catloaf@lemm.ee · 2 days ago

With software RAID, there is no controller to fail.

Well, that’s not strictly true, because you still have a SATA/SAS controller, HBA, backplane, or whatever, but they’re more easily replaceable. (Unless it’s integrated in the motherboard, but then it’s not a separate component to fail.)