We want to run multi-drive systems we have in a JBOD mode, where each drive is basically a filesystem to itself. With the drives we currently have, we expect to have multiple failures, primarily unrecoverable ECC read errors or sometimes the drive just dying altogether.

How does ext[23] handle these two primary conditions? Using them in a software RAID mode, I have sometimes seen problems with disks hang all access to the filesystem and even the entire system, but I'm not sure at what level that's happening (low-level driver? scsi layer? raid layer? filesystem layer?).

If I have a drive fail taking out the entire ext3 filesystem, will I be able to stop using the filesystem (say, my application gets the error from the fs indicating some sort of problem in whatever system call it's made, who cares what), forcibly unmount the filesystem, and replace the drive? Or will the system panic? Or worse, will my application just enter an uninterruptible sleep never to return success or error?

Obviously, we'll be doing our own testing, but any knowledge of these scenarios would be most appreciated.

Philip

* Philip Molter
* Texas.Net Internet
* http://www.texas.net/ * philip at texas.net
On Mar 17, 2004 19:15 -0600, Philip Molter wrote:
> We want to run multi-drive systems we have in a JBOD mode, where
> each drive is basically a filesystem to itself. With the drives
> we currently have, we expect to have multiple failures, primarily
> unrecoverable ECC read errors or sometimes the drive just dying
> altogether.
>
> How does ext[23] handle these two primary conditions? Using them
> in a software RAID mode, I have sometimes seen problems with disks
> hang all access to the filesystem and even the entire system, but
> I'm not sure at what level that's happening (low-level driver?
> scsi layer? raid layer? filesystem layer?).

This is entirely an issue with the bus or SCSI layer, and not the filesystem.

> If I have a drive fail taking out the entire ext3 filesystem, will
> I be able to stop using the filesystem (say, my application gets
> the error from the fs indicating some sort of problem in whatever
> system call it's made, who cares what), forcibly unmount the
> filesystem, and replace the drive? Or will the system panic? Or
> worse, will my application just enter an uninterruptible sleep
> never to return success or error?

Of all Linux filesystems, I think you'll find that ext2/ext3 probably handle media and device errors the most gracefully (i.e. not panicking because of cascading errors, unless you want that with errors=panic).

Whether you'll be able to unmount is really dependent on a lot of factors, so it's hard to comment. When our storage servers (running ext3) have some catastrophic disk problem we can usually unmount.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
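For the case where a failure does surface as an error from a system call, the following is a minimal Python sketch of how an application on a JBOD layout might stop using one drive while continuing with the others. The names `classify_os_error`, `try_read`, and the `retired` set are hypothetical, and the choice of errno values treated as "drive failure" (EIO, EROFS, ENODEV) is an assumption for illustration, not something stated in this thread. It does not help when the call never returns at all, which is the uninterruptible-sleep case raised in the question.

```python
import errno
import os

# Assumption: these errno values are treated as "the drive or its filesystem
# is no longer usable". EIO is a low-level I/O error; EROFS is what writes see
# once a filesystem has been remounted read-only; ENODEV means no such device.
DRIVE_FAILURE_ERRNOS = frozenset({errno.EIO, errno.EROFS, errno.ENODEV})


def classify_os_error(exc):
    """Return 'drive-failure' for errnos in the set above, else 'other'."""
    if exc.errno in DRIVE_FAILURE_ERRNOS:
        return "drive-failure"
    return "other"


def try_read(path, length, retired, mount_point):
    """Read up to `length` bytes from `path`.

    Returns None if `mount_point` is already retired, or if this read hits a
    drive-failure errno (in which case `mount_point` is added to `retired`).
    Any other OSError is re-raised to the caller.
    """
    if mount_point in retired:
        return None
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            return os.read(fd, length)
        finally:
            os.close(fd)
    except OSError as exc:
        if classify_os_error(exc) == "drive-failure":
            retired.add(mount_point)  # stop sending I/O to this drive
            return None
        raise
```

Once a mount point is in `retired`, the application no longer touches it, which is a precondition for attempting the unmount and drive replacement discussed above.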
Leandro Guimarães Faria Corsetti Dutra
2004-Mar-19 15:03 UTC
How does ext3 handle drive failures?
On Wed, 17 Mar 2004 19:15:09 -0600, Philip Molter wrote:
> Using them in a
> software RAID mode, I have sometimes seen problems with disks hang all
> access to the filesystem and even the entire system, but I'm not sure at
> what level that's happening

Check my posts to this list... there is some nasty interaction between soft RAID and ext3 since 2.5.X, already reported by several people here, at linux-kernel and linux-raid. Until now these reports have gone unanswered, presumably either because the people In The Know are busy trying to reproduce and diagnose it or because there are even more critical -- or more interesting -- things to do.

So for now the options are either not using software RAID, not using 2.6, or not using ext3. If you can afford it, I'd suggest hardware RAID, and SCSI if you're really rich.

--
Leandro Guimarães Faria Corsetti Dutra  +55 (44) 3028 7467
WebLink Tecnologia  +55 (44) 269 71 78
Maringá, PR  BRAZIL