Arthur Perry
2004-May-27 14:11 UTC
(regards to ext3 and RAID) Re: Linux consultation needed (fwd)
---------- Forwarded message ----------
Date: Thu, 27 May 2004 08:43:30 -0400 (EDT)
From: Arthur Perry <alp at perryconsulting.net>
To: ext3 at linuxfarms.com
Subject: (regards to ext3 and RAID) Re: Linux consultation needed (fwd)

---------- Forwarded message ----------
Date: Thu, 27 May 2004 08:11:17 -0400 (EDT)
From: Arthur Perry <alp at perryconsulting.net>
To: Christopher Welton <cwelton at jumpnowusa.com>
Cc: ext3-users at redhat.com
Subject: (regards to ext3 and RAID) Re: Linux consultation needed (fwd)

Hi Chris,

I put the whole thread into ext3-users at redhat.com, so others can benefit.

e2fsck will repair filesystem damage, but it will do nothing for the RAID
container.
If your damage exists primarily on the RAID container, then what you want
to do is repair that first!
Otherwise, you may be making corrections to a filesystem and writing those
changes to what could be mapped as bad blocks, and only worsening your
situation.

It all depends on what kind of RAID container you have, the type of
damage, and the extent of damage.

So the simple answer is that the filesystem check MAY appear to fix it for
you in the really short term, or it may not. It's really all about your
RAID container first.

In my experience, we have had bad luck with RAID5 on certain controllers.
I do not know anything about the particular one you have listed below.

I would be sure to back up anything you can with a rescue disk boot before
making any changes to your disk, simply because at this time, it is
unknown exactly what you are dealing with.

Best Regards,
Art Perry

On Wed, 26 May 2004, Christopher Welton wrote:

> Thanks Art, your response is useful. I have posted my original question
> to the RH 7.3 and RH ext3 lists. Feel free to post your question and/or
> answers.
>
> One thing I would like to know is how to recover from the damage to the
> RAID container scenario. Is this just a matter of running e2fsck? Pls
> let me know.
>
> Thanks for your help!
>
> Chris
>
> On Wed, 2004-05-26 at 17:20, Arthur Perry wrote:
> > Hi Chris,
> >
> > I just wanted to know if you have tried to fix this problem yet, and if
> > any of my info and suggestions were helpful.
> >
> > Best of Luck,
> > Art Perry
> >
> > On Tue, 25 May 2004, Arthur Perry wrote:
> >
> > > Oh, I just wanted to add a very important suggestion:
> > > Do not run fsck or anything that may modify the filesystem on that
> > > container until you have attempted to back up the important data first!
> > > At least, not until you are confident that what you are dealing with here
> > > is not in any way related to damage to that container.
> > > In theory, the fsck may work out fine, but we (at least I, not being there)
> > > are not sure about what is really going on.
> > >
> > > In practice, when there is any question about a medium's integrity, don't
> > > do anything further to it until the important data has been extracted from
> > > it and backed up as best you can.
> > > I would mount read-only.
> > >
> > > Just a heads up!
> > >
> > > ---------- Forwarded message ----------
> > > Date: Tue, 25 May 2004 12:14:16 -0400 (EDT)
> > > From: Arthur Perry <alp at perryconsulting.net>
> > > To: Christopher Welton <cwelton at jumpnowusa.com>
> > > Cc: alp at linuxfarms.com
> > > Subject: Re: Linux consultation needed
> > >
> > > Hi Christopher,
> > >
> > > By your description, I do believe that I can fix this problem and it may
> > > be rather easy.
> > >
> > > Unfortunately, I live in Massachusetts and it may be rather expensive to
> > > get me down there.
> > > At the moment, I am working full-time for a large international computer
> > > corporation. Making that trip would cut into my vacation time, so I would
> > > have to be compensated for that properly.
> > >
> > > Off the top of my head, I see two possible scenarios:
> > >
> > > 1) damage to the RAID container
> > > If you were able to mount the filesystem with a rescue system (and you
> > > mounted it as ext3, not ext2), then I can assume that the filesystem
> > > believes that it is in order. (One would know for sure once you run fsck.)
> > > However, that does not mean the underlying block layer is OK.
> > > The RAID container presents itself to the OS as a uniform block device,
> > > which the filesystem sits on top of.
> > > How each block gets distributed across the physical disks is entirely up
> > > to the RAID hardware, and identifying whether or not a problem exists with
> > > the RAID container is also the responsibility of the RAID hardware.
> > > That being said, if the OS has no drivers that directly interface
> > > with the RAID hardware to collect the status of the container, there is no
> > > way you could tell, unless this hardware has some sort of beep or buzzer
> > > to warn you of this. You could also enter the setup screen of the RAID
> > > controller at boot time and check the health of the container there.
> > >
> > > The reason I think this is a possibility is that there may be
> > > "corrupted" blocks in this RAID container that exist in areas that are
> > > read during the boot process (and I use this term generally), and not
> > > necessarily in locations where the kernel itself resides.
> > > The kernel has such a small footprint that damage there is unlikely, and
> > > it probably did not happen, because the kernel would not have been able
> > > to decompress completely and continue execution.
> > > If this were the case, the failure mode would probably be more severe.
> > >
> > > 2) damage to the hardware
> > > It is possible that the hardware has somehow become damaged or changed,
> > > so that when Kudzu performs its hardware probe at boot time, it locks
> > > up the system entirely.
> > > An example of this is a bad DIMM or possibly some other peripheral.
> > > Maybe there is a hung-up SCSI device on the chain that is on a separate
> > > power supply that just needs to be "reset" during the next reboot.
> > >
> > > If you were able to mount the filesystem from a rescue disk (as ext3, not
> > > ext2), then there is no reason why it would not work at boot time, provided
> > > the configs (/etc/fstab and kernel parameters for root) have not been
> > > changed.
> > > Therefore, the journal is probably fine and it is not the root cause.
> > > We have already ruled out kernel damage.
> > >
> > > My suggestion:
> > > 1) Check the RAID container status in the setup screen (USE GREAT
> > > CAUTION!! DO NOT CHANGE ANYTHING OR MAKE IT DO ANYTHING THAT YOU DO NOT
> > > UNDERSTAND OR YOU CAN LOSE ALL OF YOUR DATA!!!).
> > > This can give you a rough idea of what may be going on.
> > >
> > > You can begin recovery by:
> > > 1) Get another disk onto the machine that is large enough to store the
> > > data that you think is necessary to salvage.
> > > 2) Boot into that rescue image again, mount your RAID container, and
> > > slowly (a little at a time) copy over the files that you need to
> > > the new disk.
> > >
> > > There may be more possibilities here, but I am just going by the
> > > information presented and the first things that come to mind.
> > >
> > > I wish you good luck.
> > > If you think you really need my assistance, I can fly out there. We will
> > > just have to go over the costs.
> > > If this is enough to help you on your way, then that is great.
> > >
> > > Also, I would like to post this back onto the newsgroups just so that
> > > other people who experience the same problem can benefit, if that is OK
> > > with you.
> > > I have been working with Linux professionally for over 7 years, and have
> > > not contributed back to the community much at all. ;)
> > >
> > > Let me know if this helps, or if you would like to move forward with
> > > consultation.
> > >
> > > Best Regards,
> > > Art Perry
> > > alp at perryconsulting.net
> > > http://www.perryconsulting.net
> > > http://www.linuxfarms.com
> > >
> > > On Tue, 25 May 2004, Christopher Welton wrote:
> > >
> > > > Arthur:
> > > >
> > > > I read your reply to a problem in the redhat ext3 mailing list. I'd like
> > > > to request your assistance in solving a serious problem we are having
> > > > with one of our servers. Here are the details:
> > > >
> > > > I run a RH 7.3 installation on a Compaq Proliant 6500 with dual Pentium
> > > > 266MHz processors, approx. 630MB RAM and a hardware SMART-2DH RAID
> > > > controller and array. All file systems are ext3.
> > > >
> > > > The server has been in service for a couple of years now. From time to
> > > > time we will lose power in our office or have another situation that
> > > > causes the server to lose power without a proper shutdown. We had such a
> > > > situation today.
> > > >
> > > > Usually the server reboots to runlevel 5 without a problem. However,
> > > > today the server rebooted to the point in the boot process just after
> > > > mounting the root filesystem. It then stalls indefinitely and does not
> > > > continue to boot.
> > > >
> > > > I used a recovery CD to boot into a rescue shell. Once there I
> > > > successfully mounted all the partitions on all drives and examined the
> > > > files successfully, so the data, filesystem and hardware all look good.
> > > > The filesystems were mounted ext2, not ext3.
> > > >
> > > > At this point, my suspicions are that some portion of the kernel
> > > > required for booting was damaged during the power-loss shutdown or that
> > > > the ext3 journal was damaged in such a way as to block booting.
> > > >
> > > > I need suggestions on possible causes of the problem and, better yet,
> > > > possible solutions.
> > > >
> > > > Pls let me know:
> > > > 1. If you think you can solve this issue.
> > > > and
> > > > 2. What rate you would bill at
> > > >
> > > > Thank you in advance
> > > >
> > > > Chris Welton
> > > > Owner
> > > > JumpNowUSA!
> > > > 562-946-6683
> > > > cwelton at jumpnowusa.com
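
The "copy a little at a time" advice in the message above can be sketched roughly as follows. This is only an illustration, not anything Art posted: the function name salvage and the paths in the usage note are made up, and it assumes the RAID container has already been mounted read-only from the rescue system (for example with mount -o ro) and that the spare disk is mounted somewhere else. The idea is to copy a short, explicit list of files and to keep going when one of them cannot be read, so that a single bad block does not stop the whole rescue.

```python
import os
import shutil


def salvage(src_root, dst_root, wanted):
    """Copy the relative paths listed in `wanted` from src_root to dst_root.

    Only dst_root is written to; the source is only read. A file that
    cannot be copied is recorded and skipped rather than aborting the run.
    Returns (copied, failed), where failed holds (path, reason) pairs.
    """
    copied, failed = [], []
    for rel in wanted:
        src = os.path.join(src_root, rel)
        dst = os.path.join(dst_root, rel)
        try:
            # Recreate the directory layout on the rescue disk as needed.
            os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
            # copy2 also tries to preserve timestamps and permission bits.
            shutil.copy2(src, dst)
            copied.append(rel)
        except OSError as exc:
            # Missing file, read error, full disk, and so on.
            failed.append((rel, str(exc)))
    return copied, failed
```

With hypothetical mount points this might be called as salvage("/mnt/raid", "/mnt/rescue", ["etc/fstab", "home/chris/data.db"]), one small batch at a time, checking the failed list after each batch before moving on.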
Theodore Ts'o
2004-May-27 19:04 UTC
(regards to ext3 and RAID) Re: Linux consultation needed (fwd)
The one thing I would add to this is that if you are using any kind of
partitioning scheme on top of the hardware RAID device, make sure your
partition table is sane first. If the starting block or the size of the
partition is wrong, running e2fsck will also do much more damage.

In general, the rule is:

1) Make sure the block device is sane.
2) Make sure the partition table is sane.
3) Run e2fsck.

If you're not sure, you can try running e2fsck -n first, or making a
full image backup first.

- Ted

On Thu, May 27, 2004 at 10:11:12AM -0400, Arthur Perry wrote:
> e2fsck will repair filesystems damage, but it will do nothing for the RAID
> container.
> If your damage exists primarily on the RAID container, then what you want
> to do is repair that first!
> Otherwise, you may be making corrections to a filesystem and writing those
> changes to what could be mapped as bad blocks, and only worsening your
> situation.
> [...]
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
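
As a rough illustration of step 2 in Ted's rule, the arithmetic behind "is the partition table sane" might look like the sketch below. It is not taken from e2fsprogs or any partitioning tool: the function name, the tuple layout, and the sector counts are all invented for the example, and it assumes a flat list of primary partitions (logical partitions inside an extended partition would be flagged as overlapping their container, so they would need separate handling). It checks only that every partition lies inside the block device and that no two partitions overlap.

```python
def partition_problems(device_sectors, partitions):
    """Return a list of human-readable problems found in a partition layout.

    device_sectors: total size of the block device, in sectors.
    partitions: list of (name, start_sector, sector_count) tuples.
    An empty list means no problem was found by these two checks.
    """
    problems = []

    # Check 1: every partition must lie entirely inside the device.
    for name, start, count in partitions:
        if start < 0 or count <= 0:
            problems.append(f"{name}: non-positive start or size")
        elif start + count > device_sectors:
            problems.append(f"{name}: ends past the end of the device")

    # Check 2: partitions sorted by start sector must not overlap.
    ordered = sorted(partitions, key=lambda p: p[1])
    for (a, a_start, a_count), (b, b_start, _) in zip(ordered, ordered[1:]):
        if a_start + a_count > b_start:
            problems.append(f"{a} overlaps {b}")

    return problems
```

Only if the list comes back empty (and the RAID container itself has been checked) would it make sense to move on to a read-only pass with e2fsck -n, which opens the filesystem read-only and answers "no" to every question.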