I had an interesting dilemma recently and I'm wondering if anyone here can shed light on why this happened.

I have a number of pools, including the root pool, on on-board disks in the server. I also have one pool on a SAN disk, outside the system. Last night the SAN crashed, and shortly thereafter, the ZFS system executed a number of cron jobs, most of which ran against the pool on the SAN. This caused a number of problems, most notably that when the SAN eventually came back up, those cron jobs finished, and then crashed the system again.

Only by running [i]zfs destroy[/i] on the newly created zfs file systems that the cron jobs had created was the system able to boot up again. As long as those corrupted zfs file systems remained on the SAN disk, not even the rpool would boot up correctly. None of the zfs file systems would mount, and most services were disabled. Once I destroyed the newly created zfs file systems, everything instantly mounted and all services started.

Question: why would those few zfs file systems prevent ALL pools from mounting, even though they are on different disks and file systems, and prevent all services from starting? I thought ZFS was more resistant to this sort of thing. I will have to edit my scripts and add a SAN check to make sure it is up before they execute, to prevent this from happening again. Luckily I still had all the raw data that the cron jobs were working with, so I was able to quickly re-create what the cron jobs did originally. Although this happened with Solaris 10, perhaps the discussion could be applicable to OpenSolaris as well (I use both).

-- 
This message posted from opensolaris.org
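[A minimal sketch of such a pre-flight guard, assuming a hypothetical pool name ("sanpool") and script path; it relies on `zpool list -H -o health`, which prints a pool's health (ONLINE, DEGRADED, UNAVAIL, ...) in scriptable form:]

```shell
#!/bin/sh
# san_check.sh -- refuse to proceed unless the named pool is healthy.
# The pool name and job path below are illustrative placeholders.

pool_is_healthy() {
    # zpool exits non-zero (and prints nothing) if the pool is unknown
    # or the command fails; any health other than ONLINE (DEGRADED,
    # UNAVAIL, ...) also fails the check.
    health=$(zpool list -H -o health "$1" 2>/dev/null) || return 1
    [ "$health" = "ONLINE" ]
}

if [ $# -gt 0 ]; then
    if pool_is_healthy "$1"; then
        exit 0                              # pool is up; safe to proceed
    else
        echo "$1 is not healthy; skipping job" >&2
        exit 1
    fi
fi
```

[The crontab entry would then chain it in front of the real job, e.g. `san_check.sh sanpool && /path/to/real_job.sh`, so the job simply does not start while the SAN is down.]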
On Wed, 2010-07-07 at 10:09 -0700, Mark Christooph wrote:
> I had an interesting dilemma recently and I'm wondering if anyone here can shed light on why this happened.
> [...]

This sounds like a bug. It would certainly be informative to see the backtrace of the panic from your logs. Also, it would be more informative if you were seeing this problem on OpenSolaris. Many of us will not be able to do much with a Solaris 10 backtrace, since we don't have access to the S10 sources.
- Garrett
On 07/ 7/10 11:33 AM, Garrett D'Amore wrote:
> On Wed, 2010-07-07 at 10:09 -0700, Mark Christooph wrote:
>> I had an interesting dilemma recently and I'm wondering if anyone here can shed light on why this happened.
>> [...]
>
> This sounds like a bug. It would certainly be informative to see the backtrace of the panic from your logs. Also, it would be more informative if you were seeing this problem on OpenSolaris. Many of us will not be able to do much with a Solaris 10 backtrace, since we don't have access to the S10 sources.

Also, the logs from /var/svc/log would be helpful.

Lori
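[A small sketch for gathering those, assuming a stock layout where SMF service logs live under /var/svc/log; the helper name and the tail length are illustrative, not a standard tool. `svcs -xv`, run separately, lists which services are in maintenance and points at the exact log file for each.]

```shell
#!/bin/sh
# collect_svc_logs.sh -- print the tail of each SMF service log so the
# relevant parts can be pasted into a bug report or mailing-list post.

collect_svc_logs() {
    dir="${1:-/var/svc/log}"        # default location on a stock install
    for f in "$dir"/*.log; do
        [ -f "$f" ] || continue     # skip if the glob matched nothing
        echo "==> $f <=="
        tail -n 20 "$f"
    done
}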