thr3ads.net - zfs discuss - [zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers [Jun 2009]


Brent Jones

2009-Jun-05 18:32 UTC

[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Hello all,
I had been running snv_106 for about 3 or 4 months on a pair of
X4540s.
I would ship snapshots from the primary server to the secondary server
nightly, which was working really well.

However, I have upgraded to 2009.06, and my replication scripts appear
to "hang" when performing zfs send/recv.
When one zfs send/recv process hangs, you cannot send any other
snapshots from any other filesystem to the remote host.
I have about 20 file systems that I snapshot and replicate nightly.

The script I use to perform the snapshots is here:
http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh

On the remote side, I end up with many "hung" processes, like this:

  bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02

And so on, one for each file system.

On the receiving end, 'zfs list' shows one filesystem attempting to
receive a snapshot, but I cannot stop it:

$ zfs list
NAME                                       USED  AVAIL  REFER  MOUNTPOINT
pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00



On the sending side, I CAN kill the ZFS send process, but the remote
side leaves its processes going, and I CANNOT kill -9 them. I also
cannot reboot the receiving system: at init 6, the system just
hangs trying to unmount the file systems.
I have to physically cut power to the server, but a couple of days later
the issue occurs again.


If I boot to my snv_106 BE, everything works fine; this issue has
never occurred on that version.

Any thoughts?

-- 
Brent Jones
brent at servuhome.net

Brent Jones

2009-Jun-05 21:45 UTC


[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Fri, Jun 5, 2009 at 2:28 PM, Mike La Spina <mike.laspina at laspina.ca> wrote:
> Hi,
>
> I have replications between hosts and they are working fine with zfs
> send/recv's after upgrading to Indiana snv_111b (2009.06).
>
> Have you run the commands manually to see whether any messages/prompts
> are occurring?
>
> It sounds like it's waiting for some input.
>
> Regards,
>
> Mike
>
> http://blog.laspina.ca/
>
If I power cycle the server, I can run the replication script manually.
The script then runs automatically for another night or two
before hanging again.
I've piped all output to a file, and there isn't any prompt for user
input, and the zfs receive on the remote side is un-killable (and
hangs the server when trying to restart).

It appears to be the receiving end choking on a snapshot, and not
allowing any more to run.
Once one snapshot freezes, running another (for a different file
system) zfs send/recv will just stall, with another un-killable zfs
receive.


-- 
Brent Jones
brent at servuhome.net

Rick Romero

2009-Jun-05 21:49 UTC


[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Fri, 2009-06-05 at 14:45 -0700, Brent Jones wrote:
> On Fri, Jun 5, 2009 at 2:28 PM, Mike La Spina <mike.laspina at laspina.ca> wrote:
> > Hi,
> >
> > I have replications between hosts and they are working fine with zfs
> > send/recv's after upgrading to Indiana snv_111b (2009.06).
> >
> > Have you run the commands manually to see any messages/prompts are
> > occurring?
> >
> > It sounds like its waiting for some input.
> >
> > Regards,
> >
> > Mike
> >
> > http://blog.laspina.ca/
> >
> 
> If I power cycle the server, I can run the replication script manually.
> The script will go automatically again for another night or two,
> before hanging up.
> I've piped all output to a file, and there isn't any prompt for user
> input, and the zfs receive on the remote side is un-killable (and
> hangs the server when trying to restart).
> 
> It appears to be the receiving end choking on a snapshot, and not
> allowing any more to run.
> Once one snapshot freezes, running another (for a different file
> system) zfs send/recv will just stall, with another un-killable zfs
> receive.
> 
Is it the version of ZFS?   I think it was upgraded.  I noticed
something similar after upgrading ZFS on FreeBSD 7 STABLE.  I was trying
to zfs send my @Tuesday, and an automatic script ran (which deletes
@Tuesday and takes a new snap) - and rather than failing as I expected,
the destroy and snapshot commands hung around until the send was done
(hosed up my incrementals - doh :)

Rick

Brent Jones

2009-Jun-05 21:55 UTC


[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Fri, Jun 5, 2009 at 2:49 PM, Rick Romero <rick at havokmon.com> wrote:
> On Fri, 2009-06-05 at 14:45 -0700, Brent Jones wrote:
>> On Fri, Jun 5, 2009 at 2:28 PM, Mike La Spina <mike.laspina at laspina.ca> wrote:
>> > Hi,
>> >
>> > I have replications between hosts and they are working fine with
>> > zfs send/recv's after upgrading to Indiana snv_111b (2009.06).
>> >
>> > Have you run the commands manually to see any messages/prompts are
>> > occurring?
>> >
>> > It sounds like its waiting for some input.
>> >
>> > Regards,
>> >
>> > Mike
>> >
>> > http://blog.laspina.ca/
>> >
>>
>> If I power cycle the server, I can run the replication script manually.
>> The script will go automatically again for another night or two,
>> before hanging up.
>> I've piped all output to a file, and there isn't any prompt for user
>> input, and the zfs receive on the remote side is un-killable (and
>> hangs the server when trying to restart).
>>
>> It appears to be the receiving end choking on a snapshot, and not
>> allowing any more to run.
>> Once one snapshot freezes, running another (for a different file
>> system) zfs send/recv will just stall, with another un-killable zfs
>> receive.
>>
>
> Is it the version of ZFS?  I think it was upgraded.  I noticed
> something similar after upgrading ZFS on FreeBSD 7 STABLE.  I was trying
> to zfs send my @Tuesday, and an automatic script ran (which deletes
> @Tuesday and takes a new snap) - and rather than failing as I expected,
> the destroy and snapshot commands hung around until the send was done
> (hosed up my incrementals - doh :)
>
> Rick
>
>
>
Running the latest version of ZFS on all my file systems.
My replication script adds a user property to the file system, to
effectively "lock it".
My cleanup scripts check for that lock flag, and will die if they see it set.
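
For readers unfamiliar with this kind of locking, here is a minimal sketch of how a ZFS user property could serve as a lock flag. The property name `com.example:replicating` and the function names are hypothetical, chosen only for illustration; the actual replicate.ksh may well do this differently. It relies only on the standard `zfs set`, `zfs get -H -o value`, and `zfs inherit` subcommands.

```shell
#!/bin/sh
# Hypothetical user property used as a lock flag (an assumption, not taken
# from the original replication script).
LOCK_PROP="com.example:replicating"

# Mark a file system as busy before starting zfs send.
lock_fs() {
    zfs set "$LOCK_PROP=on" "$1"
}

# Clear the flag once the send/recv has finished.
unlock_fs() {
    zfs inherit "$LOCK_PROP" "$1"
}

# Succeeds (exit 0) if the lock flag is currently set on the file system.
is_locked() {
    [ "$(zfs get -H -o value "$LOCK_PROP" "$1")" = "on" ]
}

# A cleanup script would call this first and bail out if it fails.
cleanup_guard() {
    if is_locked "$1"; then
        echo "$1 is locked by replication, skipping cleanup" >&2
        return 1
    fi
    return 0
}
```

A cleanup job could then start with something like `cleanup_guard pool/fs || exit 1` before destroying any snapshots.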

It's the send/receive that is hung up; I see the pending receive
still sitting there, more than 24 hours later.

Sad

-- 
Brent Jones
brent at servuhome.net

Ian Collins

2009-Jun-05 22:25 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Brent Jones wrote:
>
> On the sending side, I CAN kill the ZFS send process, but the remote
> side leaves its processes going, and I CANNOT kill -9 them. I also
> cannot reboot the receiving system, at init 6, the system will just
> hang trying to unmount the file systems.
> I have to physically cut power to the server, but a couple days later,
> this issue will occur again.
>

I have seen this on Solaris 10.  Something appears to break with a pool
or filesystem, causing zfs receive to hang in the kernel.  Once this
happens, any zfs command that changes the state of the pool/filesystem
will hang, including a zpool detach or an init 6.

Can you get truss -p or mdb -p to work on the stuck process?

-- 
Ian.

Brent Jones

2009-Jun-05 22:31 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins <ian at ianshome.com> wrote:
> Brent Jones wrote:
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote
>> side leaves its processes going, and I CANNOT kill -9 them. I also
>> cannot reboot the receiving system, at init 6, the system will just
>> hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later,
>> this issue will occur again.
>>
>>
>
> I have seen this on Solaris 10.  Something appears to break with a pool or
> filesystem causing zfs receive to hang in the kernel.  Once this happens,
> any zfs command that changes the state of the pool/filesystem will hang,
> including a zpool detach or an init 6.
>
> Can you get truss -p or mdb -p to work on the stuck process?
>
> --
> Ian.
>
>
I cannot.

# truss -p 11308
truss: unanticipated system error: 11308
(root at pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
# mdb -p 11308
mdb: cannot debug 11308: unanticipated system error
mdb: failed to initialize target: No such file or directory


All the hung zfs receive PIDs have '1' as their PPID.
Is it safe to truss PID 1?  :)
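
As a rough aid for spotting this state, the sketch below filters `ps` output for zfs receive processes that have been reparented to init (PPID 1). The function name `orphaned_zfs_recv` is made up for this example, and it assumes `ps -eo pid,ppid,args` is available on the receiving host.

```shell
#!/bin/sh
# Read "pid ppid args" lines (as printed by: ps -eo pid,ppid,args) on stdin
# and print the PIDs of zfs recv/receive processes whose parent is PID 1.
# The function name is hypothetical.
orphaned_zfs_recv() {
    awk '$2 == 1 && $3 ~ /zfs$/ && ($4 == "recv" || $4 == "receive") { print $1 }'
}

# Typical use on the receiving host:
#   ps -eo pid,ppid,args | orphaned_zfs_recv
```

This only identifies the stuck processes; as described in this thread, they could not be killed or traced once in this state.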

When you saw this, how did you escape it? I've found only pulling the
plug will fix it.

-- 
Brent Jones
brent at servuhome.net

Ian Collins

2009-Jun-05 22:51 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Brent Jones wrote:
> On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins <ian at ianshome.com> wrote:
>> Brent Jones wrote:
>>> On the sending side, I CAN kill the ZFS send process, but the remote
>>> side leaves its processes going, and I CANNOT kill -9 them. I also
>>> cannot reboot the receiving system, at init 6, the system will just
>>> hang trying to unmount the file systems.
>>> I have to physically cut power to the server, but a couple days later,
>>> this issue will occur again.
>>>
>> I have seen this on Solaris 10.  Something appears to break with a pool or
>> filesystem causing zfs receive to hang in the kernel.  Once this happens,
>> any zfs command that changes the state of the pool/filesystem will hang,
>> including a zpool detach or an init 6.
>>
>> Can you get truss -p or mdb -p to work on the stuck process?
>
> I cannot.
>
> # truss -p 11308
> truss: unanticipated system error: 11308
> (root at pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
> # mdb -p 11308
> mdb: cannot debug 11308: unanticipated system error
> mdb: failed to initialize target: No such file or directory
>
>

Same as me...

> All the hung zfs receives PID's have '1' as their PPID.
> Is it safe to truss PID 1?  :)
>
> When you saw this, how did you escape it? I've found only pulling the
> plug will fix it.
>

I'm several miles away from the boxes, so I had to resort to a hard
reset through the ILOM.

I have yet to identify the root cause; all I know is that the problem
happens "sometimes".  Over the past few days I have sent several tens of
thousands of snapshots to the last system that hung, without incident.

-- 
Ian.

Tim Haley

2009-Jun-05 23:20 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Brent Jones wrote:
> Hello all,
> I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
> I would ship snapshots from the primary server to the secondary server
> nightly, which was working really well.
> 
> However, I have upgraded to 2009.06, and my replication scripts appear
> to "hang" when performing zfs send/recv.
> When one zfs send/recv process hangs, you cannot send any other
> snapshots from any other filesystem to the remote host.
> I have about 20 file systems I snapshots and replicate nightly.
> 
> The script I use to perform the snapshots is here:
> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
> 
> On the remote side, I end up with many "hung" processes, like this:
> 
>   bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>   bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>   bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>   bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
> 
> And so on, one for each file system.
> 
> On the receiving end, 'zfs list' shows one filesystem attempting to
> receive a snapshot, but I cannot stop it:
> 
> $ zfs list
> NAME                                       USED  AVAIL  REFER  MOUNTPOINT
> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
> 
> 
> 
> On the sending side, I CAN kill the ZFS send process, but the remote
> side leaves its processes going, and I CANNOT kill -9 them. I also
> cannot reboot the receiving system, at init 6, the system will just
> hang trying to unmount the file systems.
> I have to physically cut power to the server, but a couple days later,
> this issue will occur again.
> 

A crash dump from the receiving server with the stuck receives would be highly
useful, if you can get it.  Reboot -d would be best, but it might just hang.
You can try savecore -L.

-tim

> If I boot to my snv_106 BE, everything works fine, this issue has
> never occurred on that version.
> 
> Any thoughts?
>

Ian Collins

2009-Jun-05 23:29 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Tim Haley wrote:
> Brent Jones wrote:
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote
>> side leaves its processes going, and I CANNOT kill -9 them. I also
>> cannot reboot the receiving system, at init 6, the system will just
>> hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later,
>> this issue will occur again.
>>
>>
> A crash dump from the receiving server with the stuck receives would 
> be highly useful, if you can get it.  Reboot -d would be best, but it 
> might just hang. You can try savecore -L.
>

I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang.  I
didn't try savecore.

One thing I didn't try was scat on the running system.  What should I
look for (with scat) if this happens again?

-- 
Ian.

Brent Jones

2009-Jun-05 23:56 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley <tim.haley at sun.com> wrote:
> Brent Jones wrote:
>>
>> Hello all,
>> I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
>> I would ship snapshots from the primary server to the secondary server
>> nightly, which was working really well.
>>
>> However, I have upgraded to 2009.06, and my replication scripts appear
>> to "hang" when performing zfs send/recv.
>> When one zfs send/recv process hangs, you cannot send any other
>> snapshots from any other filesystem to the remote host.
>> I have about 20 file systems I snapshots and replicate nightly.
>>
>> The script I use to perform the snapshots is here:
>> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
>>
>> On the remote side, I end up with many "hung" processes, like this:
>>
>>   bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>   bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>   bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>   bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>
>> And so on, one for each file system.
>>
>> On the receiving end, 'zfs list' shows one filesystem attempting to
>> receive a snapshot, but I cannot stop it:
>>
>> $ zfs list
>> NAME                                       USED  AVAIL  REFER  MOUNTPOINT
>> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
>>
>>
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote
>> side leaves its processes going, and I CANNOT kill -9 them. I also
>> cannot reboot the receiving system, at init 6, the system will just
>> hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later,
>> this issue will occur again.
>>
>>
> A crash dump from the receiving server with the stuck receives would be
> highly useful, if you can get it.  Reboot -d would be best, but it might
> just hang. You can try savecore -L.
>
> -tim
>
>> If I boot to my snv_106 BE, everything works fine, this issue has
>> never occurred on that version.
>>
>> Any thoughts?
>>
>
>
I'm doing a savecore -L, but I have 64GB of RAM, which makes the dumps
a pita to work with.

Is there any additional information I can provide?

-- 
Brent Jones
brent at servuhome.net

Brent Jones

2009-Jun-06 00:57 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley <tim.haley at sun.com> wrote:
> Brent Jones wrote:
>>
>> Hello all,
>> I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
>> I would ship snapshots from the primary server to the secondary server
>> nightly, which was working really well.
>>
>> However, I have upgraded to 2009.06, and my replication scripts appear
>> to "hang" when performing zfs send/recv.
>> When one zfs send/recv process hangs, you cannot send any other
>> snapshots from any other filesystem to the remote host.
>> I have about 20 file systems I snapshots and replicate nightly.
>>
>> The script I use to perform the snapshots is here:
>> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
>>
>> On the remote side, I end up with many "hung" processes, like this:
>>
>>   bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>   bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>   bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>   bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>
>> And so on, one for each file system.
>>
>> On the receiving end, 'zfs list' shows one filesystem attempting to
>> receive a snapshot, but I cannot stop it:
>>
>> $ zfs list
>> NAME                                       USED  AVAIL  REFER  MOUNTPOINT
>> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
>>
>>
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote
>> side leaves its processes going, and I CANNOT kill -9 them. I also
>> cannot reboot the receiving system, at init 6, the system will just
>> hang trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later,
>> this issue will occur again.
>>
>>
> A crash dump from the receiving server with the stuck receives would be
> highly useful, if you can get it.  Reboot -d would be best, but it might
> just hang. You can try savecore -L.
>
> -tim
>
>> If I boot to my snv_106 BE, everything works fine, this issue has
>> never occurred on that version.
>>
>> Any thoughts?
>>
>
>
Well, I think I found a specific file system that is causing this.
I kicked off a zpool scrub to see if there might be corruption on
either end, but that takes well over 40 hours on these servers.


-- 
Brent Jones
brent at servuhome.net

Brent Jones

2009-Jun-06 20:57 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

>
> Well, I think I found a specific file system that is causing this.
> I kicked off a zpool scrub to see if there might be corruption on
> either end, but that takes well over 40 hours on these servers.
>
>
> --
> Brent Jones
> brent at servuhome.net
>
It turns out that the file system I believed would trigger this
scenario is not a guaranteed way to lock up all ZFS related
commands/operations.

However, I have found something else: the problem seems to be "load
dependent", meaning that if I kick off just one send/recv, there is a 100%
chance of success.
However, if I batch up 3 or more send/recv operations going at the
same time (for different file systems), there seems to be a 50% chance
of all the send/recv operations stalling out and entering this "hung"
state.
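
Given that observation, one possible stopgap is to run the replication jobs strictly one at a time. The sketch below is only an illustration of that idea, not a fix for the underlying problem: `replicate_one` and `REPLICATE_CMD` are hypothetical names standing in for whatever command performs a single zfs send/recv for one file system.

```shell
#!/bin/sh
# Run one replication job at a time instead of launching them in parallel.
# REPLICATE_CMD is a placeholder (hypothetical) for the command that
# replicates a single file system, e.g. a wrapper around zfs send | zfs recv.
REPLICATE_CMD=${REPLICATE_CMD:-replicate_one}

replicate_serially() {
    failed=0
    for fs in "$@"; do
        # Each job must finish before the next one starts.
        if ! "$REPLICATE_CMD" "$fs"; then
            echo "replication failed for $fs" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every job succeeded.
    [ "$failed" -eq 0 ]
}
```

Serializing trades a longer replication window for never having more than one receive in flight on the remote host.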

I had to restart the scrub; it should be done Monday, and I'll see if
anything is uncovered.

-- 
Brent Jones
brent at servuhome.net

Ian Collins

2009-Jun-07 10:50 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Ian Collins wrote:
> Tim Haley wrote:
>> Brent Jones wrote:
>>>
>>> On the sending side, I CAN kill the ZFS send process, but the remote
>>> side leaves its processes going, and I CANNOT kill -9 them. I also
>>> cannot reboot the receiving system, at init 6, the system will just
>>> hang trying to unmount the file systems.
>>> I have to physically cut power to the server, but a couple days later,
>>> this issue will occur again.
>>>
>>>
>> A crash dump from the receiving server with the stuck receives would 
>> be highly useful, if you can get it. Reboot -d would be best, but it 
>> might just hang. You can try savecore -L.
>>
> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang.
> I didn't try savecore.
>
> One thing I didn't try was scat on the running system. What should I
> look for (with scat) if this happens again?
>

I now have a system with a hanging zfs receive, any hints on debugging it?

-- 
Ian.

Brent Jones

2009-Jun-08 17:27 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Sun, Jun 7, 2009 at 3:50 AM, Ian Collins <ian at ianshome.com> wrote:
> Ian Collins wrote:
>>
>> Tim Haley wrote:
>>>
>>> Brent Jones wrote:
>>>>
>>>> On the sending side, I CAN kill the ZFS send process, but the remote
>>>> side leaves its processes going, and I CANNOT kill -9 them. I also
>>>> cannot reboot the receiving system, at init 6, the system will just
>>>> hang trying to unmount the file systems.
>>>> I have to physically cut power to the server, but a couple days later,
>>>> this issue will occur again.
>>>>
>>>>
>>> A crash dump from the receiving server with the stuck receives would be
>>> highly useful, if you can get it. Reboot -d would be best, but it might just
>>> hang. You can try savecore -L.
>>>
>> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I
>> didn't try savecore.
>>
>> One thing I didn't try was scat on the running system. What should I look
>> for (with scat) if this happens again?
>>
> I now have a system with a hanging zfs receive, any hints on debugging it?
>
> --
> Ian.
I haven't figured out a way to identify the problem; I'm still trying to
find a 100% reliable way to reproduce it.
Seemingly, the more snapshots I send at a given time, the more likely
this is to happen, but correlation is not causation  :)

I might try to open a support case with Sun (I have a support contract),
but OpenSolaris doesn't seem to be well understood by the support
folks yet, so I'm not sure how far it will get.

-- 
Brent Jones
brent at servuhome.net

Brent Jones

2009-Jun-09 03:57 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

>
> I haven't figured out a way to identify the problem, still trying to
> find a 100% way to reproduce this problem.
> Seemingly the more snapshots I send at a given time, the likelihood of
> this happening goes up, but, correlation is not causation  :)
>
> I might try to open a support case with Sun (have a support contract),
> but Opensolaris doesn't seem to be well understood by the support
> folks yet, so not sure how far it will get.
>
> --
> Brent Jones
> brent at servuhome.net
>
I can reproduce this 100% by sending about 6 or more snapshots at once.

Here is some output that JBK helped me put together:

Here is a pastebin 'mdb' findstack output:
http://pastebin.com/m4751b08c

Not sure what I'm looking at, but maybe someone at Sun can see what's
going on?



-- 
Brent Jones
brent at servuhome.net

Richard Lowe

2009-Jun-09 04:38 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Brent Jones <brent at servuhome.net> writes:
>>
>> I haven't figured out a way to identify the problem, still trying to
>> find a 100% way to reproduce this problem.
>> Seemingly the more snapshots I send at a given time, the likelihood of
>> this happening goes up, but, correlation is not causation  :)
>>
>> I might try to open a support case with Sun (have a support contract),
>> but Opensolaris doesn't seem to be well understood by the support
>> folks yet, so not sure how far it will get.
>>
>> --
>> Brent Jones
>> brent at servuhome.net
>>
>
> I can reproduce this 100% by sending about 6 or more snapshots at once.
>
> Here is some output that JBK helped me put together:
>
> Here is a pastebin 'mdb' findstack output:
> http://pastebin.com/m4751b08c
>
> Not sure what I'm looking at, but maybe someone at Sun can see whats going on?

I've had similar issues with similar traces.  I think you're waiting on
a transaction that's never going to come.

I thought at the time that I was hitting:
   CR 6367701 "hang because tx_state_t is inconsistent"

But given the rash of reports here, it seems perhaps this is something
different.

Like you, I hit it when sending snapshots. In my case it seems to be
specific to incremental streams rather than full streams: I can send
seemingly any number of full streams, but incremental sends via send -i,
or send -R of datasets with multiple snapshots, will get into a state
like that above.

-- Rich

Brent Jones

2009-Jun-09 05:01 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe <richlowe at richlowe.net> wrote:
> Brent Jones <brent at servuhome.net> writes:
>
>
> I've had similar issues with similar traces.  I think you're waiting on
> a transaction that's never going to come.
>
> I thought at the time that I was hitting:
>   CR 6367701 "hang because tx_state_t is inconsistent"
>
> But given the rash of reports here, it seems perhaps this is something
> different.
>
> I, like you, hit it when sending snapshots, it seems (in my case) to be
> specific to incremental streams, rather than full streams, I can send
> seemingly any number of full streams, but incremental sends via send -i,
> or send -R of datasets with multiple snapshots, will get into a state
> like that above.
>
> -- Rich
>
For now, I'm going back to snv_106 (the most stable build that I've seen;
I like it a lot).
I'll open a case in the morning and see what they suggest.


-- 
Brent Jones
brent at servuhome.net

Tim Haley

2009-Jun-10 22:22 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Brent Jones wrote:
> On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe <richlowe at richlowe.net> wrote:
>> Brent Jones <brent at servuhome.net> writes:
>>
> 
>> I've had similar issues with similar traces.  I think you're waiting on
>> a transaction that's never going to come.
>>
>> I thought at the time that I was hitting:
>>   CR 6367701 "hang because tx_state_t is inconsistent"
>>
>> But given the rash of reports here, it seems perhaps this is something
>> different.
>>
>> I, like you, hit it when sending snapshots, it seems (in my case) to be
>> specific to incremental streams, rather than full streams, I can send
>> seemingly any number of full streams, but incremental sends via send -i,
>> or send -R of datasets with multiple snapshots, will get into a state
>> like that above.
>>
>> -- Rich
>>
> 
> For now, back to snv_106 (the most stable build that I've seen, like it a lot)
> I'll open a case in the morning, and see what they suggest.
>

After examining the dump we got from you (thanks again), we're relatively sure
you are hitting

6826836 Deadlock possible in dmu_object_reclaim()

This was introduced in nv_111 and fixed in nv_113.
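
For anyone wanting to check a host against that range, here is a small sketch. It assumes the build string has the `snv_NNN` form (for example `snv_111b`, as `uname -v` prints on OpenSolaris); the function name `build_is_affected` is made up for this example, and the 111/113 bounds are simply the ones stated above.

```shell
#!/bin/sh
# Succeed if a build string such as "snv_111b" or "nv_112" falls in the
# range reported above: introduced in build 111, fixed in build 113.
# Hypothetical helper; returns 2 if the string cannot be parsed.
build_is_affected() {
    # Strip the snv_/nv_ prefix and anything after the leading digits.
    num=$(printf '%s\n' "$1" | sed -e 's/^s\{0,1\}nv_//' -e 's/[^0-9].*$//')
    case "$num" in
        ''|*[!0-9]*) return 2 ;;   # not a recognisable build string
    esac
    [ "$num" -ge 111 ] && [ "$num" -lt 113 ]
}

# Example use:
#   build_is_affected "$(uname -v)" && echo "build may be exposed to 6826836"
```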

Sorry for the trouble.

-tim

Robert Milkowski

2009-Jun-11 20:28 UTC


[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Hello Ian,

Saturday, June 6, 2009, 12:29:48 AM, you wrote:

IC> Tim Haley wrote:
>> Brent Jones wrote:
>>>
>>> On the sending side, I CAN kill the ZFS send process, but the remote
>>> side leaves its processes going, and I CANNOT kill -9 them. I also
>>> cannot reboot the receiving system, at init 6, the system will just
>>> hang trying to unmount the file systems.
>>> I have to physically cut power to the server, but a couple days later,
>>> this issue will occur again.
>>>
>>>
>> A crash dump from the receiving server with the stuck receives would
>> be highly useful, if you can get it.  Reboot -d would be best, but it
>> might just hang. You can try savecore -L.
>>
IC> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang.  I
IC> didn't try savecore.

mdb -KF

and then $<systemdump

P.S. Make sure you have access to the console.

-- 
Best regards,
 Robert Milkowski
                                       http://milek.blogspot.com

Brent Jones

2009-Jun-12 01:43 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

>>
>>
> After examining the dump we got from you (thanks again), we're relatively
> sure you are hitting
>
> 6826836 Deadlock possible in dmu_object_reclaim()
>
> This was introduced in nv_111 and fixed in nv_113.
>
> Sorry for the trouble.
>
> -tim
>
>
Do you know when new builds will show up on pkg.opensolaris.org/dev?


-- 
Brent Jones
brent at servuhome.net

Ian Collins

2009-Jul-01 04:45 UTC


[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers

Tim Haley wrote:
> Ian Collins wrote:
>> Ian Collins wrote:
>>> Tim Haley wrote:
>>>> Brent Jones wrote:
>>>>>
>>>>> On the sending side, I CAN kill the ZFS send process, but the remote
>>>>> side leaves its processes going, and I CANNOT kill -9 them. I also
>>>>> cannot reboot the receiving system, at init 6, the system will just
>>>>> hang trying to unmount the file systems.
>>>>> I have to physically cut power to the server, but a couple days
>>>>> later,
>>>>> this issue will occur again.
>>>>>
>>>>>
>>>> A crash dump from the receiving server with the stuck receives
>>>> would be highly useful, if you can get it. Reboot -d would be best,
>>>> but it might just hang. You can try savecore -L.
>>>>
>>> I tried a reboot -d (I even had kmem-flags=0xf set), but it did
>>> hang. I didn't try savecore.
>>>
>>> One thing I didn't try was scat on the running system. What should I
>>> look for (with scat) if this happens again?
>>>
>> I now have a system with a hanging zfs receive, any hints on
>> debugging it?
>>
>
> If you've got it stuck, but can still do things on the console, then
> run 'mdb -K' on the console and type '::stacks -m zfs'.  That will
> summarize all threads running in the kernel related to zfs.  Perhaps
> there will be a clue in the stacks of the receive(s) as to where they
> are stuck.
>

I've seen this again on Solaris 10u7.

I'm doing a couple of full sends from one pool to another on the same
host.  One of the sends stopped while the other completed.  All zfs
commands on the source pool now hang, the destination pool is OK.

::stacks isn't recognised by mdb on Solaris 10, is there an alternative?

> Also, make sure the spa is healthy and not suspended.  This is an
> example on one of my machines.
>

There were.

-- 
Ian.
