Brent Jones
2009-Jun-05 18:32 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Hello all,

I had been running snv_106 for about 3 or 4 months on a pair of X4540s. I shipped snapshots from the primary server to the secondary server nightly, which was working really well.

However, I have upgraded to 2009.06, and my replication scripts appear to "hang" when performing zfs send/recv. When one zfs send/recv process hangs, I cannot send any other snapshots from any other file system to the remote host. I have about 20 file systems that I snapshot and replicate nightly.

The script I use to perform the snapshots is here:
http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh

On the remote side, I end up with many "hung" processes, like this:

  bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
  bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02

And so on, one for each file system.

On the receiving end, 'zfs list' shows one file system attempting to receive a snapshot, but I cannot stop it:

  $ zfs list
  NAME                                     USED  AVAIL  REFER  MOUNTPOINT
  pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00

On the sending side, I CAN kill the zfs send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system: at init 6, the system just hangs trying to unmount the file systems. I have to physically cut power to the server, but a couple of days later the issue occurs again.

If I boot to my snv_106 BE, everything works fine; this issue has never occurred on that version.

Any thoughts?

--
Brent Jones
brent at servuhome.net
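For readers who have not seen this kind of setup, here is a minimal sketch of the shape of one nightly incremental send. It is an illustration only, not the contents of replicate.ksh: the use of ssh as the transport, the function name, and the snapshot and host names in the usage note are all assumptions. Only the `/sbin/zfs recv -vFd <pool>` part is taken from the process listing above. The function prints the pipeline rather than running it, so it can be inspected safely.

```shell
#!/bin/sh
# Sketch only: prints the pipeline for one incremental send instead of running it.
# Arguments (all hypothetical names supplied by the caller):
#   $1 previous snapshot, $2 current snapshot, $3 remote host, $4 receiving pool
build_replication_cmd() {
    printf 'zfs send -i %s %s | ssh %s /sbin/zfs recv -vFd %s\n' \
        "$1" "$2" "$3" "$4"
}
```

For example, `build_replication_cmd pool/fs@mon pool/fs@tue backuphost pdxfilu02` prints the command line that would ship the changes between the two snapshots to the remote pool.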
Brent Jones
2009-Jun-05 21:45 UTC
[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Fri, Jun 5, 2009 at 2:28 PM, Mike La Spina <mike.laspina at laspina.ca> wrote:
> Hi,
>
> I have replications between hosts and they are working fine with zfs send/recv's after upgrading to Indiana snv_111b (2009.06).
>
> Have you run the commands manually to see whether any messages/prompts are occurring?
>
> It sounds like it's waiting for some input.
>
> Regards,
>
> Mike
>
> http://blog.laspina.ca/

If I power cycle the server, I can run the replication script manually. The script then runs automatically for another night or two before hanging again.

I've piped all output to a file, and there isn't any prompt for user input. The zfs receive on the remote side is un-killable (and hangs the server when trying to restart).

It appears to be the receiving end choking on a snapshot and not allowing any more to run. Once one snapshot freezes, running another zfs send/recv (for a different file system) just stalls, with another un-killable zfs receive.

--
Brent Jones
brent at servuhome.net
Rick Romero
2009-Jun-05 21:49 UTC
[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Fri, 2009-06-05 at 14:45 -0700, Brent Jones wrote:
> It appears to be the receiving end choking on a snapshot, and not
> allowing any more to run.
> Once one snapshot freezes, running another (for a different file
> system) zfs send/recv will just stall, with another un-killable zfs
> receive.

Is it the version of ZFS? I think it was upgraded. I noticed something similar after upgrading ZFS on FreeBSD 7 STABLE. I was trying to zfs send my @Tuesday when an automatic script ran (which deletes @Tuesday and takes a new snap). Rather than failing as I expected, the destroy and snapshot commands hung around until the send was done (hosed up my incrementals - doh :)

Rick
Brent Jones
2009-Jun-05 21:55 UTC
[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Fri, Jun 5, 2009 at 2:49 PM, Rick Romero <rick at havokmon.com> wrote:
> Is it the version of ZFS? I think it was upgraded. I noticed
> something similar after upgrading ZFS on FreeBSD 7 STABLE. I was trying
> to zfs send my @Tuesday, and an automatic script ran (which deletes
> @Tuesday and takes a new snap) - and rather than failing as I expected,
> the destroy and snapshot commands hung around until the send was done
> (hosed up my incrementals - doh :)
>
> Rick

I'm running the latest version of ZFS on all my file systems. My replication script adds a user property to the file system to effectively "lock" it. My cleanup scripts check for that lock flag and will die if they see it set.

It's the send/receive that is hung up; I see the pending receive still sitting there more than 24 hours later. Sad.

--
Brent Jones
brent at servuhome.net
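The lock-flag approach described above can be sketched roughly as follows. This is a guess at the general shape, not the code from replicate.ksh: the property name `local:replicating` and the function names are hypothetical. The only ZFS facts relied on are that user properties contain a colon, are set with `zfs set`, cleared with `zfs inherit`, and read with `zfs get -H -o value`.

```shell
#!/bin/sh
# Sketch only: property and function names are hypothetical.
ZFS=${ZFS:-/sbin/zfs}
LOCK_PROP="local:replicating"

# Mark a file system as busy before starting a send.
lock_fs() {
    "$ZFS" set "$LOCK_PROP=yes" "$1"
}

# Clear the user property once the send has finished.
unlock_fs() {
    "$ZFS" inherit "$LOCK_PROP" "$1"
}

# Succeeds only when the lock property is currently set to "yes".
is_locked() {
    [ "$("$ZFS" get -H -o value "$LOCK_PROP" "$1")" = "yes" ]
}
```

A cleanup script could then begin with something like `is_locked "$fs" && exit 1`, so that it never destroys a snapshot while a send is still in flight.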
Ian Collins
2009-Jun-05 22:25 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Brent Jones wrote:
> On the sending side, I CAN kill the ZFS send process, but the remote
> side leaves its processes going, and I CANNOT kill -9 them. I also
> cannot reboot the receiving system, at init 6, the system will just
> hang trying to unmount the file systems.
> I have to physically cut power to the server, but a couple days later,
> this issue will occur again.

I have seen this on Solaris 10. Something appears to break with a pool or filesystem, causing zfs receive to hang in the kernel. Once this happens, any zfs command that changes the state of the pool/filesystem will hang, including a zpool detach or an init 6.

Can you get truss -p or mdb -p to work on the stuck process?

--
Ian.
Brent Jones
2009-Jun-05 22:31 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins <ian at ianshome.com> wrote:
> I have seen this on Solaris 10. Something appears to break with a pool or
> filesystem causing zfs receive to hang in the kernel. Once this happens,
> any zfs command that changes the state of the pool/filesystem will hang,
> including a zpool detach or an int 6.
>
> Can you get truss -p or mdb -p to work on the stuck process?

I cannot.

  # truss -p 11308
  truss: unanticipated system error: 11308
  (root at pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
  # mdb -p 11308
  mdb: cannot debug 11308: unanticipated system error
  mdb: failed to initialize target: No such file or directory

All the hung zfs receive PIDs have '1' as their PPID. Is it safe to truss PID 1? :)

When you saw this, how did you escape it? I've found only pulling the plug will fix it.

--
Brent Jones
brent at servuhome.net
Ian Collins
2009-Jun-05 22:51 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Brent Jones wrote:
> I cannot.
>
> # truss -p 11308
> truss: unanticipated system error: 11308
> (root at pdxfilu02)-(06:29 PM Fri Jun 05)-(log)
> # mdb -p 11308
> mdb: cannot debug 11308: unanticipated system error
> mdb: failed to initialize target: No such file or directory

Same as me...

> All the hung zfs receives PID's have '1' as their PPID.
> Is it safe to truss PID 1? :)
>
> When you saw this, how did you escape it? I've found only pulling the
> plug will fix it.

I'm several miles away from the boxes, so I had to resort to a hard reset through the ILOM. I have yet to identify the root cause; all I know is that the problem happens "sometimes". Over the past few days I have sent several tens of thousands of snapshots to the last system that hung, without incident.

--
Ian.
Tim Haley
2009-Jun-05 23:20 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Brent Jones wrote:
> On the sending side, I CAN kill the ZFS send process, but the remote
> side leaves its processes going, and I CANNOT kill -9 them. I also
> cannot reboot the receiving system, at init 6, the system will just
> hang trying to unmount the file systems.
> I have to physically cut power to the server, but a couple days later,
> this issue will occur again.

A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L.

-tim
Ian Collins
2009-Jun-05 23:29 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Tim Haley wrote:
> A crash dump from the receiving server with the stuck receives would
> be highly useful, if you can get it. Reboot -d would be best, but it
> might just hang. You can try savecore -L.

I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I didn't try savecore.

One thing I didn't try was scat on the running system. What should I look for (with scat) if this happens again?

--
Ian.
Brent Jones
2009-Jun-05 23:56 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley <tim.haley at sun.com> wrote:
> A crash dump from the receiving server with the stuck receives would be
> highly useful, if you can get it. Reboot -d would be best, but it might
> just hang. You can try savecore -L.
>
> -tim

I'm doing a savecore -L, but I have 64GB of RAM, which makes the dumps a pita to work with. Is there any additional information I can provide?

--
Brent Jones
brent at servuhome.net
Brent Jones
2009-Jun-06 00:57 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley <tim.haley at sun.com> wrote:
> A crash dump from the receiving server with the stuck receives would be
> highly useful, if you can get it. Reboot -d would be best, but it might
> just hang. You can try savecore -L.
>
> -tim

Well, I think I found a specific file system that is causing this. I kicked off a zpool scrub to see if there might be corruption on either end, but that takes well over 40 hours on these servers.

--
Brent Jones
brent at servuhome.net
Brent Jones
2009-Jun-06 20:57 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
> Well, I think I found a specific file system that is causing this.
> I kicked off a zpool scrub to see if there might be corruption on
> either end, but that takes well over 40 hours on these servers.

It turns out that the file system I suspected does not reliably trigger the lockup of all ZFS-related commands/operations.

However, I have found that the problem seems to be "load dependent": if I kick off just one send/recv, it succeeds 100% of the time. If I batch up 3 or more send/recv operations running at the same time (for different file systems), there seems to be about a 50% chance of all the send/recv operations stalling out and entering this "hung" state.

I had to restart the scrub; it should be done Monday, and I'll see if anything is uncovered.

--
Brent Jones
brent at servuhome.net
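Since a single send/recv at a time reportedly always succeeds, one way to act on that observation is to run the nightly jobs one after another rather than in a batch. The sketch below illustrates that idea only; it is not something proposed in the thread and not a fix for the underlying problem. `replicate_one` is a hypothetical stand-in for the real per-file-system send/recv step and by default just prints what it would do.

```shell
#!/bin/sh
# Sketch only: replicate_one is a hypothetical placeholder for the real
# per-file-system send/recv step; here it only prints its argument.
replicate_one() {
    echo "would replicate $1"
}

# Run the jobs strictly one after another instead of in parallel,
# stopping at the first job that reports a failure.
replicate_serially() {
    for fs in "$@"; do
        replicate_one "$fs" || {
            echo "replication of $fs failed, stopping" >&2
            return 1
        }
    done
}
```

Called as `replicate_serially pdxfilu02/data/fs01 pdxfilu02/data/fs02`, it never has more than one transfer in flight. Note that it only reacts to a job that exits with an error; a receive that hangs in the kernel would still block the loop.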
Ian Collins
2009-Jun-07 10:50 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Ian Collins wrote:
> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang.
> I didn't try savecore.
>
> One thing I didn't try was scat on the running system. What should I
> look for (with scat) if this happens again?

I now have a system with a hanging zfs receive. Any hints on debugging it?

--
Ian.
Brent Jones
2009-Jun-08 17:27 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Sun, Jun 7, 2009 at 3:50 AM, Ian Collins <ian at ianshome.com> wrote:
> I now have a system with a hanging zfs receive, any hints on debugging it?
>
> --
> Ian.

I haven't figured out a way to identify the problem; I'm still trying to find a 100% reliable way to reproduce it. Seemingly, the more snapshots I send at a given time, the more likely this is to happen, but correlation is not causation :)

I might try to open a support case with Sun (I have a support contract), but OpenSolaris doesn't seem to be well understood by the support folks yet, so I'm not sure how far it will get.

--
Brent Jones
brent at servuhome.net
Brent Jones
2009-Jun-09 03:57 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
> I haven't figured out a way to identify the problem, still trying to
> find a 100% way to reproduce this problem.
> Seemingly the more snapshots I send at a given time, the likelihood of
> this happening goes up, but, correlation is not causation :)

I can reproduce this 100% of the time by sending about 6 or more snapshots at once.

Here is some output that JBK helped me put together, a pastebin of 'mdb' findstack output:
http://pastebin.com/m4751b08c

I'm not sure what I'm looking at, but maybe someone at Sun can see what's going on?

--
Brent Jones
brent at servuhome.net
Richard Lowe
2009-Jun-09 04:38 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Brent Jones <brent at servuhome.net> writes:
> I can reproduce this 100% by sending about 6 or more snapshots at once.
>
> Here is some output that JBK helped me put together:
>
> Here is a pastebin 'mdb' findstack output:
> http://pastebin.com/m4751b08c
>
> Not sure what I'm looking at, but maybe someone at Sun can see whats going on?

I've had similar issues with similar traces. I think you're waiting on a transaction that's never going to come.

I thought at the time that I was hitting:

  CR 6367701 "hang because tx_state_t is inconsistent"

But given the rash of reports here, it seems perhaps this is something different.

Like you, I hit it when sending snapshots. In my case it seems to be specific to incremental streams rather than full streams: I can send seemingly any number of full streams, but incremental sends via send -i, or send -R of datasets with multiple snapshots, will get into a state like that above.

-- Rich
Brent Jones
2009-Jun-09 05:01 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowe <richlowe at richlowe.net> wrote:
> I've had similar issues with similar traces. I think you're waiting on
> a transaction that's never going to come.
>
> I thought at the time that I was hitting:
>   CR 6367701 "hang because tx_state_t is inconsistent"
>
> But given the rash of reports here, it seems perhaps this is something
> different.

For now, I'm back on snv_106 (the most stable build that I've seen; I like it a lot). I'll open a case in the morning and see what they suggest.

--
Brent Jones
brent at servuhome.net
Tim Haley
2009-Jun-10 22:22 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Brent Jones wrote:
> For now, back to snv_106 (the most stable build that I've seen, like it a lot)
> I'll open a case in the morning, and see what they suggest.

After examining the dump we got from you (thanks again), we're relatively sure you are hitting

  6826836 Deadlock possible in dmu_object_reclaim()

This was introduced in nv_111 and fixed in nv_113.

Sorry for the trouble.

-tim
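The affected range stated above (introduced in build 111, fixed in build 113) can be turned into a quick check. This is a minimal sketch under stated assumptions: the function name is hypothetical, and it assumes the build is reported as a string such as `snv_111b`, the form used elsewhere in this thread.

```shell
#!/bin/sh
# Sketch only: assumes a build string of the form "snv_<number>[letter]".
# Succeeds when the build number is >= 111 and < 113, i.e. the range in
# which the dmu_object_reclaim() deadlock (6826836) is said to be present.
build_is_affected() {
    num=$(printf '%s\n' "$1" | sed -n 's/^snv_\([0-9][0-9]*\).*$/\1/p')
    [ -n "$num" ] && [ "$num" -ge 111 ] && [ "$num" -lt 113 ]
}
```

Assuming `uname -v` reports the build in this form on the host, `build_is_affected "$(uname -v)" && echo "build is in the affected range"` would flag a 2009.06 (snv_111b) system and stay silent on snv_106.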
Robert Milkowski
2009-Jun-11 20:28 UTC
[zfs-discuss] [storage-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Hello Ian,

Saturday, June 6, 2009, 12:29:48 AM, you wrote:

IC> I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I
IC> didn't try savecore.

mdb -KF and then $<systemdump

ps. make sure you have access to the console

--
Best regards,
Robert Milkowski
http://milek.blogspot.com
Brent Jones
2009-Jun-12 01:43 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
> After examining the dump we got from you (thanks again), we're relatively
> sure you are hitting
>
> 6826836 Deadlock possible in dmu_object_reclaim()
>
> This was introduced in nv_111 and fixed in nv_113.
>
> Sorry for the trouble.
>
> -tim

Do you know when new builds will show up on pkg.opensolaris.org/dev?

--
Brent Jones
brent at servuhome.net
Ian Collins
2009-Jul-01 04:45 UTC
[zfs-discuss] ZFS snapshot send/recv "hangs" X4540 servers
Tim Haley wrote:
> Ian Collins wrote:
>> I now have a system with a hanging zfs receive, any hints on
>> debugging it?
>
> If you've got it stuck, but can still do things on the console, then
> run 'mdb -K' on the console and type '::stacks -m zfs'. That will
> summarize all threads running in the kernel related to zfs. Perhaps
> there will be a clue in the stacks of the receive(s) as to where they
> are stuck.

I've seen this again on Solaris 10u7. I'm doing a couple of full sends from one pool to another on the same host. One of the sends stopped while the other completed. All zfs commands on the source pool now hang; the destination pool is OK.

::stacks isn't recognised by mdb on Solaris 10. Is there an alternative?

> Also, make sure the spa is healthy and not suspended. This is an
> example on one of my machines.

There were.

--
Ian.