Joseph Mocker
2006-Jul-13 05:20 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM partitions to ZFS.

I used Live Upgrade to migrate from U1 to U2 and that went without a hitch on my SunBlade 2000. The initial conversion of one side of the UFS mirrors to a ZFS pool and the subsequent data migration also went fine. However, when I attempted to attach the second-side mirrors as a mirror of the ZFS pool, all hell broke loose.

The system more or less became unresponsive after a few minutes. It appeared that ZFS had taken all available memory, because I saw tons of errors on the console about failed memory allocations.

Any thoughts/suggestions?

The data I migrated consisted of about 80GB. Here's the general flow of what I did:

1. break the SVM mirrors
       metadetach d5 d51
       metadetach d6 d61
       metadetach d7 d71
2. remove the detached SVM submirrors
       metaclear d51
       metaclear d61
       metaclear d71
3. combine the partitions with format. They were contiguous
   partitions on s4, s5 & s6 of the disk; I just made a single
   partition on s4 and cleared s5 & s6.
4. create the pool
       zpool create storage cXtXdXs4
5. create three filesystems
       zfs create storage/app
       zfs create storage/work
       zfs create storage/extra
6. migrate the data
       cd /app;   find . -depth -print | cpio -pdmv /storage/app
       cd /work;  find . -depth -print | cpio -pdmv /storage/work
       cd /extra; find . -depth -print | cpio -pdmv /storage/extra
7. remove the other SVM mirrors
       umount /app;   metaclear d5 d50
       umount /work;  metaclear d6 d60
       umount /extra; metaclear d7 d70
8. combine the partitions of the second disk with format, the same
   way as in step 3: a single partition on s4, with s5 & s6 cleared.
9. attach the partition to the pool as a mirror
       zpool attach storage cXtXdXs4 cYtYdYs4

A few minutes after issuing the command the system became unresponsive as described above. I could reboot the system and it would boot up enough for me to look at the pool status with 'zpool status', at least for a little while (it appears that the resilver restarts every time I reboot).

I left the system running in hopes that it would complete resilvering overnight; otherwise I will probably have to attempt to detach the mirror.

I didn't see any posts with similar problems, but I did find at least a couple of similar memory consumption issues.

Help!

  --joe
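For anyone retracing these steps, a rough pre-flight sketch before step 9, just to confirm the second disk ended up laid out the way step 8 intended (the device name is the same placeholder used above):

    DISK=cYtYdY                      # placeholder for the second-side disk
    prtvtoc /dev/rdsk/${DISK}s2      # s4 should now span the old s4+s5+s6 range
    zpool status storage             # pool healthy before issuing the attach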
Dennis Clarke
2006-Jul-13 05:54 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
> Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM
> partitions to ZFS.
>
> I used Live Upgrade to migrate from U1 to U2 and that went without a
> hitch on my SunBlade 2000. The initial conversion of one side of the
> UFS mirrors to a ZFS pool and the subsequent data migration also went fine.
> However, when I attempted to attach the second-side mirrors as a mirror
> of the ZFS pool, all hell broke loose.
>
> The system more or less became unresponsive after a few minutes. It
> appeared that ZFS had taken all available memory, because I saw tons of
> errors on the console about failed memory allocations.
>
> Any thoughts/suggestions?
>
> The data I migrated consisted of about 80GB. Here's the general flow of
> what I did:
>
> 1. break the SVM mirrors
>        metadetach d5 d51
>        metadetach d6 d61
>        metadetach d7 d71
> 2. remove the detached SVM submirrors
>        metaclear d51
>        metaclear d61
>        metaclear d71
> 3. combine the partitions with format. They were contiguous
>    partitions on s4, s5 & s6 of the disk; I just made a single
>    partition on s4 and cleared s5 & s6.
> 4. create the pool
>        zpool create storage cXtXdXs4
> 5. create three filesystems
>        zfs create storage/app
>        zfs create storage/work
>        zfs create storage/extra
> 6. migrate the data
>        cd /app;   find . -depth -print | cpio -pdmv /storage/app
>        cd /work;  find . -depth -print | cpio -pdmv /storage/work
>        cd /extra; find . -depth -print | cpio -pdmv /storage/extra
> 7. remove the other SVM mirrors
>        umount /app;   metaclear d5 d50
>        umount /work;  metaclear d6 d60
>        umount /extra; metaclear d7 d70

Before you went any further here, did you issue a metastat command, and also
did you have any metadbs on that other disk before you nuked those slices?

Just asking here.

I am hoping that you did a metaclear d5 and then metaclear d50, in order to
clear out both the one-sided mirror as well as its component.

I'm just fishing around here ...

> 8. combine the partitions of the second disk with format, the same
>    way as in step 3: a single partition on s4, with s5 & s6 cleared.

Okay ... I hope that SVM was not looking for them. I guess you would get a
nasty stack of errors in that case.

> 9. attach the partition to the pool as a mirror
>        zpool attach storage cXtXdXs4 cYtYdYs4

So you wanted a mirror? Like:

# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s4  ONLINE       0     0     0
            c0t1d0s4  ONLINE       0     0     0

errors: No known data errors

That sort of deal?

Dennis Clarke
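A rough way to double-check the points Dennis raises here, before the freed slices are reused (d5/d50 and the s7 metadb slices are the names used in this thread):

    metastat d5 d50 2>&1     # both the mirror and its submirror should be gone
    metastat -p              # nothing should still be built on the reused slices
    metadb -i                # no state database replicas left on the s7 slices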
Joseph Mocker
2006-Jul-13 14:42 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
Dennis Clarke wrote:
>> Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM
>> partitions to ZFS.
>>
>> I used Live Upgrade to migrate from U1 to U2 and that went without a
>> hitch on my SunBlade 2000. The initial conversion of one side of the
>> UFS mirrors to a ZFS pool and the subsequent data migration also went fine.
>> However, when I attempted to attach the second-side mirrors as a mirror
>> of the ZFS pool, all hell broke loose.
>>
>> The system more or less became unresponsive after a few minutes. It
>> appeared that ZFS had taken all available memory, because I saw tons of
>> errors on the console about failed memory allocations.
>>
>> Any thoughts/suggestions?
>>
>> The data I migrated consisted of about 80GB. Here's the general flow of
>> what I did:
>>
>> 1. break the SVM mirrors
>>        metadetach d5 d51
>>        metadetach d6 d61
>>        metadetach d7 d71
>> 2. remove the detached SVM submirrors
>>        metaclear d51
>>        metaclear d61
>>        metaclear d71
>> 3. combine the partitions with format. They were contiguous
>>    partitions on s4, s5 & s6 of the disk; I just made a single
>>    partition on s4 and cleared s5 & s6.
>> 4. create the pool
>>        zpool create storage cXtXdXs4
>> 5. create three filesystems
>>        zfs create storage/app
>>        zfs create storage/work
>>        zfs create storage/extra
>> 6. migrate the data
>>        cd /app;   find . -depth -print | cpio -pdmv /storage/app
>>        cd /work;  find . -depth -print | cpio -pdmv /storage/work
>>        cd /extra; find . -depth -print | cpio -pdmv /storage/extra
>> 7. remove the other SVM mirrors
>>        umount /app;   metaclear d5 d50
>>        umount /work;  metaclear d6 d60
>>        umount /extra; metaclear d7 d70
>
> Before you went any further here, did you issue a metastat command, and also
> did you have any metadbs on that other disk before you nuked those slices?

I did have metadbs on the s7 slices, but I removed them with metadb. I did a
fair amount of metastats as well.

> Just asking here.
>
> I am hoping that you did a metaclear d5 and then metaclear d50, in order to
> clear out both the one-sided mirror as well as its component.
>
> I'm just fishing around here ...
>
>> 8. combine the partitions of the second disk with format, the same
>>    way as in step 3: a single partition on s4, with s5 & s6 cleared.
>
> Okay ... I hope that SVM was not looking for them. I guess you would get a
> nasty stack of errors in that case.

Yeah. Actually format was pretty helpful, as it told me that particular slices
of the disk were in use by SVM. I didn't have any problems with the first side
of the ZFS mirror.

>> 9. attach the partition to the pool as a mirror
>>        zpool attach storage cXtXdXs4 cYtYdYs4
>
> So you wanted a mirror? Like:
>
> # zpool status
>   pool: storage
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         storage       ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c0t0d0s4  ONLINE       0     0     0
>             c0t1d0s4  ONLINE       0     0     0
>
> errors: No known data errors
>
> That sort of deal?

Yes, that's exactly right.

Something that just occurred to me, which I will have to look at when I get to
the system, is that I don't recall whether I had any swap partitions enabled.
If I do, that could help ball up the system as it tries to swap stuff out to
disk in order to give space to ZFS.
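On the swap question, a quick sketch for checking once the box answers again; the device name in the last comment is hypothetical:

    swap -l                        # list configured swap devices and their free blocks
    grep swap /etc/vfstab          # and what would be re-added as swap at boot
    # if an old slice is still swapping, it can be removed on the fly, e.g.:
    # swap -d /dev/dsk/c1t1d0s1    # hypothetical device name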
Dennis Clarke
2006-Jul-13 18:31 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
> Who hoo! It looks like the resilver completed sometime over night. The
> system appears to be running normally (after one final reboot):
>
> mock@watt[1]: zpool status
>   pool: storage
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         storage       ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c1t2d0s4  ONLINE       0     0     0
>             c1t1d0s4  ONLINE       0     0     0
>
> errors: No known data errors

looks nice :-)

> I took a poke at the zfs bugs on SunSolve again, and found one that is
> the likely culprit:
>
>     6355416 zpool scrubbing consumes all memory, system hung
>
> Appears that a fix is in Nevada 36, hopefully it'll be back ported to a
> patch for 10.

whoa whoa ... just one bloody second .. whoa ..

That looks like a real nasty bug description there.

What are the details on that? Is this particular to a given system or
controller config or something like that, or are we talking global to Solaris
10 Update 2 everywhere?? :-(

Bug ID: 6355416
Synopsis: zpool scrubbing consumes all memory, system hung
Category: kernel
Subcategory: zfs
State: 10-Fix Delivered   <<-- in a patch somewhere?

Description:

On a 6800 domain with 8G of RAM I created a zpool using a single 18G drive
and on that pool created a file system and a zvol. The zvol was filled with
data.

# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
pool          11.0G  5.58G  9.00K  /pool
pool/fs          8K  5.58G     8K  /pool/fs
pool/fs@snap      0      -     8K  -
pool/root     11.0G  5.58G  11.0G  -
pool/root@1    783K      -  11.0G  -
#

I then attached a second 18G drive to the pool and all seemed well. After a
few minutes, however, the system ground to a halt. No response from the
keyboard.

Aborting the system, it failed to dump due to the dump device being too
small. On rebooting it did not make it into multi-user.

Booting milestone=none and then bringing it up by hand, I could see it hung
doing zfs mount -a.

Booting milestone=none again, I was able to export the pool and then the
system would come up into multi-user. Any attempt to import the pool would
hang the system, with vmstat showing it consumed all available memory.

With the pool exported I reinstalled the system with a larger dump device
and then imported the pool. The same hang occurred, however this time I got
the crash dump.

Dumps can be found here:

/net/enospc.uk/export/esc/pts-crashdumps/zfs_nomemory

Dump 0 is from stock build 72a, dump 1 from my workspace and had KMF_AUDIT
set. The only change in my workspace is to the isp driver.
::kmausers gives:

> ::kmausers
365010944 bytes for 44557 allocations with data size 8192:
        kmem_cache_alloc+0x148
        segkmem_xalloc+0x40
        segkmem_alloc+0x9c
        vmem_xalloc+0x554
        vmem_alloc+0x214
        kmem_slab_create+0x44
        kmem_slab_alloc+0x3c
        kmem_cache_alloc+0x148
        kmem_zalloc+0x28
        zio_create+0x3c
        zio_vdev_child_io+0xc4
        vdev_mirror_io_start+0x1ac
        spa_scrub_cb+0xe4
        traverse_segment+0x2e8
        traverse_more+0x7c
362520576 bytes for 44253 allocations with data size 8192:
        kmem_cache_alloc+0x148
        segkmem_xalloc+0x40
        segkmem_alloc+0x9c
        vmem_xalloc+0x554
        vmem_alloc+0x214
        kmem_slab_create+0x44
        kmem_slab_alloc+0x3c
        kmem_cache_alloc+0x148
        kmem_zalloc+0x28
        zio_create+0x3c
        zio_read+0x54
        spa_scrub_io_start+0x88
        spa_scrub_cb+0xe4
        traverse_segment+0x2e8
        traverse_more+0x7c
241177600 bytes for 376840 allocations with data size 640:
        kmem_cache_alloc+0x88
        kmem_zalloc+0x28
        zio_create+0x3c
        zio_vdev_child_io+0xc4
        vdev_mirror_io_done+0x254
        taskq_thread+0x1a0
209665920 bytes for 327603 allocations with data size 640:
        kmem_cache_alloc+0x88
        kmem_zalloc+0x28
        zio_create+0x3c
        zio_read+0x54
        spa_scrub_io_start+0x88
        spa_scrub_cb+0xe4
        traverse_segment+0x2e8
        traverse_more+0x7c

I have attached the full output.

If I am quick I can detach the disk and then export the pool before the
system grinds to a halt. Then, reimporting the pool, I can access the data.
Attaching the disk again results in the system using all the memory again.

Date Modified: 2005-11-25 09:03:07 GMT+00:00

Work Around:
Suggested Fix:
Evaluation:
Fixed by patch:
Integrated in Build: snv_36
Duplicate of:
Related Change Request(s): 6352306 6384439 6385428
Date Modified: 2006-03-23 23:58:15 GMT+00:00
Public Summary:
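For anyone trying to characterise this on their own box, a minimal sketch for logging kernel memory while the scrub/resilver runs and the machine still responds (mdb -k needs root):

    while :; do
            date
            echo ::memstat | mdb -k | egrep 'Kernel|Free'   # kernel vs. free pages
            vmstat 1 2 | tail -1                            # second sample shows current rates
            sleep 60
    done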
Daniel Rock
2006-Jul-13 19:31 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
Joseph Mocker wrote:
> Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM
> partitions to ZFS.
>
> I used Live Upgrade to migrate from U1 to U2 and that went without a
> hitch on my SunBlade 2000. The initial conversion of one side of the
> UFS mirrors to a ZFS pool and the subsequent data migration also went fine.
> However, when I attempted to attach the second-side mirrors as a mirror
> of the ZFS pool, all hell broke loose.

> 9. attach the partition to the pool as a mirror
>        zpool attach storage cXtXdXs4 cYtYdYs4
>
> A few minutes after issuing the command the system became unresponsive
> as described above.

Same here. I also upgraded to S10_U2 and converted my non-root metadevices in
a similar way. Everything went fine until the "zpool attach". The system
seemed to hang for at least 2-3 minutes; then I could type something again.
"top" then showed 98% system time.

This was on a SunBlade 1000 with 2 x 750MHz CPUs. The zpool/zfs was created
with checksum=sha256.


Daniel
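If the sha256 checksums are suspected of adding to the CPU load, the property is easy to inspect, and new writes can be switched back to the default; a sketch using the pool name from earlier in the thread:

    zfs get -r checksum storage      # which datasets use sha256, and where it was set
    # switching back only affects newly written blocks; existing blocks keep sha256:
    zfs set checksum=on storage      # "on" selects the default fletcher checksum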
Joseph Mocker
2006-Jul-13 20:01 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
Dennis Clarke wrote:
>
> whoa whoa ... just one bloody second .. whoa ..
>
> That looks like a real nasty bug description there.
>
> What are the details on that? Is this particular to a given system or
> controller config or something like that, or are we talking global to
> Solaris 10 Update 2 everywhere?? :-(

That's a good question. Looking at the internal evaluation, it appears scrubs
can be a little too aggressive. Perhaps one of the ZFS engineers can comment,
Jeff?

I am curious about the "fix delivered" state as well. Looks like it's been
fixed in snv_36, but I wonder if there will be a patch available.

  --joe
Rustam
2006-Aug-17 05:52 UTC
[zfs-discuss] Re: system unresponsive after issuing a zpool attach
I see similar behaviour on S10 U2, but in a different situation. I had a
working mirror with one side failed:

        mirror      DEGRADED     0     0     0
          c0d0      ONLINE       0     0     0
          c0d1      UNAVAILABLE  0     0     0

After replacing the corrupted hard disk I ran:

# zpool replace tank c0d1

It started the replacement/resilvering, and after a few minutes the system
became unavailable. A reboot only gives me a few minutes before the
resilvering makes the system unresponsive again.

Is there any workaround or patch for this problem?

This message posted from opensolaris.org
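Until there is a patch, the stop-gap described in the bug report earlier in this thread (boot without importing the pool, then export it) may at least keep the box usable; a rough SPARC sketch using the pool name from your message:

    # at the OBP prompt, boot without starting services so the pool is not imported:
    #   ok boot -m milestone=none
    # then, on the console:
    zpool export tank        # keep the pool exported until a fixed kernel is running
    svcadm milestone all     # continue up to the normal multi-user milestone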
George Wilson
2006-Aug-17 06:07 UTC
[zfs-discuss] system unresponsive after issuing a zpool attach
I believe this is what you're hitting:

6456888 zpool attach leads to memory exhaustion and system hang

We are currently looking at fixing this, so stay tuned.

Thanks,
George

Daniel Rock wrote:
> Joseph Mocker wrote:
>> Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS
>> SVM partitions to ZFS.
>>
>> I used Live Upgrade to migrate from U1 to U2 and that went without a
>> hitch on my SunBlade 2000. The initial conversion of one side of
>> the UFS mirrors to a ZFS pool and the subsequent data migration also
>> went fine. However, when I attempted to attach the second-side mirrors
>> as a mirror of the ZFS pool, all hell broke loose.
>
>> 9. attach the partition to the pool as a mirror
>>        zpool attach storage cXtXdXs4 cYtYdYs4
>>
>> A few minutes after issuing the command the system became unresponsive
>> as described above.
>
> Same here. I also upgraded to S10_U2 and converted my non-root
> metadevices in a similar way. Everything went fine until the
> "zpool attach". The system seemed to hang for at least 2-3 minutes;
> then I could type something again. "top" then showed 98% system time.
>
> This was on a SunBlade 1000 with 2 x 750MHz CPUs. The zpool/zfs was
> created with checksum=sha256.
>
>
> Daniel
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> mail.opensolaris.org/mailman/listinfo/zfs-discuss
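In the meantime, a trivial way to watch from another window (or a serial console) whether an attach/resilver is still making progress before the machine wedges; 'storage' is the pool name used earlier in the thread:

    while :; do
            date
            zpool status storage | grep 'scrub:'    # shows resilver progress or completion
            sleep 300
    done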