I am having a very odd problem, and so far the folks at Oracle Support have not provided a working solution, so I am asking the crowd here while still pursuing it via Oracle Support.

The system is a T2000 running 10U9 with CPU-2010-01 and two J4400 loaded with 1 TB SATA drives. There is one zpool on the J4400 (3 x 15 disk vdev + 3 hot spares). This system is the target for zfs send / recv replication from our production server. The OS is UFS on local disk.

While I was on vacation this T2000 hung with "out of resource" errors. Other staff tried rebooting, which hung the box. Then they rebooted off of an old BE (10U9 without CPU-2010-01). Oracle Support had them apply a couple of patches and an IDR to address zfs "stability and reliability problems", as well as set the following in /etc/system:

set zfs:zfs_arc_max = 0x700000000 (which is 28 GB)
set zfs:arc_meta_limit = 0x700000000 (which is 28 GB)

The system has 32 GB RAM and 32 (virtual) CPUs. They then tried importing the zpool and the system hung (after many hours) with the same "out of resource" error. At this point they left the problem for me :-(

I removed the zpool.cache from the 10U9 + CPU 2010-10 BE and booted from that. I then applied the IDR (IDR146118-12) and the zfs patch it depended on (145788-03). I did not set the zfs ARC and ARC meta limits, as I did not think them relevant. A zpool import shows the pool is OK, and a sampling of the drives with zdb -l shows good labels. I started importing the zpool, and after many hours it hung the system with "out of resource" errors. I had a number of tools running to see what was going on. The only thing this system is doing is importing the zpool.

ARC had climbed to about 8 GB and then declined to 3 GB by the time the system hung. This tells me that something else is consuming RAM and the ARC is releasing it.

The hung TOP screen showed the largest user process had only 148 MB allocated (and much less resident).

VMSTAT showed a scan rate of over 900,000 (NOT a typo) and almost 8 GB of free swap (so whatever is using memory cannot be paged out).

So my guess is that there is a kernel module that is consuming all (and more) of the RAM in the box. I am looking for a way to query how much RAM each kernel module is using and to script that in a loop (which will hang when the box runs out of RAM next). I am very open to suggestions here; a rough sketch of what I have in mind follows below.

Since this is the recv end of replication, I assume there was a zfs recv going on at the time the system initially hung. I know there was a 3+ TB snapshot replicating (via a 100 Mbps WAN link) when I left for vacation, and that may have still been running. I also assume that any partial snapshots (% instead of @) are removed when the pool is imported. But what could cause a partial snapshot removal, even of a very large snapshot, to run the system out of RAM? What caused the initial hang of the system (I assume it was out of RAM)? I did not think there was a limit to the size of either a snapshot or a zfs recv.
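To be concrete, this is the sort of crude kernel-memory logger I have in mind (a rough sketch only; it assumes the ::memstat and ::kmastat dcmds behave sanely on this kernel, the log path is arbitrary, and ::memstat can itself be slow on a 32 GB box, so once a minute may be optimistic):

#!/bin/ksh
# crude kernel memory logger - start it before kicking off the zpool import
while true
do
    date                                   # timestamp each sample
    echo '::memstat' | mdb -k              # page breakdown (kernel / anon / exec / free, etc.)
    echo '::kmastat' | mdb -k              # per-kmem-cache buffer and memory usage
    kstat -p zfs:0:arcstats:size           # current ARC size in bytes
    sleep 60
done >> /var/tmp/kmem-watch.log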
Hung TOP screen:

load averages: 91.43, 33.48, 18.989             xxx-xxx1               18:45:34
84 processes:  69 sleeping, 12 running, 1 zombie, 2 on cpu
CPU states: 95.2% idle,  0.5% user,  4.4% kernel,  0.0% iowait,  0.0% swap
Memory: 31.9G real, 199M free, 267M swap in use, 7.7G swap free

   PID USERNAME THR PR NCE  SIZE   RES STATE   TIME FLTS    CPU COMMAND
   533 root      51 59   0  148M 30.6M run   520:21    0  9.77% java
  1210 yyyyyy     1  0   0 5248K 1048K cpu25   2:08    0  2.23% xload
 14720 yyyyyy     1 59   0 3248K 1256K cpu24   1:56    0  0.03% top
   154 root       1 59   0 4024K 1328K sleep   1:17    0  0.02% vmstat
  1268 yyyyyy     1 59   0 4248K 1568K sleep   1:26    0  0.01% iostat
...

VMSTAT:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m2 m3   in   sy   cs us sy id
 0 0 112 8117096 211888 55 46 0 0 425 0 912684 0 0 0 0  976  166  836  0  2 98
 0 0 112 8117096 211936 53 51 6 0 394 0 926702 0 0 0 0  976  167  833  0  2 98

ARC size (B): 4065882656

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical
(http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
An additional data point: when I try to do a zdb -e -d and find the incomplete zfs recv snapshot, I get an error as follows:

# sudo zdb -e -d xxx-yy-01 | grep "%"
Could not open xxx-yy-01/aaa-bb-01/aaa-bb-01-01/%1309906801, error 16
#

Anyone know what error 16 means from zdb and how this might impact importing this zpool?
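If the 16 is just a raw errno value that zdb is passing back (I am guessing that it is), it can at least be mapped against the system headers:

# grep -w 16 /usr/include/sys/errno.h
#define EBUSY   16      /* Device busy                  */

which, if that mapping is right, would mean the incomplete receive dataset is being reported as busy rather than damaged.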
On Wed, Aug 3, 2011 at 9:19 AM, Paul Kraus <paul at kraus-haus.org> wrote:
> I am having a very odd problem, and so far the folks at Oracle
> Support have not provided a working solution, so I am asking the crowd
> here while still pursuing it via Oracle Support.
<snip>
Updates to my problem:

1. The destroy operation appears to be restarting from the same point after the system hangs and has to be rebooted. Oracle gave me the following to track progress:

echo '::pgrep "zpool$" |::walk thread|::findstack -v' | mdb -k | grep dsl_dataset_destroy

then take the first arg of dsl_dataset_destroy and

echo '<ARG>::print dsl_dataset_t ds_phys->ds_used_bytes' | mdb -k

I am logging these values every minute (the wrapper script is sketched below). Yesterday, when I started tracking this, I got a value of 0x75d97516b62; my last data point before the system hung was 0x4ee1098bdfd. My first data point today, after rebooting, restarting the logging scripts, and restarting the zpool import, is 0x7a0b0634a1b. So it looks like I've made no real progress.

2. It looks like the root cause of the original system crash that left the incomplete zfs recv snapshot is that a zfs recv filled the zpool (there are two parallel zfs recvs running, one for an old configuration (many datasets) and one for the new (one large dataset)). My replication script checks for free space before starting the replication, but we had a huge data load and a replication of it running (3 TB); when it started there was room for it, but other (much smaller) data loads and replications may have consumed that space. This system has no other activity on it; it is just a repository for this replicated data.

So ... it looks like I have:
- a full zpool
- an incomplete (corrupt?) snapshot from a zfs recv
... and every time I try to import this zpool I hang the system due to lack of memory (the box has 32 GB of RAM).

Any suggestions how to delete / destroy this incomplete snapshot without running the system out of RAM?
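For reference, the logging loop is nothing fancy. A rough sketch (the dsl_dataset_t address has to be re-captured by hand after every reboot, since it changes; the address and log path shown are made up):

#!/bin/ksh
# Usage: destroy-progress.sh <dsl_dataset_t address>, e.g.
#   ./destroy-progress.sh ffffff04e3e0a400        (address is hypothetical)
# The address is the first argument of dsl_dataset_destroy in the output of:
#   echo '::pgrep "zpool$" |::walk thread|::findstack -v' | mdb -k | grep dsl_dataset_destroy
ARG=$1
while true
do
    printf "%s " "`date '+%Y-%m-%d %H:%M:%S'`"                          # timestamp
    echo "${ARG}::print dsl_dataset_t ds_phys->ds_used_bytes" | mdb -k  # bytes still referenced
    sleep 60
done >> /var/tmp/destroy-progress.log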
On Wed, Aug 3, 2011 at 9:56 AM, Paul Kraus <paul at kraus-haus.org> wrote:
> An additional data point: when I try to do a zdb -e -d and find the
> incomplete zfs recv snapshot, I get an error as follows:
<snip>
Another update:

The configuration of the zpool is 45 x 1 TB drives in three vdevs of 15 drives each. We should have a net capacity of between 30 and 36 TB (and that agrees with my memory of the pool). I ran zdb -e -d against the pool (not imported) and totaled the sizes of the datasets (the totaling hack is sketched below), and came up with just about 11 TB. This also agrees with my memory (about 18 TB of data and about a 1.5x compression ratio).

If the failed snapshot / zfs recv is 3 TB (as I think it should be), or even almost 8 TB (as Oracle is telling me, based on some mdb -k examination of the dataset delete thread), I should still have almost 10 TB free.

I am making an assumption here: that the size listed for a dataset by zdb -d includes all snapshots of that dataset (much like the USED field of zfs list). If that is NOT the case, then I need to come up with a different way to estimate the "fullness" of this zpool.
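The totaling itself was just a throwaway one-liner along these lines (a sketch; it assumes each dataset line from zdb -d looks roughly like "Dataset pool/fs [ZPL], ID 21, cr_txg 14, 1.62T, 1234 objects" -- adjust the field parsing if the format differs on your build):

sudo zdb -e -d xxx-yy-01 2>/dev/null | nawk -F', ' '
/^Dataset / {
        sz   = $4                               # e.g. "1.62T"
        unit = substr(sz, length(sz), 1)        # trailing K/M/G/T suffix
        val  = substr(sz, 1, length(sz) - 1) + 0
        mult["K"] = 2^10; mult["M"] = 2^20; mult["G"] = 2^30; mult["T"] = 2^40
        total += (unit in mult) ? val * mult[unit] : sz + 0
}
END     { printf("total: %.2f TB\n", total / 2^40) }'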
On Thu, Aug 4, 2011 at 1:25 PM, Paul Kraus <paul at kraus-haus.org> wrote:
> Updates to my problem:
>
> 1. The destroy operation appears to be restarting from the same point
> after the system hangs and has to be rebooted.
<snip>
UPDATE (for those following along at home)...

After patching to the latest and greatest Solaris 10 kernel update, and getting the firmware on both OS drives (72 GB SAS) and the server updated to latest and greatest, Oracle has now officially declared it a bug (CR#7082249). No word yet on when I'll hear back on the status of this new bug (which looks like an old bug, but the old bug is fixed in the patches I'm now running).

On Wed, Aug 3, 2011 at 9:19 AM, Paul Kraus <paul at kraus-haus.org> wrote:
> I am having a very odd problem, and so far the folks at Oracle
> Support have not provided a working solution, so I am asking the crowd
> here while still pursuing it via Oracle Support.
>
> The system is a T2000 running 10U9 with CPU-2010-01 and two J4400
> loaded with 1 TB SATA drives. There is one zpool on the J4400 (3 x 15
> disk vdev + 3 hot spares). This system is the target for zfs send /
> recv replication from our production server. The OS is UFS on local
> disk.
<snip>

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical
(http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players