Lachlan Mulcahy
2011-Oct-31 19:11 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi Folks,

I have been having issues with Solaris kernel based systems "locking up" and am wondering if anyone else has observed a similar symptom before. Some information/background...

Systems the symptom has presented on: an NFS server (Nexenta Core 3.01) and a MySQL server (Solaris 11 Express).

The issue presents itself as almost total unresponsiveness -- I can no longer SSH to the host, and access on the local console (via Dell Remote Access Console) is also unresponsive. The only case in which I have seen some level of responsiveness is on the MySQL server: I was able to connect and issue extremely basic commands like SHOW PROCESSLIST -- anything else would just hang. I think this can be explained by the fact that MySQL keeps a thread cache (no need to allocate memory for a new thread on an incoming connection) and SHOW PROCESSLIST can be served almost entirely from already-allocated memory structures.

The NFS server has 48G of physical memory and no specifically tuned ZFS settings in /etc/system. The MySQL server has 80G of physical memory and has had a variety of ZFS tuning settings -- this is the system I am primarily focused on troubleshooting.

The primary cache for the MySQL data zpool is set to metadata only (InnoDB has its own buffer pool for data) and I have prefetch disabled, since InnoDB also does its own prefetching.

Originally, when the lock up was first observed, I had limited the ARC to 4G (to allow most memory to MySQL), and that is when I saw this lock up happen. I then re-tuned the server, thinking I wasn't allowing ZFS enough breathing room -- I didn't realise how much memory metadata can really consume for a 20TB zpool! So I removed the ARC limit and set the InnoDB buffer pool to 54G, down from the previous setting of 64G. This should allow about 26G for the kernel and ZFS.

The server ran fine for a few days, but then the symptom showed up again. I rebooted the machine and, interestingly, while MySQL was doing crash recovery the system locked up yet again!

Hardware wise we are using mostly Dell gear. The MySQL server is:

Dell R710 / 80G memory, with two daisy-chained MD1220 disk arrays - 22 disks each - 600GB 10k RPM SAS drives
Storage controller: LSI, Inc. 1068E (JBOD)

I have also seen similar symptoms on systems with MD1000 disk arrays containing 2TB 7200RPM SATA drives.

The only thing of note that seems to show up in the /var/adm/messages file on this MySQL server is:

Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:51 mslvstdp02r       mpt request inquiry page 0x89 for SATA target:58 failed!
Oct 31 18:24:52 mslvstdp02r scsi: [ID 583861 kern.info] ses0 at mpt0: unit-address 58,0: target 58 lun 0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 936769 kern.info] ses0 is /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0 (ses0) online
Oct 31 18:24:52 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:52 mslvstdp02r       mpt request inquiry page 0x89 for SATA target:59 failed!
Oct 31 18:24:53 mslvstdp02r scsi: [ID 583861 kern.info] ses1 at mpt0: unit-address 59,0: target 59 lun 0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 936769 kern.info] ses1 is /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0 (ses1) online

I'm thinking that the issue is memory related, so the current test I am running is:

ZFS tuneables in /etc/system:

# Limit the amount of memory the ARC cache will use
# See: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
# Limit to 24G
set zfs:zfs_arc_max = 25769803776

# Limit metadata to 20G
set zfs:zfs_arc_meta_limit = 21474836480

# Disable ZFS prefetch - InnoDB does its own
set zfs:zfs_prefetch_disable = 1

MySQL memory: the InnoDB buffer pool size is set to 44G (down another 10G from 54G). That should allow 44 + 24 = 68G for ARC and MySQL, and 12G for anything else that I haven't considered.

I am using arcstat.pl to collect/write stats on ARC size, hit ratio, requests, etc. to a file every 5 seconds, and vmstat also every 5 seconds. I'm hoping that, should the issue present itself again, I can find a possible cause. I'm really concerned about this issue, though -- we want to make use of ZFS in production, but these seemingly inexplicable lock ups are not filling us with confidence :(

Has anyone seen similar things before, and do you have any suggestions for what else I should consider looking at?

Thanks and Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
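For reference, a quick cross-check that tuning like the above has actually taken effect on the running system might look something like the following. This is a minimal sketch only -- "data/mysql" stands in for the real pool/dataset name, and the expected values simply mirror the settings shown above:

    # Dataset-level settings described earlier (placeholder dataset name)
    zfs set primarycache=metadata data/mysql   # keep only metadata in the ARC for this dataset
    zfs get primarycache data/mysql            # confirm the property is set

    # Confirm the /etc/system tuneables were picked up after the reboot
    echo "zfs_arc_max/E"          | mdb -k     # expect 25769803776 (24G)
    echo "zfs_arc_meta_limit/E"   | mdb -k     # expect 21474836480 (20G)
    echo "zfs_prefetch_disable/D" | mdb -k     # expect 1

    # Live view of the ARC target and current size
    kstat -p zfs:0:arcstats:c_max zfs:0:arcstats:size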
Marion Hakanson
2011-Oct-31 19:34 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
lmulcahy@marinsoftware.com said:
> . . .
> The MySQL server is:
> Dell R710 / 80G Memory with two daisy chained MD1220 disk arrays - 22 Disks
> each - 600GB 10k RPM SAS Drives
> Storage Controller: LSI, Inc. 1068E (JBOD)
>
> I have also seen similar symptoms on systems with MD1000 disk arrays
> containing 2TB 7200RPM SATA drives.
>
> The only thing of note that seems to show up in the /var/adm/messages file on
> this MySQL server is:
>
> Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/
> pci8086,3410@9/pci1000,3080@0 (mpt0): Oct 31 18:24:51 mslvstdp02r mpt
> request inquiry page 0x89 for SATA target:58 failed!
> . . .

Have you got the latest firmware on your LSI 1068E HBAs? These have been known to have lockups/timeouts when used with SAS expanders (disk enclosures) with incompatible firmware revisions, and/or with older mpt drivers.

The MD1220 is a 6Gbit/sec device. You may be better off with a matching HBA -- Dell has certainly told us the MD1200-series is not intended for use with the 3Gbit/sec HBAs. We're doing fine with the LSI SAS 9200-8e, for example, when connecting to Dell MD1200s with the 2TB "nearline SAS" disk drives.

Last, are you sure it's memory-related? You might keep an eye on arcstat.pl output and see what the ARC sizes look like just prior to lockup. Also, maybe you can look up instructions on how to force a crash dump when the system hangs -- one of the experts around here could tell a lot from a crash dump file.

Regards,

Marion
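On the crash dump suggestion: one common approach on x86 Solaris is sketched below. The tuneable names are from memory and can differ between releases, so treat them as assumptions to verify against the documentation for your particular build:

    # /etc/system -- panic (and therefore dump) instead of hanging silently
    set snooping=1                       # deadman timer: panic if the clock thread stops being serviced
    set pcplusmp:apic_panic_on_nmi=1     # panic on NMI, so an NMI sent from the DRAC forces a dump

    # Make sure a dump device and savecore are configured
    dumpadm                              # show the current dump device / savecore directory
    dumpadm -y                           # enable savecore on reboot

    # On a box that is still responsive, a live dump can be taken with
    # (requires a dedicated dump device):
    savecore -L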
Lachlan Mulcahy
2011-Oct-31 21:10 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi Marion,

Thanks for your swift reply!

> Have you got the latest firmware on your LSI 1068E HBAs? These have been
> known to have lockups/timeouts when used with SAS expanders (disk enclosures)
> with incompatible firmware revisions, and/or with older mpt drivers.

I'll need to check that out -- I'm 90% sure that these are fresh out of the box HBAs.

Will try an upgrade there and see if we get any joy...

> The MD1220 is a 6Gbit/sec device. You may be better off with a matching
> HBA -- Dell has certainly told us the MD1200-series is not intended for
> use with the 3Gbit/sec HBAs. We're doing fine with the LSI SAS 9200-8e,
> for example, when connecting to Dell MD1200s with the 2TB "nearline SAS"
> disk drives.

I was aware the MD1220 is a 6G device, but I figured that since our IO throughput doesn't actually come close to saturating 3Gbit/sec, it would just operate at the lower speed and be OK. I guess it is something to look at if I run out of other options...

> Last, are you sure it's memory-related? You might keep an eye on arcstat.pl
> output and see what the ARC sizes look like just prior to lockup. Also,
> maybe you can look up instructions on how to force a crash dump when the
> system hangs -- one of the experts around here could tell a lot from a
> crash dump file.

I'm starting to doubt that it is a memory issue now -- especially since I now have some results from my latest "test"...

The output of arcstat.pl looked like this just prior to the lock up:

    time  arcsz     c  mh%  mhit  hit%  hits  l2hit%  l2hits
19:57:36    24G   24G   94   161    61   194       1       1
19:57:41    24G   24G   96   174    62   213       0       0
19:57:46    23G   24G   94   161    62   192       1       1
19:57:51    24G   24G   96   169    63   205       0       0
19:57:56    24G   24G   95   169    61   206       0       0

^-- This is the very last line printed. I actually discovered and rebooted the machine via DRAC at around 20:44, so it had been in its bad state for around 1 hour.

Some snippets from the output some 20 minutes earlier show the point at which arcsz grew to reach the maximum:

    time  arcsz     c  mh%  mhit  hit%  hits  l2hit%  l2hits
19:36:45    21G   24G   95   152    58   177       0       0
19:37:00    22G   24G   95   156    57   182       0       0
19:37:15    22G   24G   95   159    59   185       0       0
19:37:30    23G   24G   94   153    58   178       0       0
19:37:45    23G   24G   95   169    59   195       0       0
19:38:00    24G   24G   95   160    59   187       0       0
19:38:25    24G   24G   96   151    58   177       0       0

So it seems that arcsz reaching the 24G maximum wasn't necessarily to blame, since the system operated for a good 20 minutes in this state.

I was also logging "vmstat 5" prior to the crash (though I forgot to include timestamps in my output) and these are the final lines recorded in that log:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 25885248 18012208 71 2090 0 0 0 0 0 0 0 0 22 17008 210267 30229  1  5 94
 0 0 0 25884764 18001848 71 2044 0 0 0 0 0 0 0 0 25 14846 151228 25911  1  5 94
 0 0 0 25884208 17991876 71 2053 0 0 0 0 0 0 0 0  8 16343 185416 28946  1  5 93

So it seems there was some 17-18G free in the system when the lock up occurred. Curious...

I was also capturing some ARC info from mdb -k, and the output prior to the lock up was:

Monday, October 31, 2011 07:57:51 PM UTC
arc_no_grow     =     0
arc_tempreserve =     0 MB
arc_meta_used   =  4621 MB
arc_meta_limit  = 20480 MB
arc_meta_max    =  4732 MB

Monday, October 31, 2011 07:57:56 PM UTC
arc_no_grow     =     0
arc_tempreserve =     0 MB
arc_meta_used   =  4622 MB
arc_meta_limit  = 20480 MB
arc_meta_max    =  4732 MB

Looks like metadata was not primarily responsible for consuming all of that 24G of ARC in the arcstat.pl output...
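As an aside, the same capture can be scripted so that every sample carries a timestamp, along these lines. This is a sketch only -- the log paths and the arcstat.pl location are placeholders, and vmstat's -T d timestamp option is assumed to be available on this release:

    #!/bin/sh
    # Log selected ::arc fields from mdb with a timestamp every 5 seconds
    while true; do
        date -u                                                  >> /var/tmp/arc_mdb.log
        echo '::arc' | mdb -k | egrep 'no_grow|tempreserve|meta' >> /var/tmp/arc_mdb.log
        sleep 5
    done &

    # vmstat with a date stamp on each sample, and arcstat.pl every 5 seconds
    vmstat -T d 5         >> /var/tmp/vmstat.log &
    /var/tmp/arcstat.pl 5 >> /var/tmp/arcstat.log &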
Also, it seems there was nothing interesting in /var/adm/messages leading up to my rebooting:

Oct 31 18:42:57 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:44:01 mslvstdp02r last message repeated 1 time
Oct 31 18:45:05 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:46:09 mslvstdp02r last message repeated 1 time
Oct 31 18:47:23 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:06:13 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:09:27 mslvstdp02r last message repeated 4 times
Oct 31 19:25:04 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:28:17 mslvstdp02r last message repeated 3 times
Oct 31 19:46:17 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:49:32 mslvstdp02r last message repeated 4 times
Oct 31 20:44:33 mslvstdp02r genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_151a 64-bit
Oct 31 20:44:33 mslvstdp02r genunix: [ID 877030 kern.notice] Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.

Just some ntpd stuff really...

I'm going to check out a firmware upgrade next. I can't believe this is really an out of memory situation now, when there is 17-18G free reported by vmstat... Let's see...

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
Lachlan Mulcahy
2011-Nov-01 02:18 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi All/Marion,

A small update...

>> These have been known to have lockups/timeouts when used with SAS expanders
>> (disk enclosures) with incompatible firmware revisions, and/or with older
>> mpt drivers.
>
> I'll need to check that out -- I'm 90% sure that these are fresh out of
> the box HBAs.
>
> Will try an upgrade there and see if we get any joy...

We did not have the latest firmware on the HBA -- through a lot of pain I managed to boot into an MS-DOS disk and run the firmware update. We're now running the latest for this card from the LSI.com website (both HBA BIOS and firmware).

>> The MD1220 is a 6Gbit/sec device. You may be better off with a matching
>> HBA -- Dell has certainly told us the MD1200-series is not intended for
>> use with the 3Gbit/sec HBAs. We're doing fine with the LSI SAS 9200-8e,
>> for example, when connecting to Dell MD1200s with the 2TB "nearline SAS"
>> disk drives.
>
> I was aware the MD1220 is a 6G device, but I figured that since our IO
> throughput doesn't actually come close to saturating 3Gbit/sec, it
> would just operate at the lower speed and be OK. I guess it is something to
> look at if I run out of other options...

This was my mistake -- this particular system has MD1120s attached to it. We have a mix of 1220s and 1120s, since we've been with Dell since the 1120s were the current model.

I've just kicked off the system running with the same logging as before on this new firmware, so I'll see if it goes any better.

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
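For anyone repeating the exercise, a quick way to see whether the earlier mpt warnings persist after an update is simply to watch the messages file. fwflash -l may also list device firmware revisions on some builds, but that depends on which fwflash plugins are installed, so treat it as an assumption rather than something verified here:

    # Check whether the mpt inquiry warnings reported earlier still appear
    egrep -i "mpt|inquiry page 0x89" /var/adm/messages | tail -20

    # On some builds this lists flashable devices and their firmware revisions
    # (assumption -- depends on the installed fwflash plugins)
    fwflash -l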
Lachlan Mulcahy
2011-Nov-01 04:46 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi All,

> We did not have the latest firmware on the HBA -- through a lot of pain I
> managed to boot into an MS-DOS disk and run the firmware update. We're now
> running the latest for this card from the LSI.com website (both HBA BIOS
> and firmware).

No joy... the system seized up again within a few hours of coming back up.

Now trying another suggestion sent to me by a direct poster:

* Recommendation from Sun (Oracle) to work around a bug:
* 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
set idle_cpu_prefer_mwait = 0
set idle_cpu_no_deep_c = 1

This was apparently the cause of a similar symptom for them, and we are using Nehalem.

At this point I'm running out of options, so it can't hurt to try it.

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
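A sketch of how the workaround can be sanity-checked after the reboot -- the kstat statistic names are as recalled from Nehalem-era builds, so verify them on your release:

    # Confirm the /etc/system settings were applied
    echo "idle_cpu_prefer_mwait/D" | mdb -k    # expect 0
    echo "idle_cpu_no_deep_c/D"    | mdb -k    # expect 1

    # Check which C-state the CPUs are actually sitting in
    kstat -p cpu_info | egrep "current_cstate|supported_max_cstates"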
Richard Elling
2011-Nov-01 05:29 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
FWIW, we recommend disabling C-states in the BIOS for NexentaStor systems. C-states are evil.
 -- richard

On Oct 31, 2011, at 9:46 PM, Lachlan Mulcahy wrote:

> Hi All,
>
>> We did not have the latest firmware on the HBA -- through a lot of pain I
>> managed to boot into an MS-DOS disk and run the firmware update. We're now
>> running the latest for this card from the LSI.com website (both HBA BIOS
>> and firmware).
>
> No joy... the system seized up again within a few hours of coming back up.
>
> Now trying another suggestion sent to me by a direct poster:
>
> * Recommendation from Sun (Oracle) to work around a bug:
> * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
> set idle_cpu_prefer_mwait = 0
> set idle_cpu_no_deep_c = 1
>
> This was apparently the cause of a similar symptom for them, and we are using Nehalem.
>
> At this point I'm running out of options, so it can't hurt to try it.
>
> Regards,
> --
> Lachlan Mulcahy
> Senior DBA, Marin Software Inc.
> San Francisco, USA
>
> AU Mobile: +61 458 448 721
> US Mobile: +1 (415) 867 2839
> Office   : +1 (415) 671 6080

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
Lachlan Mulcahy
2011-Nov-03 00:24 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi All,

> No joy... the system seized up again within a few hours of coming back up.
>
> Now trying another suggestion sent to me by a direct poster:
>
> * Recommendation from Sun (Oracle) to work around a bug:
> * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
> set idle_cpu_prefer_mwait = 0
> set idle_cpu_no_deep_c = 1
>
> This was apparently the cause of a similar symptom for them, and we are
> using Nehalem.
>
> At this point I'm running out of options, so it can't hurt to try it.

So far the system has been running without any lock ups since very late Monday evening -- we're now almost 48 hours on.

So far so good, but it's hard to be certain this is the solution, since I could never prove it was the root cause.

For now I'm just continuing to test and build confidence. More time will make me more confident -- maybe a week or so...

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
Edward Ned Harvey
2011-Nov-03 12:15 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
> From: zfs-discuss-bounces@opensolaris.org
> [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Lachlan Mulcahy
>
> I have been having issues with Solaris kernel based systems "locking up" and
> am wondering if anyone else has observed a similar symptom before.
>
> ...
>
> Dell R710 / 80G Memory with two daisy chained MD1220 disk arrays - 22 Disks
> each - 600GB 10k RPM SAS Drives
> Storage Controller: LSI, Inc. 1068E (JBOD)

Please see
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-November/046189.html

But I'll need to expand upon this a little more here: when we bought that system, Solaris was a supported OS on the R710. We paid for Oracle gold support (or whatever they called it) and we dug into it for hours and hours, weeks and weeks, and never got anywhere.

When I replaced the NIC (don't use the built-in bcom NIC) it became much better. It went from crashing weekly to crashing monthly.

I don't believe you'll ever be able to make the problem go away completely. This is the nature of running on unsupported hardware -- even if you pay and get a support contract, they just don't develop or test on that platform in any quantity, so the end result is crap.

We have since reprovisioned the R710 to other purposes, where it's perfectly stable. We have also bought a Sun (Oracle) server to fill the requirements that were formerly filled by the R710 with Solaris, and it's also perfectly stable.
Edward Ned Harvey
2011-Nov-03 12:17 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
> From: zfs-discuss-bounces@opensolaris.org
> [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Lachlan Mulcahy
>
> * Recommendation from Sun (Oracle) to work around a bug:
> * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
> set idle_cpu_prefer_mwait = 0
> set idle_cpu_no_deep_c = 1
>
> Was apparently the cause of a similar symptom for them and we are using
> Nehalem.

FWIW, we also disabled the C-states. It seemed to make an improvement, but not what I would call a fix for us.
Lachlan Mulcahy
2011-Nov-03 23:09 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi Edward,

Thanks for your input.

> Please see
> http://mail.opensolaris.org/pipermail/zfs-discuss/2010-November/046189.html
>
> But I'll need to expand upon this a little more here: when we bought that
> system, Solaris was a supported OS on the R710. We paid for Oracle gold
> support (or whatever they called it) and we dug into it for hours and
> hours, weeks and weeks, and never got anywhere.
>
> When I replaced the NIC (don't use the built-in bcom NIC) it became much
> better. It went from crashing weekly to crashing monthly.

We have had no end of trouble with those junky Broadcom NICs. When we originally moved to all Dell hardware from a hosted solution, we had a load of problems with the NICs just dropping packets when the CPU got busy. The platform for those machines was CentOS 5.5 -- it never caused instability or server crashes, however.

We generally use Intel NICs for our production interfaces now. Over the years I've found that Intel simply makes rock solid NICs. We still use those Broadcom NICs, but mostly for out of band/maintenance network access.

On the host in question we are using Intel for the main production interface and a Broadcom device for maintenance. If I see any more issues, I'll consider disabling the onboard Broadcoms via the BIOS. So far, with the sleep states disabled, it seems better.

> I don't believe you'll ever be able to make the problem go away completely.
> This is the nature of running on unsupported hardware -- even if you pay and
> get a support contract, they just don't develop or test on that platform
> in any quantity, so the end result is crap.
>
> We have since reprovisioned the R710 to other purposes, where it's perfectly
> stable. We have also bought a Sun (Oracle) server to fill the requirements
> that were formerly filled by the R710 with Solaris, and it's also perfectly
> stable.

Unfortunately, the point of using Solaris/ZFS for us in this particular instance is to avoid having to buy more hardware. We have a system that just needs to hold a lot of data, and RAIDZ2 with lzjb gets us about 4-5x the usable space of a regular xfs/ext4 RAID-10 with the same disks (rough arithmetic below).

This system is legacy and just needs to live for another few months and support the data growth, so the idea here is to try to avoid spending money on something that is going away. This sort of precludes buying SnOracle (Sun) hardware to run Solaris on tried and true gear. :-/

Thanks and Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
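To put rough numbers behind the 4-5x figure -- back-of-the-envelope only, assuming one 22-disk shelf split into two RAIDZ2 vdevs and an lzjb compression ratio in the range typically seen on database data; the actual ratio is whatever the pool reports:

    RAID-10 over 22 disks        -> 11 disks of usable capacity
    2 x 11-disk RAIDZ2 vdevs     -> 18 disks of usable capacity (~1.6x RAID-10)
    lzjb at roughly 2.5-3x       -> about 4-5x the effective space of RAID-10

    # The ratio actually achieved on a pool is reported by (placeholder pool name):
    zfs get compressratio datapool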
Lachlan Mulcahy
2011-Nov-08 22:27 UTC
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi All,

On Wed, Nov 2, 2011 at 5:24 PM, Lachlan Mulcahy <lmulcahy@marinsoftware.com> wrote:

>> Now trying another suggestion sent to me by a direct poster:
>>
>> * Recommendation from Sun (Oracle) to work around a bug:
>> * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
>> set idle_cpu_prefer_mwait = 0
>> set idle_cpu_no_deep_c = 1
>>
>> This was apparently the cause of a similar symptom for them, and we are
>> using Nehalem.
>>
>> At this point I'm running out of options, so it can't hurt to try it.
>
> So far the system has been running without any lock ups since very late
> Monday evening -- we're now almost 48 hours on.
>
> So far so good, but it's hard to be certain this is the solution, since I
> could never prove it was the root cause.
>
> For now I'm just continuing to test and build confidence. More time
> will make me more confident -- maybe a week or so...

We're now over a week running with C-states disabled and have not experienced any further system lock ups.

I am feeling much more confident in this system now -- it will probably see at least another week or two of additional load/QA testing and then be pushed into production.

I will update if I see the issue crop up again, but for anyone else experiencing a similar symptom, I'd highly recommend trying this as a solution. So far it seems to have worked for us.

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080