Good afternoon,

I have a ~600GB zpool living on older Xeons. The system has 8GB of RAM. The
pool is hanging off two LSI Logic SAS3041X-Rs (no RAID configured).

When I put a moderate amount of load on the zpool (like, say, copying many
files locally, or deleting a large number of ZFS fs), the system hangs and
becomes completely unresponsive, requiring a reboot.

The ARC never gets over ~40MB.

The system is running Sol10u4.

Are there any suggested tunables for running big zpools on 32bit?

Cheers.
--
bda
Cyberpunk is dead. Long live cyberpunk.
http://mirrorshades.org

On Wed, Aug 6, 2008 at 13:31, Bryan Allen <bda at mirrorshades.net> wrote:
> I have a ~600GB zpool living on older Xeons. The system has 8GB of RAM. The
> pool is hanging off two LSI Logic SAS3041X-Rs (no RAID configured).

You might try taking out 4GB of the RAM (!). Some 32-bit drivers have problems
doing DMA to addresses above 4GB, so limiting yourself to that much might at
least eliminate that source of problems.

Will

For what it's worth, I see this as well on 32-bit Xeons with 1GB of RAM and
dual AOC-SAT2-MV8s: large amounts of I/O sometimes result in a lockup that
requires a reboot (though my setup is Nexenta b85). Nothing shows up in the
logs, and the load average does not increase significantly.

It could be the usual Marvell driver issues, but it is definitely not cool
when it happens.

Thomas

On Wed, Aug 6, 2008 at 1:31 PM, Bryan Allen <bda at mirrorshades.net> wrote:
>
> Good afternoon,
>
> I have a ~600GB zpool living on older Xeons. The system has 8GB of RAM. The
> pool is hanging off two LSI Logic SAS3041X-Rs (no RAID configured).
>
> When I put a moderate amount of load on the zpool (like, say, copying many
> files locally, or deleting a large number of ZFS fs), the system hangs and
> becomes completely unresponsive, requiring a reboot.
>
> The ARC never gets over ~40MB.
>
> The system is running Sol10u4.
>
> Are there any suggested tunables for running big zpools on 32bit?
>
> Cheers.

In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches)
all the known marvell88sx problems have long ago been dealt with.

However, I've said this before: Solaris on 32-bit platforms has problems and
is not to be trusted. There are far, far too many places in the source
code where a 64-bit object is either loaded or stored without any atomic
locking, which could result in any number of wrong and bad behaviors.
ZFS has some problems of this sort, but so does some of the low-level 32-bit
x86 code. The problem was reported long ago, but to the best of my knowledge
the issues have not been addressed. Looking below, it appears that nothing
has been done for about 9 months.

Here is the top of the bug report:

Bug ID                 6634371
Synopsis               Solaris ON is broken w.r.t. 64-bit operations on 32-bit processors
State                  1-Dispatched (Default State)
Category:Subcategory   kernel:other
Keywords               32-bit | 64-bit | atomic
Reported Against
Duplicate Of
Introduced In
Commit to Fix
Fixed In
Release Fixed
Related Bugs
Submit Date            27-NOV-2007
Last Update Date       28-NOV-2007

This message posted from opensolaris.org

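To make the failure mode described above concrete, here is a small simulation of a "torn" 64-bit read. It is only a sketch under stated assumptions: it models a 64-bit value as two 32-bit words and interleaves a reader between the writer's two stores. The class and function names are hypothetical, and nothing here is Solaris kernel code.

```python
# Simulation only: models a 64-bit counter kept as two 32-bit words, which is
# how a 32-bit CPU stores it when no atomic 64-bit access or lock is used.
# SplitCounter, store_steps and demo are hypothetical names for illustration.

MASK32 = 0xFFFFFFFF


class SplitCounter:
    """A 64-bit value held as separate low and high 32-bit halves."""

    def __init__(self, value):
        self.lo = value & MASK32
        self.hi = (value >> 32) & MASK32

    def store_steps(self, value):
        """Store the value one half at a time, yielding after each store
        so that a reader can be interleaved between the two stores."""
        self.lo = value & MASK32          # first 32-bit store
        yield
        self.hi = (value >> 32) & MASK32  # second 32-bit store
        yield

    def load(self):
        """Combine the two halves, as an unlocked 64-bit read would."""
        return (self.hi << 32) | self.lo


def demo():
    old, new = 0x00000000FFFFFFFF, 0x0000000100000000
    counter = SplitCounter(old)
    writer = counter.store_steps(new)

    next(writer)            # writer has stored only the low half so far
    torn = counter.load()   # reader runs between the two stores
    next(writer)            # writer now stores the high half
    final = counter.load()
    return old, torn, final


if __name__ == "__main__":
    old, torn, final = demo()
    print(f"old={old:#018x} torn={torn:#018x} final={final:#018x}")
```

The value seen by the interleaved reader matches neither the old nor the new value, which is the kind of wrong behavior a missing lock or atomic operation can produce on a 32-bit processor.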
Brian D. Horn wrote:
> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches)
> all the known marvell88sx problems have long ago been dealt with.
>
> However, I've said this before. Solaris on 32-bit platforms has problems and
> is not to be trusted. There are far, far too many places in the source
> code where a 64-bit object is either loaded or stored without any atomic
> locking occurring which could result in any number of wrong and bad behaviors.
> ZFS has some problems of this sort, but so does some of the low level 32-bit
> x86 code. The problem was reported long ago, but to the best of my knowledge
> the issues have not been addressed. Looking below it appears that nothing
> has been done for about 9 months.
>
> Here is the top of the bug report:
>
> Bug ID 6634371
> Synopsis Solaris ON is broken w.r.t. 64-bit operations on 32-bit processors
> State 1-Dispatched (Default State)
> Category:Subcategory kernel:other

I believe you misfiled that bug. I've redirected it to
solaris / kernel / arch-x86, which appears to me to be more appropriate.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

Brian D. Horn wrote:
> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches)
> all the known marvell88sx problems have long ago been dealt with.

Not true. The working marvell patches still have not been released for
Solaris. They're still just IDRs. Unless you know something I (and my
Sun support reps) don't, in which case please provide patch numbers.

--
Carson

As far as I can tell from the patch web pages:

For Solaris 10 x86, 138053-01 should have the fixes (it does depend on other
earlier patches, though). I find it very difficult to tell what the story is
with patches, as the patch numbers seem to have very little in them to
correlate them to code changes.

For Solaris Nevada/OpenSolaris, it would seem that the fixes went back on
Feb 11, 2008 (though there have been additional changes to the sata module
since then).

Pretty much, if you have a version of the driver that still spews
informational messages with "marvell88sx" in them, you are running "old"
stuff. If those messages have been suppressed, odds are that you have new
stuff.

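The rule of thumb above (informational messages mentioning "marvell88sx" suggest an older driver) can be sketched as a simple log scan. This is only an illustration: the function names are hypothetical, the sample lines are synthetic placeholders rather than real driver output, and which log file to read is left to the reader.

```python
# Sketch of the heuristic from the message above. All names are hypothetical
# and the sample log lines are synthetic placeholders, not real driver output.


def marvell_message_lines(log_text):
    """Return the log lines that mention the marvell88sx driver."""
    return [line for line in log_text.splitlines() if "marvell88sx" in line]


def driver_looks_old(log_text):
    """Apply the rule of thumb: any marvell88sx chatter suggests an old driver."""
    return len(marvell_message_lines(log_text)) > 0


SAMPLE_LOG = "\n".join([
    "placeholder: sata framework attached",
    "placeholder: marvell88sx informational message 1",
    "placeholder: unrelated message",
    "placeholder: marvell88sx informational message 2",
])

if __name__ == "__main__":
    for line in marvell_message_lines(SAMPLE_LOG):
        print(line)
    print("looks old:", driver_looks_old(SAMPLE_LOG))
```

A quiet log is only weak evidence of a newer driver, as the original wording ("odds are") already implies.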
On Wed, Aug 6, 2008 at 6:22 PM, Carson Gaspar <carson at taltos.org> wrote:
> Brian D. Horn wrote:
>> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches)
>> all the known marvell88sx problems have long ago been dealt with.
>
> Not true. The working marvell patches still have not been released for
> Solaris. They're still just IDRs. Unless you know something I (and my
> Sun support reps) don't, in which case please provide patch numbers.

I was able to get a Tpatch this week, with encouraging words about a likely
release of 138053-02 this week. In a separate thread last week (?), Enda said
that it should be out within a couple of weeks.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/

On Thu, Aug 7, 2008 at 5:32 AM, Peter Bortas <bortas at gmail.com> wrote:
> On Wed, Aug 6, 2008 at 7:31 PM, Bryan Allen <bda at mirrorshades.net> wrote:
>>
>> Good afternoon,
>>
>> I have a ~600GB zpool living on older Xeons. The system has 8GB of RAM. The
>> pool is hanging off two LSI Logic SAS3041X-Rs (no RAID configured).
>>
>> When I put a moderate amount of load on the zpool (like, say, copying many
>> files locally, or deleting a large number of ZFS fs), the system hangs and
>> becomes completely unresponsive, requiring a reboot.
>
> I have the same problem with 32bit, 2GiB RAM and 6 disks in a 2.7T
> raidz on snv_81. Slightly unbalanced one might say, but it shouldn't
> lock up regardless.

Forgot to mention I run with different controllers: 2 x Sil3114 PCI cards.

--
Peter Bortas

Bryan, Thomas: these hangs of 32-bit Solaris under heavy (fs, I/O) loads are a
well-known problem. They are caused by memory contention in the kernel heap.
Check 'kstat vmem::heap'. The usual recommendation is to change the
kernelbase. It worked for me. See:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046710.html
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046715.html

-marc

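As a rough illustration of the two ideas above, the sketch below computes how full the kernel heap arena is from kstat-style text, and how much kernel virtual address space a given kernelbase leaves on a 32-bit system. Several things are assumptions rather than facts from the thread: the sample text imitates the parseable `kstat -p` layout with made-up numbers, the statistic names `mem_inuse` and `mem_total` and the instance number are assumed, and 0x80000000 is only an example kernelbase value. The helper names are hypothetical.

```python
# Assumption: text in the parseable "module:instance:name:statistic value"
# layout. The instance number, statistic names and numbers below are made up
# for illustration and do not come from the thread.
SAMPLE = (
    "vmem:1:heap:mem_inuse\t900000000\n"
    "vmem:1:heap:mem_total\t1000000000\n"
)

ADDRESS_SPACE_32BIT = 2**32  # 4 GiB of virtual address space on 32-bit x86


def parse_kstat_p(text):
    """Map each statistic name (last colon-separated field) to its raw value."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(None, 1)
        stats[key.rsplit(":", 1)[-1]] = value.strip()
    return stats


def heap_utilization(stats):
    """Fraction of the heap arena in use (assumed statistic names)."""
    return int(stats["mem_inuse"]) / int(stats["mem_total"])


def kernel_va_bytes(kernelbase):
    """Kernel virtual address space: from kernelbase up to the 4 GiB limit."""
    return ADDRESS_SPACE_32BIT - kernelbase


if __name__ == "__main__":
    stats = parse_kstat_p(SAMPLE)
    print(f"heap in use: {heap_utilization(stats):.0%}")
    gib = kernel_va_bytes(0x80000000) / 2**30
    print(f"kernel VA with kernelbase=0x80000000: {gib:.0f} GiB")
```

Lowering kernelbase gives the kernel more virtual address space for its heap, at the cost of a smaller address space for each user process, which is why it is a trade-off and not a free fix.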
On Thu, Aug 7, 2008 at 5:53 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> Bryan, Thomas: these hangs of 32-bit Solaris under heavy (fs, I/O) loads are a
> well known problem. They are caused by memory contention in the kernel heap.
> Check 'kstat vmem::heap'. The usual recommendation is to change the
> kernelbase. It worked for me. See:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046710.html
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046715.html

Thanks Marc!

--
Peter Bortas

Yes, there have been bugs with heavy I/O and ZFS running the system out of
memory. However, there was a claim in the thread that the hangs might be
due to marvell88sx driver bugs (most likely not). Further, my point about
32-bit Solaris being unsafe at any speed still stands.

Without analysis of a specific hang, it is very hard to say what caused it.
It could be the driver, memory exhaustion, a file system error, a VM error,
broken hardware, or any number of other things.

My points were:

1) The marvell88sx driver should be pretty solid at this point in time (yes,
   earlier releases had problems, most of which were related to bad block
   handling), and
2) There are systemic issues in Solaris on 32-bit architectures (of which
   only x86 is supported).

+------------------------------------------------------------------------------
| On 2008-08-07 03:53:04, Marc Bevand wrote:
|
| Bryan, Thomas: these hangs of 32-bit Solaris under heavy (fs, I/O) loads are a
| well known problem. They are caused by memory contention in the kernel heap.
| Check 'kstat vmem::heap'. The usual recommendation is to change the
| kernelbase. It worked for me. See:
|
| http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046710.html
| http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046715.html

Marc,

That definitely seems to have made quite a difference. Thanks very much for
the help!

--
bda
Cyberpunk is dead. Long live cyberpunk.
http://mirrorshades.org

> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches)
> all the known marvell88sx problems have long ago been dealt with.

I'd dispute that. My testing appears to show major hot plug problems with the
marvell driver in snv_94.

1) I don't believe that any bug report has been filed, despite various
   e-mails about this topic.
2) The marvell88sx driver has not been changed recently, so if this
   problem actually exists, it is probably related to the sata framework.
3) Is this problem simply that when a device is hot plugged in, it is not
   immediately made available? If so, that was a design decision, and the
   behavior is configurable.

Hmm... it appears that my e-mail to the zfs list covering the problems has
disappeared. I will send it again and cross my fingers.

The basic problem I found was that with the Supermicro AOC-SAT2-MV8 card
(using the marvell chipset), drive removals are not detected consistently by
Solaris. There are at least four distinct possible reactions to a drive
removal:

- Ports 6 & 7 seem to be detected pretty much all the time. cfgadm reports
  an empty bay straight away.
- For ports 1-5, removal is not detected until you also remove a drive on
  port 6 or 7.
- I've also seen Solaris lock completely on removal of a drive on ports 1-5,
  and hang until the drive is re-inserted.
- I've seen Solaris crash completely a couple of times and never recover.

The last two appeared completely random; I couldn't find any pattern that
would predict when either behaviour would occur.

I've filed specifically for ZFS:

6735425 some places where 64bit values are being incorrectly accessed on
32bit processors

eric

On Aug 6, 2008, at 1:59 PM, Brian D. Horn wrote:
> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches)
> all the known marvell88sx problems have long ago been dealt with.
>
> However, I've said this before. Solaris on 32-bit platforms has problems and
> is not to be trusted. There are far, far too many places in the source
> code where a 64-bit object is either loaded or stored without any atomic
> locking occurring which could result in any number of wrong and bad behaviors.
> ZFS has some problems of this sort, but so does some of the low level 32-bit
> x86 code. The problem was reported long ago, but to the best of my knowledge
> the issues have not been addressed. Looking below it appears that nothing
> has been done for about 9 months.
>
> Here is the top of the bug report:
>
> Bug ID 6634371
> Synopsis Solaris ON is broken w.r.t. 64-bit operations on 32-bit processors
> State 1-Dispatched (Default State)
> Category:Subcategory kernel:other
> Keywords 32-bit | 64-bit | atomic
> Reported Against
> Duplicate Of
> Introduced In
> Commit to Fix
> Fixed In
> Release Fixed
> Related Bugs
> Submit Date 27-NOV-2007
> Last Update Date 28-NOV-2007