[ Sending this here, as I've publicly complained about this bug on the ZFS list previously, and there have been prior threads related to the fix hitting OpenSolaris ]

For those of you who have been suffering marvell device resets and hung I/Os on Sol 10 U4 with NCQ enabled, you should talk to your Sun support folks about IDR137601-02. We have one server that could consistently reproduce the problem, and it's been running since Friday with NCQ enabled and no hangs after applying the IDR. We had tested previous IDRs that did _not_ fix the problem, but so far, so good...

While we had a rough start with support on this issue, kudos to the engineers for finally tracking down this heisenbug (fingers crossed...), and here's hoping it makes it into a public patch soon.

-- Carson
Carson Gaspar wrote:
> [ Sending this here, as I've publicly complained about this bug on the
> ZFS list previously, and there have been prior threads related to the
> fix hitting OpenSolaris ]
>
> For those of you who have been suffering marvell device resets and hung
> I/Os on Sol 10 U4 with NCQ enabled, you should talk to your Sun support
> folks about IDR137601-02. We have one server that could consistently
> reproduce the problem, and it's been running since Friday with NCQ
> enabled and no hangs after applying the IDR. We had tested previous IDRs
> that did _not_ fix the problem, but so far, so good...

That is good to hear.

> While we had a rough start with support on this issue, kudos to the
> engineers for finally tracking down this heisenbug (fingers crossed...),
> and here's hoping it makes it into a public patch soon.

I (we) appreciate your kind words and hope that you have no further issues. By the way, all the additional issues were related to hardware error cases such as sector I/O errors, which is why they were so difficult to track down and reproduce.

Regards,
Lida
Carson,

Are you sure about the patch number? I have been unable to get a French support engineer to find it. You are mentioning an "official" IDR patch for Solaris 10 x86, update 4, aren't you?

Thanks.
Xavier.

This message posted from opensolaris.org
Xavier Canehan wrote:
> Carson,
> are you sure about the patch number? I have been unable to get a French support guy to find it.
> You are mentioning an "official" IDR patch for Solaris 10 x86, update 4, aren't you?

We received it from Sun Support for our Sol 10 U4 X4500. Not sure how "official" an IDR can be... it may have been custom-made just for us. You may want to try just escalating the marvell device reset errors; you should get to the same engineer eventually...

-- Carson
Carson Gaspar wrote:
> Xavier Canehan wrote:
>> Carson,
>> are you sure about the patch number? I have been unable to get a French support guy to find it.
>> You are mentioning an "official" IDR patch for Solaris 10 x86, update 4, aren't you?
>
> We received it from Sun Support for our Sol 10 U4 X4500. Not sure how
> "official" an IDR can be... it may have been custom-made just for us.
> You may want to try just escalating the marvell device reset errors -
> you should get to the same engineer eventually...

If you have a case open with Sun Service, please refer to this Sun Alert and ask for the IDR. The engineer will have to escalate your case to be able to get the IDR from the CR owner.

http://sunsolve.sun.com/search/document.do?assetkey=1-66-233341-1

Sun Alert: 233341: Solaris 10 x86 Systems Using Marvell HBA Controllers May Experience Panic or Hang

Status:
<SNIP>
4. Workaround
Binary relief is available through normal support channels for customers that have patch 125205-07 installed.
5. Resolution
A final resolution is pending completion.
<SNIP>

Cheers,
Henrik
When we installed the Marvell driver patch 125205-07 on our X4500 a few months ago and it started crashing, Sun support just told us to back out that patch. The system has been stable since then.

We are still running Solaris 10 11/06 on that system. Is there an advantage to using 125205-07 and the IDR you mention compared to just not using NCQ? Better performance? If so, how much better?

Thanks
Doug wrote:
> When we installed the Marvell driver patch 125205-07 on our X4500 a few months ago and it started crashing, Sun support just told us to back out that patch. The system has been stable since then.
>
> We are still running Solaris 10 11/06 on that system. Is there an advantage to using 125205-07 and the IDR you mention compared to just not using NCQ? Better performance? If so, how much better?

Everything depends on your I/O workload. We are updating thousands of RRD files, so it's _extremely_ random, relatively small I/Os. In our case, NCQ is a huge win (about a 25% improvement, as I recall). If you do mostly sequential I/O, it will probably make only a small difference.

-- Carson