Shane,

Regarding the read performance issues you mentioned today, could you take a look at this bug, resolved in Lustre 1.4.7, to see if it resembles the problem you're seeing:

https://bugzilla.clusterfs.com/show_bug.cgi?id=10265

I've heard a few reports today at SC about slow read performance use cases in 1.4.6 and 1.4.7, which I'd like to get on top of immediately. If this is not a match for your case, please file a Bugzilla ticket and let me know the number. I'd like to hand this off promptly to a senior IO specialist.

Thanks,
Peter
On 2006-11-15, at 12:36, Peter Bojanic wrote:
> Shane,
>
> Regarding the read performance issues you mentioned today...

Some further news...

- Sandia (Redstorm) reports 100% CPU utilization during reads on both Linux and Cray systems; they're running Unicos 1.4, based on Lustre 1.4.6

===> Lee, is there a Bugzilla reported by Sandia that describes this issue?

- Indiana University reports a similar read performance drop-off but with Lustre 1.4.7.x. I've escalated the following Bugzilla to our IO team with the highest priority:

Bug 11194: simultaneous reads of same one stripe file show decrease in performance

===> mjmac, can you please make sure we've got Cluster Surveys for both Indiana University and for Rizzo at ORNL so engineering can understand the scope of these systems

Thanks,
Peter
In CGG we have made some more progress with our investigations into poor IO performance with small IO sizes using Lustre 1.4.6.x and 1.4.7.x, and we created a new bug on this issue, which I shall summarise here.

Our clients are RHEL3.4, kernel 2.4.21, and we believe the issue is due to ll_writepage being called too aggressively by the kernel vm daemons to release vm cache memory. This results in the cache being flushed within a few milliseconds of being written, so small IOs are not aggregated into large RPCs.

We had some success in tuning the vm tunables, but the problem comes back at some point later. There may be some later 2.4 vm patches we can apply, or we may just use a 2.6 kernel, which does not seem to suffer from the same problems.

J Belshaw
CGG Redhill

-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Peter Bojanic
Sent: 15 November 2006 17:36
To: Richard Shane Canon
Cc: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] Read performance issues at ORNL

Shane,

Regarding the read performance issues you mentioned today, could you take a look at this bug, resolved in Lustre 1.4.7, to see if it resembles the problem you're seeing:

https://bugzilla.clusterfs.com/show_bug.cgi?id=10265

I've heard a few reports today at SC about slow read performance use cases in 1.4.6 and 1.4.7, which I'd like to get on top of immediately. If this is not a match for your case, please file a Bugzilla ticket and let me know the number. I'd like to hand this off promptly to a senior IO specialist.

Thanks,
Peter

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
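A rough way to check whether a client shows the small-IO behaviour described above is to time the same amount of data written with small and with large block sizes. The sketch below is only an illustration and not CGG's test procedure: the `write_test` name, the sizes, and the `/mnt/lustre` directory in the usage note are all assumptions, and `date +%s` only gives one-second resolution, so the total size needs to be large enough to take several seconds.

```shell
#!/bin/sh
# Sketch only: the function name, sizes and target directory are illustrative.

# Write TOTAL_KB kilobytes into DIR using BS_KB-kilobyte writes,
# flush with sync, remove the file, and print the elapsed seconds.
write_test() {
    dir=$1; bs_kb=$2; total_kb=$3
    file="$dir/iotest.$$"
    count=$((total_kb / bs_kb))
    start=$(date +%s)
    dd if=/dev/zero of="$file" bs="${bs_kb}k" count="$count" 2>/dev/null || return 1
    sync
    end=$(date +%s)
    rm -f "$file"
    echo "bs=${bs_kb}k total=${total_kb}k seconds=$((end - start))"
}
```

For example, `write_test /mnt/lustre 4 1048576` followed by `write_test /mnt/lustre 1024 1048576` writes 1 GB first in 4k and then in 1M chunks. A large gap between the two timings would be consistent with small writes not being aggregated, but it does not by itself identify the cause.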
On Wed, 2006-11-15 at 15:56 -0500, Peter Bojanic wrote:
> - Sandia (Redstorm) reports 100% CPU utilization during reads on both
> Linux and Cray systems; they're running Unicos 1.4, based on Lustre
> 1.4.6
>
> ===> Lee, is there a Bugzilla reported by Sandia that describes this
> issue?

Not that I know of. I believe we are supposed to submit one relative to the block side of Rose.

--Lee
On Thu, 2006-11-16 at 05:10 -0700, Lee Ward wrote:
> Not that I know of. I believe we are supposed to submit one relative to
> the block side of Rose.

*black
Peter Bojanic wrote:
> I've heard a few reports today at SC about slow read performance use
> cases in 1.4.6 and 1.4.7, which I'd like to get on top of immediately.

I think it is a well known (?) issue that with the default debug mask (/proc/sys/lnet/debug) on the servers, read performance suffers. I don't think any work into narrowing that down or measuring the impact has been done yet, but if the data would be interesting, I think this could be done quite quickly.

Nic
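For anyone wanting to measure the effect described above, the debug mask can be read, cleared for the duration of one test, and then restored through the /proc file named in the message. This is a minimal sketch, assuming root access on the server and assuming that writing 0 turns debug logging off; the helper names are made up for illustration and are not part of Lustre.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not Lustre tools.
LNET_DEBUG=${LNET_DEBUG:-/proc/sys/lnet/debug}   # path as given in the message above

# Print the current debug mask.
show_debug_mask() {
    cat "${1:-$LNET_DEBUG}"
}

# Run a command with the debug mask cleared, then restore the saved mask.
# Usage: with_debug_cleared MASK_FILE COMMAND [ARGS...]
with_debug_cleared() {
    mask_file=$1; shift
    saved=$(cat "$mask_file") || return 1
    echo 0 > "$mask_file" || return 1
    "$@"
    rc=$?
    echo "$saved" > "$mask_file"
    return $rc
}
```

Running the same read test once normally and once as `with_debug_cleared /proc/sys/lnet/debug <read test command>` on the servers would give a first indication of how much the default mask costs.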
Here at LLNL we know of at least one cause of 100% CPU utilization observed on LNET gigE<->elan routers during a heavy read load. We tracked this issue back to a needed tuning for our e1000 network cards to prevent RX descriptor overflow. While we mainly saw this on our router nodes, I see no reason it couldn't manifest itself on a heavily loaded TCP-attached client or server. The details of this bug, and a debug patch which will log the RX descriptor overflows if they are occurring, can be found in bug 10077.

https://bugzilla.lustre.org/show_bug.cgi?id=10077

At LLNL we now make a habit of maxing out the RX descriptor queue on any Lustre node using an e1000 adaptor. Add the module option 'RxDescriptors=4096' to your modprobe.conf or use 'ethtool -G ethX rx 4096' to tune it on the fly. I have not yet investigated whether the tg3 or bcm5700 drivers suffer from similar issues. They may.

--
Good luck,
Brian

> On 2006-11-15, at 12:36, Peter Bojanic wrote:
> > Regarding the read performance issues you mentioned today...
>
> Some further news...
>
> - Sandia (Redstorm) reports 100% CPU utilization during reads on both
> Linux and Cray systems; they're running Unicos 1.4, based on Lustre
> 1.4.6
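The on-the-fly tuning above can be scripted so that the ring is raised to whatever maximum the driver reports rather than to a hard-coded 4096. This is a sketch under assumptions: the function names are invented for illustration, `eth0` in the usage note is a placeholder, and the parsing relies on the usual "Pre-set maximums" / "Current hardware settings" layout printed by `ethtool -g`.

```shell
#!/bin/sh
# Sketch only: function names are illustrative; interface names are placeholders.

# Read `ethtool -g IFACE` output on stdin and print "MAX CURRENT" RX ring sizes.
rx_ring_sizes() {
    awk '
        /^Pre-set maximums/          { sect = "max" }
        /^Current hardware settings/ { sect = "cur" }
        $1 == "RX:"                  { val[sect] = $2 }
        END                          { print val["max"], val["cur"] }
    '
}

# Raise the RX ring of IFACE to the reported maximum if it is currently smaller.
maximize_rx_ring() {
    iface=$1
    sizes=$(ethtool -g "$iface" | rx_ring_sizes)
    max=${sizes% *}; cur=${sizes#* }
    [ -n "$max" ] && [ -n "$cur" ] || return 1
    if [ "$cur" -lt "$max" ]; then
        ethtool -G "$iface" rx "$max"
    fi
}
```

`maximize_rx_ring eth0` would be run as root on each node. The change does not survive a driver reload, so the persistent form is still the module option, normally written in modprobe.conf as `options e1000 RxDescriptors=4096`.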
Lee,

On 2006-11-16, at 8:10, Lee Ward wrote:
> Not that I know of. I believe we are supposed to submit one relative to
> the block side of Rose.

Would you please let Alex know what bug number this is? I'd like to make sure his eyeballs are on it.

Thanks,
Bojanic
It's not, yet. In theory Steve Monk will report it for the Rose machine. It seemed easier that way and they see the issue as well...

--Lee

On Tue, 2006-11-21 at 12:56 -0400, Peter Bojanic wrote:
> Would you please let Alex know what bug number this is? I'd like to
> make sure his eyeballs are on it.
Thanks for clarifying, Lee.

Steve, is there already a Bugzilla for your IO performance issue? Is it possibly https://bugzilla.clusterfs.com/show_bug.cgi?id=10112?

Please advise.

Thanks,
Bojanic

On 2006-11-21, at 13:43, Lee Ward wrote:
> It's not, yet. In theory Steve Monk will report it for the Rose machine.
> It seemed easier that way and they see the issue as well...
Yes, I assume since the bug was reopened yesterday that we are going to use 10112.

Thanks,
Steve

> -----Original Message-----
> From: Peter Bojanic [mailto:pbojanic@clusterfs.com]
> Sent: Tuesday, November 21, 2006 11:04 AM
> To: Monk, Stephen; Ward, Lee
> Cc: Alex Thomas; Richard Shane Canon; Michael MacDonald; lustre-discuss@clusterfs.com
> Subject: Re: Read performance issues at ORNL
>
> Steve, is there already a Bugzilla for your IO performance issue? Is it
> possibly https://bugzilla.clusterfs.com/show_bug.cgi?id=10112?
Brian Behlendorf
2006-Nov-21 11:23 UTC
[Lustre-discuss] RE: Read performance issues at ORNL
Please make the bug public if you are looking for community advice; I personally can't seem to access bug #10112.

Thanks,
Brian

> Yes I assume since the bug was reopened yesterday that we are going to
> use 10112.
>
> Thanks,
> Steve
Alex,

Can your team please take this as an interrupt? Determine root cause, then estimate a completion date to be included in a release with the product management group.

Thanks,
Bojanic

On 2006-11-21, at 14:10, Monk, Stephen wrote:
> Yes I assume since the bug was reopened yesterday that we are going to
> use 10112.
OK, we'll take care.

thanks, Alex

>>>>> Peter Bojanic (PB) writes:

 PB> Alex,
 PB> Can your team please take this as an interrupt. Determine root cause
 PB> then estimate a completion date to be included in a release with the
 PB> product management group.
I've requested permission from the customer to make this ticket public.

Bojanic

On 2006-11-21, at 14:23, Brian Behlendorf wrote:
> Please make the bug public if you are looking for community advice; I
> personally can't seem to access bug #10112.