Hi everyone,

I just set up a Lustre file system on CentOS 5.5 with Lustre 1.8.5. There are three identical OSSs with four OSTs each.

After seeing fantastic write rates but low read rates, I ran the obdfilter-survey script to get a hint of what might be causing this.

Unfortunately, obdfilter-survey in case=disk mode freezes on two of my three OSSs at the write task of the "4 objs, 16 threads" line and leaves the system in an unstable state that requires a reboot. The third OSS runs through the script without problems. To rule out a problem in the system's setup, I booted one of the bad OSSs from the working OSS's disk, with the same faulty result. Creating a new file system on all OSTs of one of the problem OSSs did not do the trick either.

Any ideas what may cause this behavior? Thanks!

Robert
Hi,

This chapter of the manual may answer your question about slow read performance: 23.3.2 Write Performance Better Than Read Performance
http://wiki.lustre.org/manual/LustreManual20_HTML/LustreTroubleshooting.html#50548790_pgfId-1294879

Wojciech

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Hi Robert,

On Jan 4, 2011, at 23:14, robert wrote:
> Unfortunately obdfilter-survey in case=disk mode freezes on two of my
> three oss at the write task of the 4 objs, 16 threads line and leaves
> the system in an unstable state requiring a reboot.

Can you provide the contents of /var/log/messages from the time obdfilter-survey was running? If you are able to reproduce the crash, the output of sysrq+t (taken at the time of the crash) would be very useful for investigating the problem.

--------------------------------------------
Alexey Lyashkov
alexey_lyashkov at xyratex.com

______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.

Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.

Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
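The two things requested above (the syslog from the survey run and a SysRq+T task dump) could be collected roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, and it assumes a marker line was written with logger(1) just before the survey started. The /proc paths are the standard Linux SysRq interface; on a host that is already frozen, the dump may only be readable over a serial console or netconsole.

```shell
#!/bin/sh
# Sketch: print everything logged from the first line containing MARKER onward.
# Assumption: the marker was written just before the run, for example:
#   logger "obdfilter-survey start"
# Usage: log_since_marker MARKER [LOGFILE]
log_since_marker() {
    marker=$1
    logfile=${2:-/var/log/messages}
    # Print from the first line that contains the marker to the end of the file.
    awk -v m="$marker" 'found || index($0, m) { found = 1; print }' "$logfile"
}

# Sketch: ask the kernel for a dump of all task states, which is what
# SysRq+T does at the console. Needs root. The dump goes to the kernel log.
request_task_dump() {
    echo 1 > /proc/sys/kernel/sysrq   # enable magic SysRq functions
    echo t > /proc/sysrq-trigger      # 't' = dump the task list
}
```

A possible sequence would be: write the marker with logger, start the survey, call request_task_dump once the hang begins, then save the output of `log_since_marker "obdfilter-survey start"` to a file that can be attached to a reply.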
Have you checked Bugzilla? Does this sound similar to your scenario: https://bugzilla.lustre.org/show_bug.cgi?id=22749 ?

regards
p
On 01/04/2011 02:14 PM, robert wrote:
> Unfortunately obdfilter-survey in case=disk mode freezes on two of my
> three oss at the write task of the 4 objs, 16 threads line and leaves
> the system in an unstable state requiring a reboot.

Do you have panic_on_lbug set?

It's easy to LBUG Lustre by interrupting (Ctrl-C/SIGINT/Arrivederci Roma) a running obdfilter-survey. Using 1.8.4 on RHEL 5.5:

[root@oss21 obdfilter-survey]# nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey
Wed Jan 5 10:51:05 CST 2011 Obdfilter-survey for case=disk from oss21.ranger.tacc.utexas.edu
ost 6 sz 6291456K rsz 1024K obj 6 thr 6 write
^C

[root@oss21 ~]# dmesg
[87251.960393] Lustre: 11759:(echo_client.c:1409:echo_client_cleanup()) ASSERTION(eco->eco_refcount == 0) failed
[87251.960451] Lustre: 11759:(echo_client.c:1409:echo_client_cleanup()) LBUG()
[87251.960482] Pid: 11759, comm: lctl
...

See https://bugzilla.lustre.org/show_bug.cgi?id=21745

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304
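To check whether a hang matches the LBUG pattern shown above, the kernel log can be filtered for LBUG and failed-assertion lines. A minimal sketch follows; the function names are hypothetical, and the default path /proc/sys/lnet/panic_on_lbug is an assumption about where a Lustre 1.8 node exposes this tunable, so verify it on the OSS before relying on it.

```shell
#!/bin/sh
# Sketch: print LBUG and failed-ASSERTION lines from a kernel log on stdin.
# Usage: dmesg | find_lbug      or      find_lbug < saved_dmesg.txt
# Exit status is 0 if at least one such line was found.
find_lbug() {
    grep -E 'LBUG\(\)|ASSERTION\(.*\) failed'
}

# Sketch: print the current panic_on_lbug setting, if the file is readable.
# Assumption: the tunable lives at /proc/sys/lnet/panic_on_lbug on this release.
# Usage: panic_on_lbug_value [PATH]
panic_on_lbug_value() {
    f=${1:-/proc/sys/lnet/panic_on_lbug}
    [ -r "$f" ] && cat "$f"
}
```

If find_lbug prints nothing after a freeze, the hang is less likely to be the echo_client assertion above and more likely to come from elsewhere in the I/O stack.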
Thank you Wojciech, Alexey, Parinay and John,

it looks like the controller (Areca 1280) is the problem. In the logs I found a corresponding error message (arcmsr6:...). Googling this shows that a lot of users are having problems with Areca controllers under heavy load. The OSS crash can be reproduced with tiobench running 32 threads each on two clients (!). So the problem might not be directly related to obdfilter-survey; the survey may just trigger a totally different problem.

Wojciech, I read about this. The odd thing is that the speed per process stays constant even with more processes on the same or another client. The system is obviously capable of much more but shows a limit per reader process.

Alexey, I was not able to capture the sysrq+t output into a file in my test installation, and after discovering the arcmsr message I went down that path first.

Parinay, I tried tracing the behaviour with strace, but I get no output apart from the "attached" message, and obdfilter-survey does not continue afterwards.

John, panic_on_lbug is not set, and today I saw that the system freezes after 1-2 h even without interrupting obdfilter-survey.

I will do a test with a different controller in the next few days and will post log info if the problem persists.

Thanks again!
Robert
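For comparing the three OSSs, it may help to count how many arcmsr messages each host logged and for which controller instance. The sketch below is an illustration only: the function name is hypothetical, and it assumes the driver messages carry an "arcmsr<N>" tag, as the "arcmsr6:" message mentioned above suggests.

```shell
#!/bin/sh
# Sketch: count log lines per arcmsr controller instance (arcmsr0, arcmsr6, ...).
# Usage: arcmsr_counts [LOGFILE]
# Output: one "instance count" pair per line, sorted by instance name.
arcmsr_counts() {
    logfile=${1:-/var/log/messages}
    grep -o 'arcmsr[0-9]*' "$logfile" | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running this on each OSS after a load test (obdfilter-survey or tiobench) would show whether the two failing servers log controller errors that the working one does not.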