Hi all,

we are suffering from a severe metadata performance degradation on our 1.8.4 cluster and are pretty clueless.

- We moved the MDT to new hardware, since the old one was failing.
- We increased the size of the MDT with 'resize2fs' (then mounted it and saw all the files).
- We found the performance of the new MDS dreadful.
- We restarted the MDT on the old hardware with the failed RAID controller replaced, but without doing anything to the OSSs or clients. The machine crashed three minutes after recovery was over.
- We moved back to the new hardware, but the system was now pretty messed up: persistent "still busy with N RPCs" and some "going back to sleep" messages. (By the way, is there no way to find out what these RPCs are, and how to kill them? Of course I wouldn't mind switching off some clients or even rebooting some OSSs, if I only knew which ones...)
- We shut down the entire cluster, ran writeconf, and restarted without any client mounts: worked fine.
- We mounted Lustre and tried to "ls" a directory with 100 files: it takes several minutes(!)
- Being patient and then trying the same on a second client: it takes milliseconds.

I have done complete shutdowns before, most recently to upgrade from 1.6 to 1.8, that time without writeconf and without performance loss. Before that, we did one to change the IPs of all servers (moving into a subnet), with writeconf, but I have no recollection of the metadata behavior afterwards.
It is clear that after writeconf some information has to be regenerated, but this is really extreme. Is it also normal?

The MDT now behaves more like an xrootd master which makes first contact with its file servers and has to read in the entire database (would be nice to have in Lustre, to regenerate the MDT in case of disaster ;-) ).
Which caches are being filled now when I ls through the cluster? May I expect the MDT to explode once it has learned about a certain percentage of the system? ;-) I mean, we have 100 million files now and the current MDT hardware has just 32GB memory...
In any case this is not the Lustre behavior we are used to.

Thanks for any hints,
Thomas
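
The "minutes on the first ls, milliseconds on the second" observation can be put into numbers with a small timing script. The following is only a sketch: the function names `time_listing` and `compare_passes` are made up for illustration, and it assumes that the slow `ls` was one that also stats each entry (as `ls -l` or a colorized `ls` does), which the message does not state.

```python
import os
import time


def time_listing(path):
    """Time one `ls -l`-style pass: read the directory, then stat each entry.

    Returns (number_of_entries, elapsed_seconds).
    """
    start = time.perf_counter()
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # One stat per entry, without following symlinks. Assumption:
            # this per-file attribute lookup is the expensive step.
            entry.stat(follow_symlinks=False)
            count += 1
    return count, time.perf_counter() - start


def compare_passes(path, passes=2):
    """List the same directory several times and return all timings.

    A large gap between the first and later passes points at a cold cache
    somewhere along the path, not at raw disk speed.
    """
    return [time_listing(path) for _ in range(passes)]
```

Calling `compare_passes("/some/lustre/dir")` (a placeholder path) on a directory that has not been touched since the restart, and then again from a second client, would give the cold and warm figures side by side.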

What is the underlying disk? Did the hardware/RAID configuration change when you switched hardware?
The 'still busy' message is a bug; it may be fixed in 1.8.5.
cliffw

On Sat, Apr 2, 2011 at 1:01 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com

Hi Cliff,

no, the configuration as such did not change. The hardware is quite different, though. The old box had a RAID10 on 16 built-in Raptor 150GB SATA disks, the new one a RAID10 on 24 Cheetah 300GB SAS disks in a Fibre Channel-attached external enclosure.
Actually we are more concerned that the 48 AMD cores of the new box might not have been the best idea.

But at the moment, the system is running fine and fast again! After the last MDT restart, I started several ls jobs crawling through the entire cluster. Obviously, after a writeconf on all servers, Lustre really has to "learn" again about the whereabouts of its files.
And I found experimentally that it is about the knowledge of the OSTs: we tried our very old, repaired hardware as MDT while copying the MDT to yet another, third type of machine. The effect of "first very slow, then very fast 'ls'" was there. Then we shut down and started the third hardware and tried on new directories: same effect. We tried on some already checked directories: very fast. So using the old hardware had refreshed the memory of the OSTs about these directories.

All of this is to be expected to some degree, but the difference of minutes vs. milliseconds is quite astonishing. Ah well, this cluster is also full to the brim, and the last time we had to writeconf the servers, there were certainly 20%-30% fewer files.

Cheers,
Thomas

On 04/04/2011 06:11 AM, Cliff White wrote:
> What is the underlying disk, did that hardware/RAID config change
> when you switched hardware?
> The 'still busy' message is a bug, may be fixed in 1.8.5
> cliffw
>
> On Sat, Apr 2, 2011 at 1:01 AM, Thomas Roth <t.roth at gsi.de> wrote:
> > [...]

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Professor Dr. Dr. h.c. Horst Stöcker, Dr. Hartmut Eickhoff
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
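
The "ls jobs crawling through the entire cluster" mentioned in the follow-up could look roughly like the sketch below. This is an assumption about what such a job does, not the script actually used at GSI: the function name `crawl` is hypothetical, and it simply walks a tree and stats every entry once, the way a recursive `ls -lR` would.

```python
import os


def crawl(root):
    """Walk `root` and stat every entry once, like a recursive `ls -lR`.

    Returns (directories_read, entries_statted, errors).
    """
    dirs = entries = errors = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as it:
                dirs += 1
                for entry in it:
                    try:
                        # Touch the attributes of each entry; do not follow
                        # symlinks, so the walk stays inside the tree.
                        entry.stat(follow_symlinks=False)
                        entries += 1
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                    except OSError:
                        errors += 1
        except OSError:
            # Unreadable or vanished directory: count it and move on.
            errors += 1
    return dirs, entries, errors
```

Running one such crawl per top-level directory, each from a different client, would correspond to the "several ls jobs" in the message. With around 100 million files a full pass takes a long time, so logging the counters periodically would be a sensible addition.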