Hi all,

in our 196-OST cluster, the previously perfect distribution of files among the OSTs has not been working for about two weeks now. The fill level of most OSTs is between 57% and 62%, but some (~10) have risen to 94%. I'm trying to fix that by deactivating these OSTs on the MDT and finding and migrating data away from them, but it seems I'm not fast enough and it's an ongoing problem - I've just deactivated another OST at a threatening 67%.

Our qos_prio_free is at the default 90%.

Our OSTs' sizes are between 2.3TB and 4.5TB. We use striping level 1, so it would be possible to fill up an OST by just creating a 2TB file. However, I'm not aware of any such gigafiles (using Robinhood to get a picture of our file system). In addition, our users' behavior should not have changed recently.

In August, the entire cluster had filled up to almost 80% in a neatly even distribution among the OSTs, so we extended the cluster by more OSTs, migrating data to even out the filling between old and new ones. This also succeeded, and up to October there was no indication of anything not working. There are no error messages in the logs that would point to some OSTs being favored ;-)

So, what could be the cause of this misdistribution?

Regards,
Thomas
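(For reference, a rough sketch of the commands involved in watching the per-OST fill levels, checking the allocator threshold, and deactivating a full OST on the MDS. The filesystem name, the OST index OST00a3, and the device number are placeholders; on older 1.6 installations the qos_prio_free value may have to be read directly from /proc/fs/lustre/lov/*/qos_prio_free rather than via lctl get_param.)

  # Per-OST usage as seen from a client
  lfs df -h /lustre

  # Free-space weighting threshold used by the MDS object allocator
  lctl get_param lov.*.qos_prio_free

  # On the MDS: find the device number of the OSC for the full OST
  # and stop new object creation on it
  lctl dl | grep OST00a3
  lctl --device <devno> deactivate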
Strangely (although I'm sure it's not related), I have seen the exact same behavior on my Lustre cluster in the last month or so. I have also never seen this before, and to the best of my knowledge there has been no change in usage patterns. I'm running 1.6.7.2 on the servers.

Ron Jerome
National Research Council Canada
Andreas Dilger
2009-Oct-31 00:12 UTC
[Lustre-discuss] Bad distribution of files among OSTs
On 2009-10-30, at 12:07, Thomas Roth wrote:
> in our 196-OST cluster, the previously perfect distribution of files
> among the OSTs has not been working for about two weeks now.
> The fill level of most OSTs is between 57% and 62%, but some (~10) have
> risen to 94%. I'm trying to fix that by deactivating these OSTs
> on the MDT and finding and migrating data away from them, but it seems
> I'm not fast enough and it's an ongoing problem - I've just deactivated
> another OST at a threatening 67%.

Is this correlated to some upgrade of Lustre? What version are you using?

> Our qos_prio_free is at the default 90%.
>
> Our OSTs' sizes are between 2.3TB and 4.5TB. We use striping level 1, so
> it would be possible to fill up an OST by just creating a 2TB file.
> However, I'm not aware of any such gigafiles (using Robinhood to get a
> picture of our file system).

To fill the smallest OST from 60% to 90% would only need a few files that
total 0.3 * 2.3TB, or 690GB. One way to find such files is to mount the
full OSTs with ldiskfs and do "find /mnt/ost/O/0 -size +100G" to list the
object IDs that are very large. In bug 21244 I've written a small
program that dumps the MDS inode number from the specified objects. You
can then use 'debugfs -c -R "ncheck {list of inode numbers}" /dev/${mdsdev}'
on the MDS to find the pathnames of those files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
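(A rough sketch of that sequence, assuming the full OST can be mounted read-only as ldiskfs on its OSS. The mount point and device names are placeholders, and "objid2inode" merely stands in for the small program attached to bug 21244 - it is not a standard Lustre tool name.)

  # On the OSS: mount the full OST as ldiskfs and list unusually large objects
  mount -t ldiskfs -o ro /dev/ost_dev /mnt/ost
  find /mnt/ost/O/0 -size +100G

  # Map those object IDs to MDS inode numbers with the helper from bug 21244
  ./objid2inode /mnt/ost <objid> [<objid> ...]
  umount /mnt/ost

  # On the MDS: resolve the inode numbers to pathnames
  debugfs -c -R "ncheck <inode> <inode> ..." /dev/mds_dev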
Thanks, Andreas. Indeed we are running Lustre 1.6.7.2, on kernel 2.6.22, Debian Etch. But there was no upgrade involved; we moved from 1.6.7.1 to .2 in July.

The procedure you described has the slight disadvantage of having to take the OSTs in question offline. It would be nice if Robinhood did the same job on a live system - according to its manual, it can purge data on a per-OST basis if they become too full. However, I haven't yet found a way to extract just the info about these OSTs without deleting files.

In fact, I am in the process of collecting this info "manually": I have now quite a number of lists of users' data from running "lfs find --obd OST... /lustre/...", I just haven't run these lists through an "ls -lh" yet. Too busy moving the files instead of measuring them ;-)

Regards,
Thomas

--
--------------------------------------------------------------------
Thomas Roth
Gesellschaft für Schwerionenforschung
Planckstr. 1 - 64291 Darmstadt, Germany
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453   Fax: +49-6159-71 2986
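(For the record, one way such a per-OST size survey might look. The OST UUID and paths are placeholders - the UUIDs can be taken from "lfs df" or "lctl dl" - and this assumes GNU xargs/ls as found on typical Linux clients.)

  # Collect all files with at least one object on the suspect OST
  lfs find --obd lustre-OST00a3_UUID /lustre > /tmp/ost00a3.files

  # Show the largest of them (the 5th column of "ls -l" is the size in bytes)
  xargs -d '\n' ls -l < /tmp/ost00a3.files | sort -n -k5 | tail -20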
Another question: Could this situation, 10 full OSTs out of 200, lead to a significant drop in performance? Before, we could usually get the full 110MB/s or so over the 1Gbit/s ethernet lines of the clients. That had dropped to about 50%, but we did not find anything else odd apart from the fill levels of the OSTs.

Regards,
Thomas
Another question I had with regard to this is: how long have your OSSs been running without a reboot? Mine have been up for 148 days, which is probably longer than ever before. And now that I've said this, it just occurred to me that one of them was rebooted about three weeks ago and all the others have been up for almost 6 months. I don't know if this has any relevance, but it's the only thing I can think of that's different.

Ron.
Andreas Dilger
2009-Nov-02 06:33 UTC
[Lustre-discuss] Bad distribution of files among OSTs
On 2009-11-01, at 02:03, Thomas Roth wrote:
> Could this situation, 10 full OSTs out of 200,
> lead to a significant drop in performance?
> Before, we could usually get the full 110MB/s
> or so over the 1Gbit/s ethernet lines of the clients.
> That had dropped to about 50%, but we did not
> find anything else odd apart from the fill levels of
> the OSTs.

Yes, this is entirely possible. If the OST is very full, then it takes longer to find free blocks.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Kevin Van Maren
2009-Nov-03 09:05 UTC
[Lustre-discuss] Bad distribution of files among OSTs
Andreas Dilger wrote:
> On 2009-11-01, at 02:03, Thomas Roth wrote:
>> Could this situation, 10 full OSTs out of 200,
>> lead to a significant drop in performance?
>> Before, we could usually get the full 110MB/s
>> or so over the 1Gbit/s ethernet lines of the clients.
>> That had dropped to about 50%, but we did not
>> find anything else odd apart from the fill levels of
>> the OSTs.
>
> Yes, this is entirely possible. If the OST is very
> full, then it takes longer to find free blocks.

Longer to find them, and less likely to allocate large contiguous free blocks, so the disk IOs are smaller with more seeks.

Kevin
[ ... ]
> Could this situation, 10 full OSTs out of 200, lead to a
> significant drop in performance?

Likely so, the major reasons being:

* If the OST spans a significant percentage of some disks, the inner tracks of disks are significantly slower than the outer tracks. This applies to any filesystem that fills up a disk. My home PC's 1TB disk can do about 100-110MB/s through the (JFS) filesystem on the outer tracks and around 50-55MB/s on the inner ones.

* The "free list" can become significantly scattered, depending on the precise allocation patterns on the disks. If there are many rewrites of small files, that can be particularly bad, even for extent-based filesystems, which suffer from it especially badly because the same file size has to be split into many more extents, increasing metadata overhead.

The two above are likely the reason why there have been other reports that speed goes down as filesystems fill up:

https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007

"Performance degradation on xc2: After 6 months of production we lost half of the file system performance. The problem is under investigation by HP. We had a similar problem on xc1 which was due to fragmentation. The current solution for defragmentation is to recreate the file systems."

> Before, we could usually get the full 110MB/s or so over the
> 1Gbit/s ethernet lines of the clients. That had dropped to
> about 50%, but we did not find anything else odd apart from the
> fill levels of the OSTs.

It could just be that *all* the OSTs are filling up; it is impossible to avoid the inner-track issue on hard disks (except by limiting the top performance), and very difficult to avoid the scattering of the "free list". If you really care, some solutions are:

* Keep the filesystem no more than 60-70% full.

* Periodically reload the filesystem from backup after reformatting.

* Use just the outer 1/3 to 1/2 of the disks (which in recent years has been called "short stroking").

But looking at the absolute numbers there is something really wrong: 50MB/s out of 200 OSTs is ridiculously low. The problem is not that it is half of 110MB/s, and lower than it was before, but that it is very low in absolute terms. Each OST should be delivering at least 50MB/s with recent drives, even with mild inner-track or "free list" fragmentation issues. That you are getting 50MB/s may indicate that somehow your files are not being sliced across multiple OSTs. This can have several different reasons; IIRC there are a few discussions in the list archive on this.
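(One quick way to check that is to look at the striping of a few of the large or heavily read files. The paths below are placeholders, and the option spelling of lfs setstripe differs between Lustre versions - 1.6 also accepts an older positional form.)

  # Stripe count and OST objects of an existing file (count 1 = whole file on one OST)
  lfs getstripe /lustre/data/some_large_file

  # Default striping that new files created in a directory will inherit
  lfs getstripe /lustre/data

  # Example: stripe new files in a directory over 4 OSTs
  lfs setstripe -c 4 /lustre/data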