Paisit Wongsongsarn
2009-Jun-17 03:26 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi Jose,

Would enabling the SSD (cache device usage) only for metadata help? This assumes that you have a read-optimized SSD in place. I have never tried it myself, but it is worth a try by just turning it on.

regards,
Paisit W.

Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
>
> We are proposing a 7410 Openstore appliance...
>
> He is claiming that certain operations like find, even if run from
> the Linux clients on such an NFS mountpoint, take significantly more
> time than if the same NFS share were provided by other NAS vendors
> like NetApp...
>
> Can someone confirm if this is really a problem for ZFS filesystems?...
>
> Is there any way to tune it?...
>
> We thank any input
>
> Best regards
>
> Jose

--
+---------*---------*---------*---------*---------*---------*---------+
Paisit Wongsongsarn
Regional Technical Lead (DMA & PFT)
Storage Practice, Sun Microsystems Asia South
DID: +65 6-239-2626, Mobile: +65 9-154-1717, Email: paisit at sun.com
Blogs: blogs.sun.com/paisit
+---------*---------*---------*---------*---------*---------*---------+
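On a plain Solaris/ZFS system the equivalent knobs are the per-dataset cache properties; a minimal sketch, assuming a pool named tank with a spare SSD available as a cache device (pool, dataset and device names here are only examples, and on the 7410 appliance itself this is driven from the management interface rather than the command line):

    # Attach a read-optimized SSD as an L2ARC cache device.
    zpool add tank cache c1t5d0

    # Make only metadata (dnodes, indirect blocks, directory entries)
    # eligible for the L2ARC on the busy dataset, so it is not pushed
    # out by streaming file data.
    zfs set secondarycache=metadata tank/files

    # primarycache controls the in-memory ARC the same way and is
    # normally left at its default of "all".
    zfs get primarycache,secondarycache tank/files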
Alan Hargreaves
2009-Jun-17 03:49 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Another question worth asking here is: is a find over the entire filesystem something that they would expect to execute regularly enough that its run time would have a business impact?

Part of the problem that I come across with people "benchmarking" is that they don't benchmark the operations that are critical to the business. Sure, we can spend a lot of time examining the issue and then addressing it; but would that actually address a real business concern, or just an "itch"?

Regards,
Alan Hargreaves

Paisit Wongsongsarn wrote:
> Hi Jose,
>
> Would enabling the SSD (cache device usage) only for metadata help?
> This assumes that you have a read-optimized SSD in place.
>
> I have never tried it myself, but it is worth a try by just turning it on.
>
> regards,
> Paisit W.
> [...]

--
Alan Hargreaves - http://blogs.sun.com/tpenta
Staff Engineer (Kernel/VOSJEC/Performance)
Asia Pacific/Emerging Markets
Sun Microsystems
Roch Bourbonnais
2009-Jun-17 10:33 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On 16 June 2009, at 19:55, Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
>
> We are proposing a 7410 Openstore appliance...
>
> He is claiming that certain operations like find, even if run from
> the Linux clients on such an NFS mountpoint, take significantly more
> time than if the same NFS share were provided by other NAS vendors
> like NetApp...

10%, 100%, 10000% or more? Knowing the magnitude helps diagnostics.

What kind of pool is this? This should be a read performance test: the pool type and the disks' rotational speed both affect the resulting performance.

> Can someone confirm if this is really a problem for ZFS filesystems?...

Nope.

> Is there any way to tune it?...
>
> We thank any input
>
> Best regards
>
> Jose
robert ungar
2009-Jun-17 11:21 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Jose,

I hope our openstorage experts weigh in on 'is this a good idea'; it sounds scary to me, but I'm overly cautious anyway.

I did want to raise the question of the client's other expectations for this opportunity: what are the intended data protection requirements, how will they back up and recover the files, do they intend to apply replication in support of a disaster recovery plan, and are the intended data protection schemes practical?

The other area that jumps out at me is concurrent access: in addition to the 'find' by 'many' clients, does the client have any performance requirements that must be met to ensure the solution is successful? Does any of the above have to happen at the same time?

I'm not in a position to evaluate these considerations for this opportunity, simply sharing some areas that, often enough, are overlooked as we address the chief complaint.

Regards,
Robert

> Jose Martins wrote:
>>
>> Hello experts,
>>
>> IHAC that wants to put more than 250 Million files on a single
>> mountpoint (in a directory tree with no more than 100 files on each
>> directory).
>> [...]

--
****************************
Robert C. Ungar ABCP
Professional Services Delivery
Storage Solutions Specialist
Telephone 585-598-9020
Sun Microsystems
345 Woodcliff Drive
Fairport, NY 14450
www.sun.com/storage
Eric D. Mudama
2009-Jun-17 16:03 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Wed, Jun 17 at 13:49, Alan Hargreaves wrote:
> Another question worth asking here is: is a find over the entire
> filesystem something that they would expect to execute regularly
> enough that its run time would have a business impact?

Exactly. That's such an odd business workload on 250,000,000 files that there isn't likely to be much of a shortcut other than throwing tons of spindles (or SSDs) at the problem, and/or having tons of memory.

If the finds are just by name, that's easy for the system to cache, but if you're expecting to run something against the output of find with -exec to parse/process 250M files on a regular basis, you'll likely be severely IO bound. Almost to the point of arguing for something like Hadoop or another form of distributed map/reduce on your dataset with a lot of nodes, instead of a single storage server.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
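To make the distinction concrete (an illustrative sketch, not from the original post; the paths are made up): a name-only find is driven by directory reads, which cache well, while anything that has to stat or open every file turns into per-file I/O across all 250M files.

    # Name-only search: needs only directory entries.
    find /mnt/bigshare -name '*.xml' > /tmp/xml-files.txt

    # Per-file processing: stats every entry (-newer) and opens every
    # match (-exec), so it is bound by random I/O on the server.
    find /mnt/bigshare -type f -newer /tmp/last-run -exec md5sum {} + > /tmp/sums.txt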
Louis Romero
2009-Jun-17 16:08 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Jose,

I believe the problem is endemic to Solaris. I have run into similar problems doing a simple find(1) in /etc. On Linux, a find operation in /etc is almost instantaneous; on Solaris, it has a tendency to spin for a long time.

I don't know what their use of find might be, but running updatedb on the Linux clients (with the NFS file system mounted, of course) and using locate(1) will give you a work-around on the Linux clients.

Caveat emptor: there is a staleness factor associated with this solution, as any new files dropped in after updatedb runs will not show up until the next updatedb run.

HTH

louis

On 06/16/09 11:55, Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
> [...]
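A minimal sketch of that workaround on a Debian client using mlocate (the mount point, database path and cron schedule are only examples): build a private locate database rooted at the NFS mount and query that instead of walking the tree with find.

    # Index the NFS mount into its own database (mlocate's default
    # configuration usually prunes NFS, hence the explicit root).
    updatedb -U /mnt/bigshare -o /var/lib/mlocate/bigshare.db

    # Query by name without touching the 250M files.
    locate -d /var/lib/mlocate/bigshare.db 'invoice-2009'

    # Refresh nightly from cron; results are only as fresh as this run.
    # 0 2 * * * root updatedb -U /mnt/bigshare -o /var/lib/mlocate/bigshare.db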
Dirk Nitschke
2009-Jun-17 17:38 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi Louis!

Solaris /usr/bin/find and Linux (GNU) find work differently! I experienced dramatic runtime differences some time ago. The reason is that Solaris find and GNU find use different algorithms.

GNU find uses the st_nlink ("number of links") field of the stat structure to optimize its work. Solaris find does not use this kind of optimization, because the meaning of "number of links" is not well defined and is file system dependent.

If you are interested, take a look at, say,

CR 4907267 link count problem in hsfs
CR 4462534 RFE: pcfs should emulate link counts for directories

Dirk

On 17.06.2009, at 18:08, Louis Romero wrote:

> Jose,
>
> I believe the problem is endemic to Solaris. I have run into
> similar problems doing a simple find(1) in /etc. On Linux, a find
> operation in /etc is almost instantaneous; on Solaris, it has a
> tendency to spin for a long time.
>
> I don't know what their use of find might be, but running updatedb
> on the Linux clients (with the NFS file system mounted, of course)
> and using locate(1) will give you a work-around on the Linux clients.
> [...]

--
Sun Microsystems GmbH      Dirk Nitschke
Nagelsweg 55               Storage Architect
20097 Hamburg              Phone: +49-40-251523-413
Germany                    Fax: +49-40-251523-425
http://www.sun.de/         Mobile: +49-172-847 62 66
                           Dirk.Nitschke at Sun.COM
-----------------------------------------------------------------------
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten - Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering
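The heuristic is easy to see from the shell (example commands, not part of Dirk's message): on filesystems with classic UNIX semantics a directory's link count is 2 plus its number of subdirectories, which lets GNU find stop stat()ing remaining entries once it has found them all; GNU find's -noleaf option turns that assumption off on filesystems where it does not hold.

    # Classic semantics: link count = 2 + number of subdirectories.
    mkdir -p /tmp/demo/a /tmp/demo/b /tmp/demo/c
    stat -c '%h %n' /tmp/demo        # prints: 5 /tmp/demo

    # Tell GNU find not to trust directory link counts (useful on
    # CD-ROM, FAT, or NFS exports of filesystems without the convention).
    find /mnt/bigshare -noleaf -name '*.xml'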
Casper.Dik at Sun.COM
2009-Jun-17 18:51 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
> Hi Louis!
>
> Solaris /usr/bin/find and Linux (GNU) find work differently! I
> experienced dramatic runtime differences some time ago. The reason is
> that Solaris find and GNU find use different algorithms.
>
> GNU find uses the st_nlink ("number of links") field of the stat
> structure to optimize its work. Solaris find does not use this kind
> of optimization, because the meaning of "number of links" is not well
> defined and is file system dependent.

But that's not what is under discussion: apparently the *same* clients get different performance from an OpenStorage system vs. a NetApp system.

I think we need to know much more, and I think OpenStorage can give the information you need.

(Yes, I did have problems because of GNU find's shortcuts: they don't work all the time.)

Casper
Louis Romero
2009-Jun-17 19:09 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
hi Dirk,

How might we explain running find on a Linux client, against an NFS-mounted file system on the 7000, taking significantly longer (i.e. performance behaving as though the command was run from Solaris)?

I'm not sure find would have the intelligence to differentiate between file system types and run different sections of code based upon what it finds.

louis

On 06/17/09 11:38, Dirk Nitschke wrote:
> Hi Louis!
>
> Solaris /usr/bin/find and Linux (GNU) find work differently! I
> experienced dramatic runtime differences some time ago. The reason is
> that Solaris find and GNU find use different algorithms.
>
> GNU find uses the st_nlink ("number of links") field of the stat
> structure to optimize its work. Solaris find does not use this kind
> of optimization, because the meaning of "number of links" is not well
> defined and is file system dependent.
>
> If you are interested, take a look at, say,
>
> CR 4907267 link count problem in hsfs
> CR 4462534 RFE: pcfs should emulate link counts for directories
>
> Dirk
> [...]
Joerg Schilling
2009-Jun-17 19:37 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Dirk Nitschke <Dirk.Nitschke at Sun.COM> wrote:

> Solaris /usr/bin/find and Linux (GNU) find work differently! I
> experienced dramatic runtime differences some time ago. The reason is
> that Solaris find and GNU find use different algorithms.

Correct: Solaris find honors the POSIX standard, GNU find does not :-(

> GNU find uses the st_nlink ("number of links") field of the stat
> structure to optimize its work. Solaris find does not use this kind
> of optimization, because the meaning of "number of links" is not well
> defined and is file system dependent.

GNU find makes illegal assumptions about the value of st_nlink for directories. These assumptions are derived from implementation specifics found in historic UNIX filesystem implementations, but there is no guarantee of the assumed behavior.

> If you are interested, take a look at, say,
>
> CR 4907267 link count problem in hsfs

Hsfs just shows you the numbers set up by the ISO-9660 filesystem creation utility. If you use a recent original mkisofs (like the one that has come with Solaris for the last 1.5 years), you get the same behavior for hsfs and UFS. The related feature was implemented in mkisofs in October 2006.

If you use "mkisofs" from one of the non-OSS-friendly Linux distributions like Debian, RedHat, Suse, Ubuntu or Mandriva, you use a 5-year-old mkisofs version, and thus the values of st_nlink for directories are random numbers.

The problems caused by programs that ignore POSIX rules have been discussed several times on the POSIX mailing list. To solve the issue, I proposed several times to introduce a new pathconf() call that allows one to ask whether a directory has historic UFS semantics for st_nlink. Hsfs knows whether the filesystem was created by a recent mkisofs and thus would be able to give the right return value. NFS clients would need an RPC that allows them to retrieve the value from the exported filesystem on the server side.

> On 17.06.2009, at 18:08, Louis Romero wrote:
>
>> Jose,
>>
>> I believe the problem is endemic to Solaris. I have run into
>> similar problems doing a simple find(1) in /etc. On Linux, a find
>> operation in /etc is almost instantaneous; on Solaris, it has a

If you ignore standards you may get _apparent_ speed. On Linux this speed is usually bought by giving up correctness.

>> tendency to spin for a long time. I don't know what their use of
>> find might be, but running updatedb on the Linux clients (with the
>> NFS file system mounted, of course) and using locate(1) will give you
>> a work-around on the Linux clients.

With NFS, things are even more complex and in principle similar to the hsfs case, where the OS filesystem implementation just shows you the values set up by mkisofs. On an NFS client, you see the numbers that have been set up by the NFS server, but you don't know what filesystem type is underneath on the NFS server.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Cor Beumer - Storage Solution Architect
2009-Jun-18 10:12 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi Jose,

Well, it depends on the total size of your zpool and how often these files are changed.

I was at a customer, a huge internet provider, who had 40x X4500 with standard Solaris and using ZFS. All the machines were equipped with 48x 1TB disks. The machines were used to provide the email platform, so all the user email accounts were on the system. This also meant millions of files in one zpool.

What they noticed on the X4500 systems was that when the zpool became filled up to about 50-60%, the performance of the system dropped enormously. They claim this has to do with fragmentation of the ZFS filesystem. So we tried putting in an S7410 system with about the same disk config, 44x 1TB SATA but with 4x 18GB WriteZilla (in a stripe), and we were able to get much, much more I/O out of the system than the comparable X4500. However, they put it in production for a couple of weeks, and as soon as the ZFS filesystem came into the range of about 50-60% full they saw the same problem: the performance dropped enormously.

NetApp has the same problem with their WAFL filesystem (they also tested this); however, NetApp does provide a defragmentation tool for it. This is also NOT a nice solution, because you have to run it, manually or scheduled, and it takes a lot of system resources, but it helps.

I hear Sun is denying that we have this problem in ZFS, and therefore that we don't need a kind of defragmentation mechanism; however, our customer experiences are different............

Maybe it is good for the ZFS group to look at this (potential) problem. The customer I am talking about is willing to share their experiences with Sun engineering.

greetings,

Cor Beumer

Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
> [...]

--
Cor Beumer
Data Management & Storage
Sun Microsystems Nederland BV
Saturnus 1
3824 ME Amersfoort The Netherlands
Phone +31 33 451 5172
Mobile +31 6 51 603 142
Email Cor.Beumer at Sun.COM
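Not part of Cor's message, but for anyone wanting to watch for the same symptom: the fill level and the per-vdev write pattern can be tracked with standard commands as the pool grows (the pool name tank is just an example).

    # Overall capacity; the complaints in this thread start well before
    # the pool is actually full.
    zpool list tank

    # Per-vdev bandwidth and IOPS, sampled every 5 seconds; a pool that
    # has to hunt for contiguous free space tends to show more, smaller
    # writes per vdev for the same workload.
    zpool iostat -v tank 5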
Richard Elling
2009-Jun-18 18:23 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Cor Beumer - Storage Solution Architect wrote:
> Hi Jose,
>
> Well, it depends on the total size of your zpool and how often these
> files are changed.

...and the average size of the files. For small files, it is likely that the default recordsize will not be optimal, for several reasons. Are these small files?
-- richard

> I was at a customer, a huge internet provider, who had 40x X4500
> with standard Solaris and using ZFS. All the machines were equipped
> with 48x 1TB disks. The machines were used to provide the email
> platform, so all the user email accounts were on the system. This
> also meant millions of files in one zpool.
> [...]
Gary Mills
2009-Jun-18 19:24 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Thu, Jun 18, 2009 at 12:12:16PM +0200, Cor Beumer - Storage Solution Architect wrote:
>
> What they noticed on the X4500 systems was that when the zpool became
> filled up to about 50-60%, the performance of the system dropped
> enormously. They claim this has to do with fragmentation of the ZFS
> filesystem. So we tried putting in an S7410 system with about the same
> disk config, 44x 1TB SATA but with 4x 18GB WriteZilla (in a stripe),
> and we were able to get much, much more I/O out of the system than the
> comparable X4500. However, they put it in production for a couple of
> weeks, and as soon as the ZFS filesystem came into the range of about
> 50-60% full they saw the same problem.

We had a similar problem with a T2000 and 2 TB of ZFS storage. Once the usage reached 1 TB, the write performance dropped considerably and the CPU consumption increased. Our problem was indirectly a result of fragmentation, but it was solved by a ZFS patch. I understand that this patch, which fixes a whole bunch of ZFS bugs, should be released soon. I wonder if this was your problem.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Richard Elling
2009-Jun-18 21:47 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Gary Mills wrote:
> On Thu, Jun 18, 2009 at 12:12:16PM +0200, Cor Beumer - Storage Solution Architect wrote:
>
>> What they noticed on the X4500 systems was that when the zpool became
>> filled up to about 50-60%, the performance of the system dropped
>> enormously.
>> [...]
>
> We had a similar problem with a T2000 and 2 TB of ZFS storage. Once
> the usage reached 1 TB, the write performance dropped considerably and
> the CPU consumption increased. Our problem was indirectly a result of
> fragmentation, but it was solved by a ZFS patch. I understand that
> this patch, which fixes a whole bunch of ZFS bugs, should be released
> soon. I wonder if this was your problem.

George would probably have the latest info, but there were a number of things which circled around the notorious "Stop looking and start ganging" bug report,
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6596237
-- richard
Rainer Orth
2009-Jun-19 11:17 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Richard Elling <richard.elling at gmail.com> writes:

> George would probably have the latest info, but there were a number of
> things which circled around the notorious "Stop looking and start ganging"
> bug report,
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6596237

Indeed: we were seriously bitten by this one, which took three Solaris 10 fileservers down for about a week until the problem was diagnosed by Sun Service and an IDR provided. Unfortunately, this issue (seriously fragmented pools, or pools beyond ca. 90% full, causing file servers to grind to a halt) was only announced/acknowledged publicly after our incident, although the problem seems to have been reported almost two years ago.

While a fix has been integrated into snv_114, there's still no patch for S10, only various IDRs. It's unclear what the state of the related CR 4854312 (need to defragment storage pool, submitted in 2003!) is. I suppose this might be dealt with by the vdev removal code, but overall it's scary that dealing with such fundamental issues takes so long.

Rainer

--
-----------------------------------------------------------------------------
Rainer Orth, Faculty of Technology, Bielefeld University
Roch Bourbonnais
2009-Jun-19 11:47 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On 18 June 2009, at 20:23, Richard Elling wrote:

> Cor Beumer - Storage Solution Architect wrote:
>> Hi Jose,
>>
>> Well, it depends on the total size of your zpool and how often these
>> files are changed.
>
> ...and the average size of the files. For small files, it is likely
> that the default recordsize will not be optimal, for several reasons.
> Are these small files?
> -- richard

Hey Richard, I have to correct that.

For small files and big files there is no need to tune the recordsize (files are stored as single, perfectly adjusted records up to the dataset recordsize property). Only for big files accessed and updated in aligned application records (RDBMS) does it help to tune the ZFS recordsize.

-r

>> I was at a customer, a huge internet provider, who had 40x X4500
>> with standard Solaris and using ZFS. All the machines were equipped
>> with 48x 1TB disks.
>> [...]
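As a hedged illustration of the one case where Roch says recordsize tuning does help (the dataset names and the 8K figure are hypothetical, chosen to match a database block size): set it at dataset creation time for the database files, and leave general file shares at the default.

    # Database files read and rewritten in fixed 8K blocks: align the
    # dataset recordsize with the application record size.
    zfs create -o recordsize=8k tank/oradata

    # File shares full of small files need nothing special; each small
    # file is already stored as a single right-sized record.
    zfs get recordsize tank/files        # default: 128K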
Thomas
2009-Jun-22 20:25 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi,

I have a raidz1 consisting of 6 5400rpm drives in this zpool. I have stored some media in one FS and 200k files in another. Neither FS is written to much. The pool is 85% full.

Could this issue also be the reason that, when I am playing (reading) some media, the playback is lagging?

OSOL ips_111b, E5200, 8GB RAM

Thank you
Bob Friesenhahn
2009-Jun-22 22:43 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Mon, 22 Jun 2009, Thomas wrote:

> I have a raidz1 consisting of 6 5400rpm drives in this zpool. I have
> stored some media in one FS and 200k files in another. Neither FS is
> written to much. The pool is 85% full.
>
> Could this issue also be the reason that, when I am playing (reading)
> some media, the playback is lagging?

Check to see if you have automated snapshots running. If snapshots make your pool full, then your pool will also be more likely to fragment new/modified files.

Make sure that you are using the default zfs blocksize of 128K, since smaller block sizes may increase fragmentation.

You may have a slow disk which is causing the whole pool to run slow. All it takes is one.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
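A sketch of how those three checks might look on the pool in question (the pool name tank is only an example):

    # 1. Any snapshots quietly holding space?
    zfs list -t snapshot -o name,used -s used

    # 2. Confirm the datasets still use the default 128K recordsize.
    zfs get -r recordsize tank

    # 3. Watch per-disk service times; one consistently slow disk can
    #    drag the whole raidz1 down.
    iostat -xn 5
    zpool iostat -v tank 5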
Thomas
2009-Jun-23 09:41 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
No snapshots running. I have only 21 filesystems mounted. The blocksize is the default one. A slow disk I don't think so, because I get read and write rates of about 350MB/s. The BIOS is the latest, and I also tried splitting the pool across two controllers; none of this helps.