Paisit Wongsongsarn
2009-Jun-17 03:26 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi Jose,

Would enabling the SSD (cache device usage) only for metadata help? This assumes that you have a read-optimized SSD in place. I have never tried it myself, but it is worth a try by just turning it on.

regards,
Paisit W.

Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
>
> We are proposing a 7410 Openstore appliance...
>
> He is claiming that certain operations like find, even if run from
> the Linux clients on such an NFS mountpoint, take significantly more
> time than if the same NFS share were provided by other NAS vendors
> like NetApp...
>
> Can someone confirm if this is really a problem for ZFS filesystems?...
>
> Is there any way to tune it?...
>
> We thank any input
>
> Best regards
>
> Jose

--
+---------*---------*---------*---------*---------*---------*---------+
Paisit Wongsongsarn
Regional Technical Lead (DMA & PFT)
Storage Practice, Sun Microsystems Asia South
DID: +65 6-239-2626, Mobile: +65 9-154-1717, Email: paisit at sun.com
Blogs: blogs.sun.com/paisit
+---------*---------*---------*---------*---------*---------*---------+
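On a plain Solaris/ZFS system the equivalent knobs are the per-dataset cache properties; a minimal sketch, assuming a pool named tank with a spare SSD available as a cache device (pool, dataset and device names here are only examples, and on the 7410 appliance itself this is driven from the management interface rather than the command line):

    # Attach a read-optimized SSD as an L2ARC cache device.
    zpool add tank cache c1t5d0

    # Make only metadata (dnodes, indirect blocks, directory entries)
    # eligible for the L2ARC on the busy dataset, so it is not pushed
    # out by streaming file data.
    zfs set secondarycache=metadata tank/files

    # primarycache controls the in-memory ARC the same way and is
    # normally left at its default of "all".
    zfs get primarycache,secondarycache tank/files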
Alan Hargreaves
2009-Jun-17 03:49 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Another question worth asking here is: is a find over the entire filesystem something that they would expect to execute regularly enough that its run time would have a business impact?

Part of the problem that I come across with people "benchmarking" is that they don't benchmark the operations that are critical to the business. Sure, we can spend a lot of time examining the issue and then addressing it; but would that actually address a real business concern, or just an "itch"?

Regards,
Alan Hargreaves

Paisit Wongsongsarn wrote:
> Hi Jose,
>
> Would enabling the SSD (cache device usage) only for metadata help?
> This assumes that you have a read-optimized SSD in place.
>
> I have never tried it myself, but it is worth a try by just turning it on.
>
> regards,
> Paisit W.
> [...]

--
Alan Hargreaves - http://blogs.sun.com/tpenta
Staff Engineer (Kernel/VOSJEC/Performance)
Asia Pacific/Emerging Markets
Sun Microsystems
Roch Bourbonnais
2009-Jun-17 10:33 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On 16 June 2009, at 19:55, Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
>
> We are proposing a 7410 Openstore appliance...
>
> He is claiming that certain operations like find, even if run from
> the Linux clients on such an NFS mountpoint, take significantly more
> time than if the same NFS share were provided by other NAS vendors
> like NetApp...

10%, 100%, 10000% or more? Knowing the magnitude helps diagnostics.

What kind of pool is this? This should be a read performance test: the pool type and the disks' rotational speed both affect the resulting performance.

> Can someone confirm if this is really a problem for ZFS filesystems?...

Nope.

> Is there any way to tune it?...
>
> We thank any input
>
> Best regards
>
> Jose
robert ungar
2009-Jun-17 11:21 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Jose,

I hope our openstorage experts weigh in on 'is this a good idea'; it sounds scary to me, but I'm overly cautious anyway.

I did want to raise the question of the client's other expectations for this opportunity: what are the intended data protection requirements, how will they back up and recover the files, do they intend to apply replication in support of a disaster recovery plan, and are the intended data protection schemes practical?

The other area that jumps out at me is concurrent access: in addition to the 'find' by 'many' clients, does the client have any performance requirements that must be met to ensure the solution is successful? Does any of the above have to happen at the same time?

I'm not in a position to evaluate these considerations for this opportunity, simply sharing some areas that, often enough, are overlooked as we address the chief complaint.

Regards,
Robert

> Jose Martins wrote:
>>
>> Hello experts,
>>
>> IHAC that wants to put more than 250 Million files on a single
>> mountpoint (in a directory tree with no more than 100 files on each
>> directory).
>> [...]

--
****************************
Robert C. Ungar ABCP
Professional Services Delivery
Storage Solutions Specialist
Telephone 585-598-9020
Sun Microsystems
345 Woodcliff Drive
Fairport, NY 14450
www.sun.com/storage
Eric D. Mudama
2009-Jun-17 16:03 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Wed, Jun 17 at 13:49, Alan Hargreaves wrote:
> Another question worth asking here is: is a find over the entire
> filesystem something that they would expect to execute regularly
> enough that its run time would have a business impact?

Exactly. That's such an odd business workload on 250,000,000 files that there isn't likely to be much of a shortcut other than throwing tons of spindles (or SSDs) at the problem, and/or having tons of memory.

If the finds are just by name, that's easy for the system to cache, but if you're expecting to run something against the output of find with -exec to parse/process 250M files on a regular basis, you'll likely be severely IO bound. Almost to the point of arguing for something like Hadoop or another form of distributed map/reduce on your dataset with a lot of nodes, instead of a single storage server.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
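To make the distinction concrete (an illustrative sketch, not from the original post; the paths are made up): a name-only find is driven by directory reads, which cache well, while anything that has to stat or open every file turns into per-file I/O across all 250M files.

    # Name-only search: needs only directory entries.
    find /mnt/bigshare -name '*.xml' > /tmp/xml-files.txt

    # Per-file processing: stats every entry (-newer) and opens every
    # match (-exec), so it is bound by random I/O on the server.
    find /mnt/bigshare -type f -newer /tmp/last-run -exec md5sum {} + > /tmp/sums.txt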
Louis Romero
2009-Jun-17 16:08 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Jose,

I believe the problem is endemic to Solaris. I have run into similar problems doing a simple find(1) in /etc. On Linux, a find operation in /etc is almost instantaneous; on Solaris, it has a tendency to spin for a long time.

I don't know what their use of find might be, but running updatedb on the Linux clients (with the NFS file system mounted, of course) and using locate(1) will give you a work-around on the Linux clients.

Caveat emptor: there is a staleness factor associated with this solution, as any new files dropped in after updatedb runs will not show up until the next updatedb run.

HTH

louis

On 06/16/09 11:55, Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
> [...]
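A minimal sketch of that workaround on a Debian client using mlocate (the mount point, database path and cron schedule are only examples): build a private locate database rooted at the NFS mount and query that instead of walking the tree with find.

    # Index the NFS mount into its own database (mlocate's default
    # configuration usually prunes NFS, hence the explicit root).
    updatedb -U /mnt/bigshare -o /var/lib/mlocate/bigshare.db

    # Query by name without touching the 250M files.
    locate -d /var/lib/mlocate/bigshare.db 'invoice-2009'

    # Refresh nightly from cron; results are only as fresh as this run.
    # 0 2 * * * root updatedb -U /mnt/bigshare -o /var/lib/mlocate/bigshare.db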
Dirk Nitschke
2009-Jun-17 17:38 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi Louis!

Solaris /usr/bin/find and Linux (GNU) find work differently! I experienced dramatic runtime differences some time ago. The reason is that Solaris find and GNU find use different algorithms.

GNU find uses the st_nlink ("number of links") field of the stat structure to optimize its work. Solaris find does not use this kind of optimization, because the meaning of "number of links" is not well defined and is file system dependent.

If you are interested, take a look at, say,

CR 4907267 link count problem in hsfs
CR 4462534 RFE: pcfs should emulate link counts for directories

Dirk

On 17.06.2009, at 18:08, Louis Romero wrote:

> Jose,
>
> I believe the problem is endemic to Solaris. I have run into
> similar problems doing a simple find(1) in /etc. On Linux, a find
> operation in /etc is almost instantaneous; on Solaris, it has a
> tendency to spin for a long time.
>
> I don't know what their use of find might be, but running updatedb
> on the Linux clients (with the NFS file system mounted, of course)
> and using locate(1) will give you a work-around on the Linux clients.
> [...]

--
Sun Microsystems GmbH      Dirk Nitschke
Nagelsweg 55               Storage Architect
20097 Hamburg              Phone: +49-40-251523-413
Germany                    Fax: +49-40-251523-425
http://www.sun.de/         Mobile: +49-172-847 62 66
                           Dirk.Nitschke at Sun.COM
-----------------------------------------------------------------------
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten - Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering
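The heuristic is easy to see from the shell (example commands, not part of Dirk's message): on filesystems with classic UNIX semantics a directory's link count is 2 plus its number of subdirectories, which lets GNU find stop stat()ing remaining entries once it has found them all; GNU find's -noleaf option turns that assumption off on filesystems where it does not hold.

    # Classic semantics: link count = 2 + number of subdirectories.
    mkdir -p /tmp/demo/a /tmp/demo/b /tmp/demo/c
    stat -c '%h %n' /tmp/demo        # prints: 5 /tmp/demo

    # Tell GNU find not to trust directory link counts (useful on
    # CD-ROM, FAT, or NFS exports of filesystems without the convention).
    find /mnt/bigshare -noleaf -name '*.xml'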
Casper.Dik at Sun.COM
2009-Jun-17 18:51 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
> Hi Louis!
>
> Solaris /usr/bin/find and Linux (GNU) find work differently! I
> experienced dramatic runtime differences some time ago. The reason is
> that Solaris find and GNU find use different algorithms.
>
> GNU find uses the st_nlink ("number of links") field of the stat
> structure to optimize its work. Solaris find does not use this kind
> of optimization, because the meaning of "number of links" is not well
> defined and is file system dependent.

But that's not what is under discussion: apparently the *same* clients get different performance from an OpenStorage system vs. a NetApp system.

I think we need to know much more, and I think OpenStorage can give the information you need.

(Yes, I did have problems because of GNU find's shortcuts: they don't work all the time.)

Casper
Louis Romero
2009-Jun-17 19:09 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
hi Dirk,

How might we explain running find on a Linux client, against an NFS-mounted file system on the 7000, taking significantly longer (i.e. performance behaving as though the command was run from Solaris)?

I'm not sure find would have the intelligence to differentiate between file system types and run different sections of code based upon what it finds.

louis

On 06/17/09 11:38, Dirk Nitschke wrote:
> Hi Louis!
>
> Solaris /usr/bin/find and Linux (GNU) find work differently! I
> experienced dramatic runtime differences some time ago. The reason is
> that Solaris find and GNU find use different algorithms.
>
> GNU find uses the st_nlink ("number of links") field of the stat
> structure to optimize its work. Solaris find does not use this kind
> of optimization, because the meaning of "number of links" is not well
> defined and is file system dependent.
>
> If you are interested, take a look at, say,
>
> CR 4907267 link count problem in hsfs
> CR 4462534 RFE: pcfs should emulate link counts for directories
>
> Dirk
> [...]
Joerg Schilling
2009-Jun-17 19:37 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Dirk Nitschke <Dirk.Nitschke at Sun.COM> wrote:

> Solaris /usr/bin/find and Linux (GNU) find work differently! I
> experienced dramatic runtime differences some time ago. The reason is
> that Solaris find and GNU find use different algorithms.

Correct: Solaris find honors the POSIX standard, GNU find does not :-(

> GNU find uses the st_nlink ("number of links") field of the stat
> structure to optimize its work. Solaris find does not use this kind
> of optimization, because the meaning of "number of links" is not well
> defined and is file system dependent.

GNU find makes illegal assumptions about the value of st_nlink for directories. These assumptions are derived from implementation specifics found in historic UNIX filesystem implementations, but there is no guarantee of the assumed behavior.

> If you are interested, take a look at, say,
>
> CR 4907267 link count problem in hsfs

Hsfs just shows you the numbers set up by the ISO-9660 filesystem creation utility. If you use a recent original mkisofs (like the one that has come with Solaris for the last 1.5 years), you get the same behavior for hsfs and UFS. The related feature was implemented in mkisofs in October 2006.

If you use "mkisofs" from one of the non-OSS-friendly Linux distributions like Debian, RedHat, Suse, Ubuntu or Mandriva, you use a 5-year-old mkisofs version, and thus the values of st_nlink for directories are random numbers.

The problems caused by programs that ignore POSIX rules have been discussed several times on the POSIX mailing list. To solve the issue, I proposed several times to introduce a new pathconf() call that allows one to ask whether a directory has historic UFS semantics for st_nlink. Hsfs knows whether the filesystem was created by a recent mkisofs and thus would be able to give the right return value. NFS clients would need an RPC that allows them to retrieve the value from the exported filesystem on the server side.

> On 17.06.2009, at 18:08, Louis Romero wrote:
>
>> Jose,
>>
>> I believe the problem is endemic to Solaris. I have run into
>> similar problems doing a simple find(1) in /etc. On Linux, a find
>> operation in /etc is almost instantaneous; on Solaris, it has a

If you ignore standards you may get _apparent_ speed. On Linux this speed is usually bought by giving up correctness.

>> tendency to spin for a long time. I don't know what their use of
>> find might be, but running updatedb on the Linux clients (with the
>> NFS file system mounted, of course) and using locate(1) will give you
>> a work-around on the Linux clients.

With NFS, things are even more complex and in principle similar to the hsfs case, where the OS filesystem implementation just shows you the values set up by mkisofs. On an NFS client, you see the numbers that have been set up by the NFS server, but you don't know what filesystem type is underneath on the NFS server.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Cor Beumer - Storage Solution Architect
2009-Jun-18 10:12 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi Jose,

Well, it depends on the total size of your zpool and how often these files are changed.

I was at a customer, a huge internet provider, who had 40x X4500 with standard Solaris and using ZFS. All the machines were equipped with 48x 1TB disks. The machines were used to provide the email platform, so all the user email accounts were on the system. This also meant millions of files in one zpool.

What they noticed on the X4500 systems was that when the zpool became filled up to about 50-60%, the performance of the system dropped enormously. They claim this has to do with fragmentation of the ZFS filesystem. So we tried putting in an S7410 system with about the same disk config, 44x 1TB SATA but with 4x 18GB WriteZilla (in a stripe), and we were able to get much, much more I/O out of the system than the comparable X4500. However, they put it in production for a couple of weeks, and as soon as the ZFS filesystem came into the range of about 50-60% full they saw the same problem: the performance dropped enormously.

NetApp has the same problem with their WAFL filesystem (they also tested this); however, NetApp does provide a defragmentation tool for it. This is also NOT a nice solution, because you have to run it, manually or scheduled, and it takes a lot of system resources, but it helps.

I hear Sun is denying that we have this problem in ZFS, and therefore that we don't need a kind of defragmentation mechanism; however, our customer experiences are different............

Maybe it is good for the ZFS group to look at this (potential) problem. The customer I am talking about is willing to share their experiences with Sun engineering.

greetings,

Cor Beumer

Jose Martins wrote:
>
> Hello experts,
>
> IHAC that wants to put more than 250 Million files on a single
> mountpoint (in a directory tree with no more than 100 files on each
> directory).
>
> He wants to share such filesystem by NFS and mount it through
> many Linux Debian clients.
> [...]

--
Cor Beumer
Data Management & Storage
Sun Microsystems Nederland BV
Saturnus 1
3824 ME Amersfoort The Netherlands
Phone +31 33 451 5172
Mobile +31 6 51 603 142
Email Cor.Beumer at Sun.COM
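Not part of Cor's message, but for anyone wanting to watch for the same symptom: the fill level and the per-vdev write pattern can be tracked with standard commands as the pool grows (the pool name tank is just an example).

    # Overall capacity; the complaints in this thread start well before
    # the pool is actually full.
    zpool list tank

    # Per-vdev bandwidth and IOPS, sampled every 5 seconds; a pool that
    # has to hunt for contiguous free space tends to show more, smaller
    # writes per vdev for the same workload.
    zpool iostat -v tank 5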
Richard Elling
2009-Jun-18 18:23 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Cor Beumer - Storage Solution Architect wrote:
> Hi Jose,
>
> Well, it depends on the total size of your zpool and how often these
> files are changed.

...and the average size of the files. For small files, it is likely that the default recordsize will not be optimal, for several reasons. Are these small files?
-- richard

> I was at a customer, a huge internet provider, who had 40x X4500
> with standard Solaris and using ZFS. All the machines were equipped
> with 48x 1TB disks. The machines were used to provide the email
> platform, so all the user email accounts were on the system. This
> also meant millions of files in one zpool.
> [...]
Gary Mills
2009-Jun-18 19:24 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Thu, Jun 18, 2009 at 12:12:16PM +0200, Cor Beumer - Storage Solution Architect wrote:
>
> What they noticed on the X4500 systems was that when the zpool became
> filled up to about 50-60%, the performance of the system dropped
> enormously. They claim this has to do with fragmentation of the ZFS
> filesystem. So we tried putting in an S7410 system with about the same
> disk config, 44x 1TB SATA but with 4x 18GB WriteZilla (in a stripe),
> and we were able to get much, much more I/O out of the system than the
> comparable X4500. However, they put it in production for a couple of
> weeks, and as soon as the ZFS filesystem came into the range of about
> 50-60% full they saw the same problem.

We had a similar problem with a T2000 and 2 TB of ZFS storage. Once the usage reached 1 TB, the write performance dropped considerably and the CPU consumption increased. Our problem was indirectly a result of fragmentation, but it was solved by a ZFS patch. I understand that this patch, which fixes a whole bunch of ZFS bugs, should be released soon. I wonder if this was your problem.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Richard Elling
2009-Jun-18 21:47 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Gary Mills wrote:
> On Thu, Jun 18, 2009 at 12:12:16PM +0200, Cor Beumer - Storage Solution Architect wrote:
>
>> What they noticed on the X4500 systems was that when the zpool became
>> filled up to about 50-60%, the performance of the system dropped
>> enormously.
>> [...]
>
> We had a similar problem with a T2000 and 2 TB of ZFS storage. Once
> the usage reached 1 TB, the write performance dropped considerably and
> the CPU consumption increased. Our problem was indirectly a result of
> fragmentation, but it was solved by a ZFS patch. I understand that
> this patch, which fixes a whole bunch of ZFS bugs, should be released
> soon. I wonder if this was your problem.

George would probably have the latest info, but there were a number of things which circled around the notorious "Stop looking and start ganging" bug report,
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6596237
-- richard
Rainer Orth
2009-Jun-19 11:17 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Richard Elling <richard.elling at gmail.com> writes:

> George would probably have the latest info, but there were a number of
> things which circled around the notorious "Stop looking and start ganging"
> bug report,
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6596237

Indeed: we were seriously bitten by this one, which took three Solaris 10 fileservers down for about a week until the problem was diagnosed by Sun Service and an IDR provided. Unfortunately, this issue (seriously fragmented pools, or pools beyond ca. 90% full, causing file servers to grind to a halt) was only announced/acknowledged publicly after our incident, although the problem seems to have been reported almost two years ago.

While a fix has been integrated into snv_114, there's still no patch for S10, only various IDRs. It's unclear what the state of the related CR 4854312 (need to defragment storage pool, submitted in 2003!) is. I suppose this might be dealt with by the vdev removal code, but overall it's scary that dealing with such fundamental issues takes so long.

Rainer

--
-----------------------------------------------------------------------------
Rainer Orth, Faculty of Technology, Bielefeld University
Roch Bourbonnais
2009-Jun-19 11:47 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On 18 June 2009, at 20:23, Richard Elling wrote:

> Cor Beumer - Storage Solution Architect wrote:
>> Hi Jose,
>>
>> Well, it depends on the total size of your zpool and how often these
>> files are changed.
>
> ...and the average size of the files. For small files, it is likely
> that the default recordsize will not be optimal, for several reasons.
> Are these small files?
> -- richard

Hey Richard, I have to correct that.

For small files and big files there is no need to tune the recordsize (files are stored as single, perfectly adjusted records up to the dataset recordsize property). Only for big files accessed and updated in aligned application records (RDBMS) does it help to tune the ZFS recordsize.

-r

>> I was at a customer, a huge internet provider, who had 40x X4500
>> with standard Solaris and using ZFS. All the machines were equipped
>> with 48x 1TB disks.
>> [...]
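As a hedged illustration of the one case where Roch says recordsize tuning does help (the dataset names and the 8K figure are hypothetical, chosen to match a database block size): set it at dataset creation time for the database files, and leave general file shares at the default.

    # Database files read and rewritten in fixed 8K blocks: align the
    # dataset recordsize with the application record size.
    zfs create -o recordsize=8k tank/oradata

    # File shares full of small files need nothing special; each small
    # file is already stored as a single right-sized record.
    zfs get recordsize tank/files        # default: 128K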
Thomas
2009-Jun-22 20:25 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
Hi,

I have a raidz1 consisting of 6 5400rpm drives in this zpool. I have stored some media in one FS and 200k files in another. Neither FS is written to much. The pool is 85% full.

Could this issue also be the reason that, when I am playing (reading) some media, the playback is lagging?

OSOL ips_111b, E5200, 8GB RAM

Thank you
Bob Friesenhahn
2009-Jun-22 22:43 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Mon, 22 Jun 2009, Thomas wrote:

> I have a raidz1 consisting of 6 5400rpm drives in this zpool. I have
> stored some media in one FS and 200k files in another. Neither FS is
> written to much. The pool is 85% full.
>
> Could this issue also be the reason that, when I am playing (reading)
> some media, the playback is lagging?

Check to see if you have automated snapshots running. If snapshots make your pool full, then your pool will also be more likely to fragment new/modified files.

Make sure that you are using the default zfs blocksize of 128K, since smaller block sizes may increase fragmentation.

You may have a slow disk which is causing the whole pool to run slow. All it takes is one.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
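A sketch of how those three checks might look on the pool in question (the pool name tank is only an example):

    # 1. Any snapshots quietly holding space?
    zfs list -t snapshot -o name,used -s used

    # 2. Confirm the datasets still use the default 128K recordsize.
    zfs get -r recordsize tank

    # 3. Watch per-disk service times; one consistently slow disk can
    #    drag the whole raidz1 down.
    iostat -xn 5
    zpool iostat -v tank 5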
Thomas
2009-Jun-23 09:41 UTC
[zfs-discuss] Lots of metadata overhead on filesystems with 100M files
No snapshots running. I have only 21 filesystems mounted. The blocksize is the default one. A slow disk I don't think so, because I get read and write rates of about 350MB/s. The BIOS is the latest, and I also tried splitting the pool across two controllers; none of this helps.