Hello,

I'd like to plan a storage solution for a system currently in production.

The system's storage is based on code which writes many files to the file system, with overall storage needs currently around 40TB and expected to reach hundreds of TBs. The average file size of the system is ~100K, which translates to ~500 million files today, and billions of files in the future. This storage is accessed over NFS by a rack of 40 Linux blades, and is mostly read-only (99% of the activity is reads). While I realize that calling this sub-optimal system design is probably an understatement, the design of the system is beyond my control and isn't likely to change in the near future.

The system's current storage is based on 4 VxFS filesystems, created on SVM meta-devices, each ~10TB in size. A 2-node Sun Cluster serves the filesystems, 2 filesystems per node. Each of the filesystems undergoes growfs as more storage is made available.

We're looking for an alternative solution, in an attempt to improve performance and the ability to recover from disasters (fsck on 2^42 files isn't practical, and I'm getting pretty worried about this - even the smallest filesystem inconsistency will leave me with lots of useless bits).

The question is: does anyone here have experience with large ZFS filesystems holding many small files? Is it practical to base such a solution on a few (8) zpools, each with a single large filesystem in it?

Many thanks in advance for any advice,
 - Yaniv
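For reference, the layout being asked about - a handful of zpools, each carrying one large filesystem exported over NFS - might look roughly like the sketch below. The pool names, device names and property choices are made up for illustration and are not from the original post.

    # Hypothetical sketch: 8 zpools, one filesystem each, exported over NFS.
    # Device names (c1t0d0 ...) and pool names are placeholders only.
    for i in 1 2 3 4 5 6 7 8
    do
        zpool create tank$i raidz2 c${i}t0d0 c${i}t1d0 c${i}t2d0 c${i}t3d0 c${i}t4d0 c${i}t5d0
        zfs create tank$i/data
        zfs set atime=off tank$i/data      # skip access-time updates on a ~99%-read workload
        zfs set sharenfs=on tank$i/data    # export over NFS to the Linux blades
    done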
Hello Yaniv,

Wednesday, April 18, 2007, 3:44:57 PM, you wrote:

YA> [original question snipped - 40TB today growing to hundreds of TBs, ~100K average
YA> file size, ~500 million files going on billions, served over NFS to 40 Linux blades]

I have "some" experience with a similar but bigger environment, with a lot of data already on ZFS (for years now), although I can't talk about many details...

One of the problems is: how are you going to back up all this data? With so many small files the classical approach probably won't work, and if it does now, it won't in the (near) future. I would strongly suggest disk-to-disk backup plus snapshots for point-in-time backups.

With lots of small files I observed ZFS consuming about the same disk space as UFS. It seems there's a problem with filesystem fragmentation after some time with lots of files (zfs send|recv helps for some time).

While I see no problem going with one file system (the pool itself?) in each zpool, with TBs of data I would consider splitting it into more file systems, mostly for "management" reasons like backup and snapshotting. Splitting into more file systems also helps when you have to migrate one of the file systems to other storage - it's easier to find 1TB of storage than 20TB. I try to keep each production file system below 1TB, not that there are any problems with larger file systems.

When doing Sun Cluster, consider creating at least as many zpools as you have nodes in the cluster, so that if you have to, you can spread your workload out across the nodes (put each zpool in a different SC resource group with its own IP).

We did some tests with Linux (2.4 and 2.6) and it seems there's a problem if you have thousands of NFS file systems - they won't all be mounted automatically, and even doing it manually (or in a script with a sleep between each mount) there seems to be a limit below 1000. We did not investigate further, as in that environment all NFS clients are Solaris servers (x86, sparc) and we see no problems with thousands of file systems.
If you switch a resource group from one node to another node that is already serving another NFS resource group, keep in mind that nfsd will actually restart, which means a service disruption for that other group as well. With ZFS, stopping nfsd can sometimes take minutes...

There are also more things to consider (storage layout, network config, etc.).

-- 
Best regards,
Robert Milkowski
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
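The disk-to-disk backup plus snapshots approach recommended above can be sketched roughly as follows. The pool, filesystem and snapshot names are hypothetical; the zfs snapshot / send / receive commands themselves are the standard ones.

    # Point-in-time copies are cheap and instant:
    zfs snapshot tank1/data@2007-04-19

    # First, a full replication to a backup pool (the stream could also be piped
    # over ssh to another host):
    zfs send tank1/data@2007-04-19 | zfs receive backup/data

    # Later runs only ship blocks changed since the previous snapshot, so the cost
    # scales with changed data rather than with the number of files:
    zfs snapshot tank1/data@2007-04-20
    zfs send -i tank1/data@2007-04-19 tank1/data@2007-04-20 | zfs receive backup/data

    # Splitting the data into many smaller filesystems (staying below ~1TB each,
    # as suggested above) keeps snapshots, backups and migrations manageable:
    zfs create tank1/data/fs001
    zfs set quota=1T tank1/data/fs001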
Robert Milkowski wrote:
> We did some tests with Linux (2.4 and 2.6) and it seems there's a
> problem if you have thousands of NFS file systems - they won't all be
> mounted automatically, and even doing it manually (or in a script with
> a sleep between each mount) there seems to be a limit below 1000. [...]

The Linux limitation is possibly due to privileged port exhaustion with TCP mounts, FYI.

-- Carson
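If anyone wants to check whether a Linux client really is running out of reserved ports, a rough count of TCP connections bound to privileged local ports (each TCP NFS mount normally holds one) can be had with standard tools. This is only a diagnostic sketch:

    # Count established TCP connections whose local port is below 1024.
    # A figure approaching ~1000 on an NFS client points at reserved-port
    # exhaustion rather than a server-side problem.
    netstat -tn | awk 'NR > 2 { n = split($4, a, ":"); p = a[n] + 0; if (p > 0 && p < 1024) c++ } END { print c + 0 }'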
Hello Carson,

Thursday, April 19, 2007, 1:22:17 AM, you wrote:

CG> The Linux limitation is possibly due to privileged port exhaustion with
CG> TCP mounts, FYI.

We've been thinking along the same lines (1024 minus the ports used by services already running). But still, with a few hundred NFS entries Linux times out and you end up with some file systems not mounted, etc.

-- 
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
On Apr 18, 2007, at 6:44 PM, Robert Milkowski wrote:
> We've been thinking along the same lines (1024 minus the ports used by
> services already running).
>
> But still, with a few hundred NFS entries Linux times out and you end up
> with some file systems not mounted, etc.

See the Linux NFS FAQ at http://nfs.sourceforge.net/ Question/Answer B3. There is a limit of a few hundred NFS mounts.

Spencer
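For completeness, the kind of client-side mount loop described earlier in the thread (mounting the exports one by one with a pause between them) might look like the sketch below. The server name, export paths and mount options are invented for the example, and per the FAQ above it will still hit a wall at a few hundred TCP mounts.

    #!/bin/sh
    # Hypothetical client-side mount loop; names and paths are placeholders.
    SERVER=nfsserver
    i=1
    while [ $i -le 500 ]
    do
        mkdir -p /mnt/fs$i
        mount -o ro,hard,intr,tcp $SERVER:/tank$(( (i % 8) + 1 ))/fs$i /mnt/fs$i || echo "mount $i failed"
        sleep 1        # spacing the mounts out, as described earlier in the thread
        i=$(( i + 1 ))
    done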
Hello Spencer,

Thursday, April 19, 2007, 2:28:30 AM, you wrote:

SS> See the Linux NFS FAQ at http://nfs.sourceforge.net/
SS> Question/Answer B3. There is a limit of a few hundred
SS> NFS mounts.

Thanks.

-- 
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Hi Robert, thanks for the information.

I understand from your words that you're more worried about overall filesystem size rather than the number of files, yes? Is the number of files something I should or should not worry about? I.e., what are the differences (in stability, recoverability, performance, manageability, etc.) between a 25TB filesystem with 2^35 files and a 25TB filesystem with 1,000 files of 25GB each?

Also, if it's possible to ask without stepping outside any of your customers' NDAs, can you at least say what the average file size is on some of your multi-terabyte volumes (is 10K a small file? is 100K? 1K?) Is anyone else on the forum able to quote their numbers?

Regarding your Sun Cluster recommendations - thanks, I'll do just that.

Thanks again and regards,
 - Yaniv
Hello Aknin,

Thursday, April 19, 2007, 7:20:26 AM, you wrote:

> I understand from your words that you're more worried about overall filesystem size rather than the number of files, yes?
> Is the number of files something I should or should not worry about? I.e., what are the differences (in stability,
> recoverability, performance, manageability, etc.) between a 25TB filesystem with 2^35 files and a 25TB filesystem with
> 1,000 files of 25GB each?

If you are OK with your application having to access lots of small files, then it's not an issue - except for backup. It really depends on how you want to do your backups. Lots of small files are bad, very bad, for classical backup solutions. In terms of many small files I see no problem with stability, recoverability or performance (depends on the app and workload).

Now, the difference in the scenario you asked about is that if you want to back up 1,000 files, then depending on what file system you use and how you created those files, you're probably going to read them mostly sequentially at the physical layer. Also, it's very cheap in most cases to check whether 1,000 files have changed, instead of millions.

As I wrote - if your app/workload is happy with many small files, then fine. But you'll definitely have a problem with backup.

> Also, if it's possible to ask without stepping outside any of your customers' NDAs, can you at least say what the average
> file size is on some of your multi-terabyte volumes (is 10K a small file? is 100K? 1K?)

I'm afraid I can't :( But I can say that to me anything below 512KB is a small file (starting from a few bytes). Also, the file size distribution is such that I have mostly small files and large files; the remaining 10% is somewhere in between.

-- 
Best regards,
Robert Milkowski
mailto:rmilkowski@task.gda.pl
http://milek.blogspot.com
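To make the backup contrast concrete: a classical file-level incremental has to examine every file just to find the changed ones, while a snapshot-based incremental never looks at individual files at all. A rough illustration with hypothetical paths and timestamp file (and assuming find/cpio on the backup host):

    # Classical file-level incremental: walks the tree and stat()s every file to
    # find what changed since the last run, so the cost scales with the number of
    # files (hundreds of millions here) even if almost nothing changed.
    cd /tank1/data &&
        find . -type f -newer /var/run/last-backup | cpio -oc > /backup/incr-2007-04-19.cpio &&
        touch /var/run/last-backup
    # A snapshot plus incremental "zfs send" (as sketched earlier in the thread)
    # skips the per-file walk entirely; the work scales with changed blocks instead.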
Yaniv Aknin wrote:
> I understand from your words that you're more worried about overall filesystem size rather than the number of files, yes?
> Is the number of files something I should or should not worry about? I.e., what are the differences (in stability,
> recoverability, performance, manageability, etc.) between a 25TB filesystem with 2^35 files and a 25TB filesystem with
> 1,000 files of 25GB each?

I don't think we anticipate problems, but I'm not sure there are a lot of people who have done this yet. We do know of such limitations in UFS and other file systems, which do not exist in ZFS, by design.

 -- richard
You should definitely worry about the number of files when it comes to backup and management. It will also make a big difference in space overhead: a ZFS filesystem with 2^35 files will have a minimum of 2^44 bytes of overhead just for the file nodes, which is about 16 TB.

If it takes about 20 ms of overhead to back up a file (2 seeks), then 2^35 files will take 21 years to back up. ;-)

I'm guessing you didn't really mean 2^35, though. (If you did, you're likely to need a system along the lines of DARPA's HPCS program....)

Anton
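Anton's figures work out if you assume roughly 512 bytes (2^9) of on-disk metadata per file node, which is where 2^35 files x 2^9 bytes = 2^44 bytes comes from. A quick bc check of both numbers:

    # 2^35 file nodes at ~512 bytes each:
    echo '2^35 * 2^9' | bc                            # 17592186044416 bytes
    echo '(2^35 * 2^9) / 1024^4' | bc                 # = 16 TiB

    # 2^35 files at ~20 ms (two seeks) of per-file backup overhead:
    echo '2^35 * 0.020 / (3600 * 24 * 365)' | bc -l   # ~21.8 years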