oooooooooooo ooooooooooooo
2009-Jul-08 06:27 UTC
[CentOS] Question about optimal filesystem with many small files.
Hi,

I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).

The filesystem I use is ext3 with the following options enabled:

Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file

Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?

Thanks in advance (and please excuse my bad English).
Niki Kovacs
2009-Jul-08 06:41 UTC
[CentOS] Question about optimal filesystem with many small files.
oooooooooooo ooooooooooooo wrote:
> Hi,
>
> I have a program that writes lots of files to a directory tree

Did that program also write your address header? :o)
Per Qvindesland
2009-Jul-08 06:43 UTC
[CentOS] Question about optimal filesystem with many small files.
Perhaps think about running tune2fs; maybe also consider adding noatime.

Regards
Per

E-mail: per at norhex.com
http://www.linkedin.com/in/perqvindesland
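[Editor's note: a concrete example of that suggestion; the device name and mount point below are assumptions, so substitute the real cache filesystem.]

    # show which ext3 features are enabled (dir_index should appear in the list)
    tune2fs -l /dev/sdb1 | grep 'Filesystem features'

    # /etc/fstab entry for the cache filesystem, with noatime so reads stop
    # rewriting inode access times
    /dev/sdb1   /cache   ext3   defaults,noatime   1 2

    # apply without a reboot
    mount -o remount,noatime /cache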
Les Mikesell
2009-Jul-08 15:56 UTC
[CentOS] Question about optimal filesystem with many small files.
oooooooooooo ooooooooooooo wrote:
> Hi,
>
> I have a program that writes lots of files to a directory tree (around 15
> million files), and a node can have up to 400000 files [...]

I haven't done, or even seen, any recent benchmarks, but I'd expect reiserfs to still be the best at that sort of thing. However, even if you can improve things slightly, do not let whoever is responsible for that application ignore the fact that it is a horrible design that ignores a very well known problem that has easy solutions. And don't ever do business with someone who would write a program like that again.

Any way you approach it, when you want to write a file the system must check to see if the name already exists, and if not, create it in an empty space that it must also find - and this must be done atomically, so the directory must be locked against other concurrent operations until the update is complete. If you don't index the contents, the lookup is a slow linear scan; if you do, you then have to rewrite the index on every change, so you can't win. Sensible programs that expect to access a lot of files will build a tree structure to break up the number that land in any single directory (see squid for an example). Even more sensible programs would re-use some existing caching mechanism like squid or memcached instead of writing a new one badly.

-- 
Les Mikesell
lesmikesell at gmail.com
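[Editor's note: for reference, the squid-style tree Les points to just spreads file names over a fixed set of nested subdirectories. A minimal sketch in Python; the 16x256 split matches squid's default cache_dir layout, and using MD5 to pick the bucket is only an illustration, not anything the original application does.]

    import hashlib
    import os

    def squid_style_path(root, name, l1=16, l2=256):
        """Place 'name' under root/<d1>/<d2>/ so no single directory gets huge."""
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        d1 = "%02X" % (h % l1)
        d2 = "%02X" % ((h // l1) % l2)
        subdir = os.path.join(root, d1, d2)
        if not os.path.isdir(subdir):
            os.makedirs(subdir)
        return os.path.join(subdir, name)

    # With 16 * 256 = 4096 buckets, 15 million files average roughly 3700 per
    # directory instead of 400000 in one place.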
Kwan Lowe
2009-Jul-08 16:23 UTC
[CentOS] Question about optimal filesystem with many small files.
On Wed, Jul 8, 2009 at 2:27 AM, oooooooooooo ooooooooooooo <hhh735 at hotmail.com> wrote:
> Hi,
>
> I have a program that writes lots of files to a directory tree (around 15
> million files), and a node can have up to 400000 files [...]

I saw this article some time back:

http://www.linux.com/archive/feature/127055

I've not implemented it, but from past experience you may lose some performance initially, though the database-backed filesystem's performance might be more consistent as the number of files grows.
oooooooooooo ooooooooooooo
2009-Jul-08 21:59 UTC
[CentOS] Question about optimal filesystem with many small files.
> Perhaps think about running tune2fs; maybe also consider adding noatime

Yes, I added it and got a performance increase; anyway, as the number of files grows the speed keeps dropping below an acceptable level.

> I saw this article some time back:
> http://www.linux.com/archive/feature/127055

Good idea. I already use MySQL for indexing the files, so every time I need to make a lookup I don't need to read the entire dir to get the file; but my requirement is to keep the files themselves on disk.

> The only way to deal with it (especially if the application adds and removes
> these files regularly) is to every once in a while copy the files to another
> directory, nuke the directory and restore from the copy.

Thanks, but there will not be many file updates once the cache is built, so recreating directories would not be very helpful here. The issue is that as the number of files grows, both reads of existing files and new insertions get slower and slower.

> I haven't done, or even seen, any recent benchmarks but I'd expect
> reiserfs to still be the best at that sort of thing.

I've been looking at some benchmarks and reiser seems a bit faster in my scenario; however, my problem appears when I have a large number of files, so from what I have seen I'm not sure reiser would be a fix...

> However even if you can improve things slightly, do not let whoever is
> responsible for that application ignore the fact that it is a horrible
> design that ignores a very well known problem that has easy solutions.

My original idea was to store each file under a hash of its name, and keep a hash -> real filename mapping in MySQL. That way I have direct access to the file and I can make a directory hierarchy from the first characters of the hash (/c/0/2/a), so I would have 16^4 = 65536 leaf directories, the files would be evenly distributed, and there would be around 200 files per dir (which should not give any performance issues). But the requirement is to use the real file name for the directory tree, which is what causes the issue.

> Did that program also write your address header?

:) Thanks for the help.
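[Editor's note: a minimal sketch of the hash-based layout described above, assuming a hypothetical /cache mount point; this restates the poster's idea, it is not the application's actual code.]

    import hashlib
    import os

    CACHE_ROOT = "/cache"   # hypothetical mount point for the cache tree

    def hashed_path(real_name):
        """Map a filename to a /c/0/2/a/<md5> style path: four one-hex-char
        levels gives 16^4 = 65536 leaf directories."""
        digest = hashlib.md5(real_name.encode()).hexdigest()
        leaf = os.path.join(CACHE_ROOT, digest[0], digest[1], digest[2], digest[3])
        if not os.path.isdir(leaf):
            os.makedirs(leaf)
        return os.path.join(leaf, digest)

    # 15M files / 65536 leaves is roughly 230 files per directory; the
    # digest -> real filename mapping would live in MySQL, so lookups never
    # have to scan a directory.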
James A. Peltier
2009-Jul-09 03:13 UTC
[CentOS] Question about optimal filesystem with many small files.
On Wed, 8 Jul 2009, oooooooooooo ooooooooooooo wrote:
> Hi,
>
> I have a program that writes lots of files to a directory tree (around 15
> million files), and a node can have up to 400000 files [...]

There isn't a good file system for this type of thing. Filesystems with many very small files are always slow. Ext3, XFS, JFS are all terrible for this type of thing. Rethink how you're writing files or you'll be in a world of hurt.

-- 
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
HPC Coordinator
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573     Fax : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
          http://blogs.sfu.ca/people/jpeltier
MSN     : subatomic_spam at hotmail.com

The point of the HPC scheduler is to keep everyone equally unhappy.
James A. Peltier
2009-Jul-09 03:14 UTC
[CentOS] Question about optimal filesystem with many small files.
On Wed, 8 Jul 2009, oooooooooooo ooooooooooooo wrote:
> Hi,
>
> I have a program that writes lots of files to a directory tree (around 15
> million files), and a node can have up to 400000 files [...]

BTW, you can pretty much say goodbye to any backup solution for this type of project as well. They'll all die dealing with a file system structure like this.

-- 
James A. Peltier
jpeltier at sfu.ca
oooooooooooo ooooooooooooo
2009-Jul-13 05:49 UTC
[CentOS] Question about optimal filesystem with many small files.
> How many files per directory do you have?

I have 4 directory levels, 65536 leaf directories and around 200 files per dir (15M in total).

> Something is wrong. Got to figure this out. Where did this RAM go?

Thanks. I reduced the memory usage of MySQL and of my app, and I got around a 15% performance increase. Now my atop looks like this (currently reading only cached files from disk):

PRC | sys   0.51s | user  9.29s | #proc    114 | #zombie    0 | #exit      0 |
CPU | sys      4% | user    93% | irq       1% | idle    208% | wait     94% |
cpu | sys      2% | user    48% | irq       1% | idle     21% | cpu001 w 28% |
cpu | sys      1% | user    17% | irq       0% | idle     41% | cpu000 w 40% |
cpu | sys      1% | user    14% | irq       0% | idle     74% | cpu003 w 12% |
cpu | sys      1% | user    13% | irq       0% | idle     72% | cpu002 w 14% |
CPL | avg1   3.45 | avg5   7.42 | avg15  10.76 | csw    15891 | intr   11695 |
MEM | tot    2.0G | free  51.2M | cache 587.8M | buff    1.0M | slab  281.2M |
SWP | tot    1.9G | free   1.9G |              | vmcom   1.6G | vmlim   2.9G |
PAG | scan   3072 | stall     0 |              | swin       0 | swout      0 |
DSK |         sdb | busy    89% | read    1451 | write      0 | avio    6 ms |
DSK |         sda | busy     6% | read     178 | write     54 | avio    2 ms |
NET | transport   | tcpi   3631 | tcpo    3629 | udpi       0 | udpo       0 |
NET | network     | ipi    3632 | ipo     3630 | ipfrw      0 | deliv   3632 |
NET | eth0     0% | pcki      5 | pcko       3 | si    0 Kbps | so    1 Kbps |
NET | lo     ---- | pcki   3627 | pcko    3627 | si  775 Kbps | so  775 Kbps |

> It is 1024 chars long. Which still won't help.

I'm using MyISAM, and according to
http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html
"The maximum key length is 1000 bytes. This can also be changed by changing the source and recompiling. For the case of a key longer than 250 bytes, a larger key block size than the default of 1024 bytes is used."

> I would not store images in either one as your SELECT LIKE and Random will kill it.

Well, I think that can be avoided; using only searches on the key fields should not give those issues. Does somebody have experience storing a large amount of medium (1KB-150KB) BLOB objects in MySQL?

> However I have not a clue that this is even doable in MySQL.

In MySQL there is already an MD5 function:
http://dev.mysql.com/doc/refman/5.1/en/encryption-functions.html#function_md5

Thanks for the help.
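[Editor's note: a rough sketch of the MD5-keyed lookup the last two quotes point at; the table and column names are made up for illustration, and the blob column is optional. A CHAR(32) MD5 key stays far below MyISAM's 1000-byte key-length limit while the long real name lives in an ordinary column.]

    import hashlib

    # Illustrative DDL; nothing here is the poster's actual schema.
    DDL = """
    CREATE TABLE file_cache (
        name_hash CHAR(32)      NOT NULL PRIMARY KEY,  -- MD5(real_name), or MySQL's MD5()
        real_name VARCHAR(1000) NOT NULL,
        body      MEDIUMBLOB    NULL                   -- only if the blobs move into MySQL
    ) ENGINE=MyISAM;
    """

    def key_for(real_name):
        """Client-side equivalent of MySQL's MD5() on the same byte string."""
        return hashlib.md5(real_name.encode()).hexdigest()

    # Lookups then hit the primary key directly instead of LIKE / linear scans:
    #   SELECT body FROM file_cache WHERE name_hash = '<digest>'
    print(DDL)
    print(key_for("some/real/file name.dat"))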