I am seeing similar problems to Sean McCauliff (2007-08-02) using ext3. I
have a simple test that times file creations in a hashed directory
structure. File creation time inexorably increases as the number of files
in the filesystem increases. Altering variables can change the absolute
performance, but I always see the steady performance degradation.

None of the following has any material effect on the steady drop in
performance:

- File length (1k, 4k, 16k)
- Directory depth (5, 10, 15)
- Average & max files per directory (10, 20, 100)
- Single- or multi-threaded test
- Moving the test directory to a new name on the same filesystem and
  restarting the test
- Directory hash
- RAID10 vs. simple disk
- Linux distribution (RHEL, Ubuntu)
- System memory (32 GB, 2 GB)
- Syncing after each close
- Free space
- Partition age (old, perhaps fragmented, a bit dirty vs. a new fs)

Performance seems to always map directly to the number of files in the
ext3 filesystem.

After some initial run-fast time, perhaps once dirty pages begin to be
written aggressively, for every 5,000 files added, my files created per
second tends to drop by about one. So, depending on the variables, say
with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop,
then more slowly drop to ~300 files/sec at perhaps 1 million files, then
see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc.

As you'd expect, there isn't much CPU utilization, other than iowait,
and some kjournald activity.

Is this a known limitation of ext3? Is expecting to write to
O(10^6)-O(10^7) files in something approaching constant time expecting
too much from a filesystem? What, exactly, am I stressing to cause this
unbounded performance degradation?

Thanks,
-John Kalucki
ext3 at kalucki.com

----

Hi all,

I plan on having about 100M files totaling about 8.5 TiB. To see how ext3
would perform with large numbers of files, I've written a test program
which creates a configurable number of files in a configurable number of
directories, reads from those files, lists them, and then deletes them.
Even up to 1M files, ext3 seems to perform well and scale linearly; the
time to execute the program on 1M files is about double the time it takes
on 0.5M files. But past 1M files it seems to show n^2 scalability. Test
details appear below.

Looking at the various options for ext3, nothing jumps out as the obvious
one to use to improve performance. Any recommendations?

Thanks!
Sean
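For reference, a minimal sketch of the kind of file-creation timing test
both posts describe -- not either poster's actual program. The two-level
hashed layout, 4k file size, and 5,000-file reporting batch are
illustrative assumptions:

/* Minimal sketch (not the original posters' program) of a file-creation
 * timing test: create NFILES small files in a two-level hashed directory
 * tree and report files/sec after every BATCH creations.  The layout,
 * file size, and batch size below are illustrative assumptions. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <time.h>

#define NFILES   1000000L
#define BATCH    5000L
#define FILESIZE 4096

int main(void)
{
    char buf[FILESIZE];
    memset(buf, 'x', sizeof(buf));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (long i = 0; i < NFILES; i++) {
        char path[256];

        /* Hash the file number into a 100x100 directory tree;
         * mkdir() failing with EEXIST after the first pass is fine. */
        snprintf(path, sizeof(path), "d%02ld", i % 100);
        mkdir(path, 0755);
        snprintf(path, sizeof(path), "d%02ld/d%02ld", i % 100, (i / 100) % 100);
        mkdir(path, 0755);
        snprintf(path, sizeof(path), "d%02ld/d%02ld/f%ld",
                 i % 100, (i / 100) % 100, i);

        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, buf, FILESIZE) != FILESIZE) { perror("write"); return 1; }
        close(fd);

        if ((i + 1) % BATCH == 0) {
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%ld files: %.0f files/sec over last batch\n",
                   i + 1, BATCH / dt);
            t0 = t1;
        }
    }
    return 0;
}

Watching the per-batch rate as the total file count grows (rather than a
cumulative average) is what makes the slow creep described above visible.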
----

John Kalucki wrote:
> Performance seems to always map directly to the number of files in the
> ext3 filesystem.
>
> After some initial run-fast time, perhaps once dirty pages begin to be
> written aggressively, for every 5,000 files added, my files created per
> second tends to drop by about one. So, depending on the variables, say
> with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop,
> then more slowly drop to ~300 files/sec at perhaps 1 million files, then
> see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc.
>
> As you'd expect, there isn't much CPU utilization, other than iowait,
> and some kjournald activity.
>
> Is this a known limitation of ext3? Is expecting to write to
> O(10^6)-O(10^7) files in something approaching constant time expecting
> too much from a filesystem? What, exactly, am I stressing to cause this
> unbounded performance degradation?

I think this is a linear search through the block groups for the new
inode allocation, which always starts at the parent directory's block
group and starts over from there each time. See find_group_other().

So if the parent's group is full and so are the next 1000 block groups,
it will search 1000 groups and find space in the 1001st. On the next
inode allocation it will re-search(!) those 1000 groups, and again find
space in the 1001st. And so on, until the 1001st is full, and then it'll
search 1001 groups and find space in the 1002nd, etc. (If I'm
remembering/reading correctly, but this does jibe with what you see.)

I've toyed with keeping track (in the parent's inode) of where the last
successful child allocation happened and starting the search there. I'm a
bit leery of how this might age, though, and I'm not sure whether it
should be on-disk or just in memory. But this behavior clearly needs some
help. I should probably just get it sent out for comment.

-Eric
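A paraphrased sketch of the scan Eric describes -- not the actual ext3
source; the function names and the toy free-inode table are invented for
illustration. The first function mimics the restart-from-parent linear
scan, the second the "remember the last successful child allocation" idea
(kept in memory only here):

/* Paraphrased sketch, not the actual ext3 source. */
#include <stdio.h>

#define NGROUPS 8

/* Linear scan: every allocation restarts at the parent's block group,
 * so block groups that are already full get rescanned each time. */
static int find_group_other_sketch(int parent_group, const int free_inodes[])
{
    for (int i = 0; i < NGROUPS; i++) {
        int group = (parent_group + i) % NGROUPS;
        if (free_inodes[group] > 0)
            return group;           /* first group with a free inode wins */
    }
    return -1;                      /* no space anywhere */
}

/* Variant: start the scan where the last allocation for this parent
 * succeeded (in-memory only; whether to persist it on disk is the open
 * question from the mail above). */
static int find_group_cached_sketch(int *last_group, int parent_group,
                                    const int free_inodes[])
{
    int start = (*last_group >= 0) ? *last_group : parent_group;
    for (int i = 0; i < NGROUPS; i++) {
        int group = (start + i) % NGROUPS;
        if (free_inodes[group] > 0) {
            *last_group = group;    /* resume here on the next allocation */
            return group;
        }
    }
    return -1;
}

int main(void)
{
    /* Groups 0-4 are full; only the tail of the filesystem has space,
     * so the linear scan repeats the same wasted work every time. */
    int free_inodes[NGROUPS] = { 0, 0, 0, 0, 0, 5, 5, 5 };
    int last = -1;

    printf("linear scan picks group %d\n",
           find_group_other_sketch(0, free_inodes));
    printf("cached scan picks group %d\n",
           find_group_cached_sketch(&last, 0, free_inodes));
    return 0;
}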
----

Ric Wheeler wrote:
> Eric Sandeen wrote:
>> John Kalucki wrote:
>>> Performance seems to always map directly to the number of files in
>>> the ext3 filesystem.
>>>
>>> After some initial run-fast time, perhaps once dirty pages begin to
>>> be written aggressively, for every 5,000 files added, my files
>>> created per second tends to drop by about one. So, depending on the
>>> variables, say with 6 RAID10 spindles, I might start at ~700
>>> files/sec, quickly drop, then more slowly drop to ~300 files/sec at
>>> perhaps 1 million files, then see 299 files/sec for the next 5,000
>>> creations, 298 files/sec, etc. etc.
>>>
>>> As you'd expect, there isn't much CPU utilization, other than
>>> iowait, and some kjournald activity.
>>>
>>> Is this a known limitation of ext3? Is expecting to write to
>>> O(10^6)-O(10^7) files in something approaching constant time
>>> expecting too much from a filesystem? What, exactly, am I stressing
>>> to cause this unbounded performance degradation?
>>
>> I think this is a linear search through the block groups for the new
>> inode allocation, which always starts at the parent directory's block
>> group and starts over from there each time. See find_group_other().
>>
>> So if the parent's group is full and so are the next 1000 block groups,
>> it will search 1000 groups and find space in the 1001st. On the next
>> inode allocation it will re-search(!) those 1000 groups, and again find
>> space in the 1001st. And so on, until the 1001st is full, and then
>> it'll search 1001 groups and find space in the 1002nd, etc. (If I'm
>> remembering/reading correctly, but this does jibe with what you see.)
>>
>> I've toyed with keeping track (in the parent's inode) of where the last
>> successful child allocation happened and starting the search there. I'm
>> a bit leery of how this might age, though, and I'm not sure whether it
>> should be on-disk or just in memory. But this behavior clearly needs
>> some help. I should probably just get it sent out for comment.
>>
>> -Eric
>
> I run a very similar test, but normally with a synchronous write
> workload (i.e., fsync before close). In my testing, you will see a
> slow but gradual decline in files/sec. For example, on a 1TB S-ATA
> drive, the latest test run started off at a rate of 22 files/sec (each
> file is 40k) and is currently chugging along at a bit over 17
> files/sec, having hit 2.8 million files in one directory. I am using
> the ext3 run to get a baseline for a similar run of xfs and btrfs.
>
> One other random tuning thought: you can help by writing into separate
> directories, but you will need to make sure that you don't produce a
> random write pattern when you select your target subdirectory. The use
> case mentioned uses a hashed directory structure, which is fine, but
> you want to hash in a way that writes into a shared subdirectory for
> some period of time (say, rotate every X files or Y seconds). The
> easiest way to do this is to use a GUID with a time stamp and hash on
> the time stamp bits.
>
> Note that there is a multi-threaded performance bug in ext3 (Josef
> Bacik had looked at fixing this) which throttles writes/sec down to
> around 230 when you do synchronous transactions, so you might be
> hitting that as well.
>
> ric

Unfortunately, I don't have the opportunity to limit the directories. My
application is taking random-ish data and organizing it into logical
groups for subsequent quick reading.
But I did take your suggestion into account, and it contains what seems
to be the important nugget -- too many active directories makes a bad
situation worse. Still, my test reaches a steady state of active
directories pretty quickly -- or so I'd like to think -- and the
performance does indeed continue to creep downwards.

I'm doing everything single-threaded. Introducing a second thread seems
to be an immediate disaster, even though I'm striped across 3 disks.
Unfortunate. Perhaps moving the journal to another filesystem would allow
better multi-threaded throughput, but I'm not sure that this is important
to me.

xfs, zfs, btrfs, and reiser could be attractive for my use case.

Thanks for your response,
John
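For what it's worth, a minimal sketch of the time-bucketed directory
selection Ric suggests: derive the target subdirectory from the timestamp
rather than from random ID bits, so every file created within one
rotation window lands in the same directory. The five-minute window and
1024-bucket cap are invented numbers:

/* Sketch of time-bucketed subdirectory selection: all files created
 * within one ROTATE_SECONDS window go to the same directory, so
 * creations stay clustered instead of scattering across the whole tree.
 * The window length and bucket count are illustrative assumptions. */
#include <stdio.h>
#include <time.h>

#define ROTATE_SECONDS 300   /* switch target directory every 5 minutes */
#define NBUCKETS       1024  /* reuse at most this many subdirectories */

static void target_subdir(char *out, size_t outlen)
{
    time_t now = time(NULL);
    long bucket = (long)(now / ROTATE_SECONDS) % NBUCKETS;
    snprintf(out, outlen, "data/%04ld", bucket);
}

int main(void)
{
    char dir[64];
    target_subdir(dir, sizeof(dir));
    printf("current target directory: %s\n", dir);
    return 0;
}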