thr3ads.net - Lustre discuss - [Lustre-discuss] Large directory performance [Sep 2010]

If this information is useful, please help other people find it:
Share via:

Michael Robbert

2010-Sep-10 18:11 UTC

[Lustre-discuss] Large directory performance

We have been struggling with our Lustre performance for some time now especially
with large directories. I recently did some informal benchmarking (on a live
system so I know results are not scientifically valid) and noticed a huge drop
in performance of reads(stat operations) past 20k files in a single directory.
I''m using bonnie++, disabling IO testing (-s 0) and just creating,
reading, and deleting 40kb files in a single directory. I''ve done this
on for directory sizes of 2,000 to 40,000 files. Create performance is a flat
line of ~150 files/sec across the board. Delete performance is all over the
place, but no higher than 3,000 files/sec. The really interesting data point is
read performance, which for these tests is just a stat of the file not reading
data. Starting with the smaller directories it is relatively consistent at just
below 2,500 files/sec, but when I jump from 20,000 files to 30,000 files the
performance drops to around 100 files/sec. We were assuming this was somewhat
expected behavior and are in the process of trying to get our users to change
their code. Then yesterday I was browsing the Lustre Operations Manual and found
section 33.8 that says Lustre is tested with directories as large as 10 million
files in a single directory and still get lookups at a rate of 5,000 files/sec.
That leaves me wondering 2 things. How can we get 5,000 files/sec for anything
and why is our performance dropping off so suddenly at after 20k files?

Here is our setup:
All IO servers are Dell PowerEdge 2950s. 2 8-core sockets with X5355  @ 2.66GHz
and 16Gb of RAM.
The data is on DDN S2A 9550s with 8+2 RAID configuration connected directly with
4Gb Fibre channel.
They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp

As a side note the users code is Parflow, developed at LLNL. The files are SILO
files. We have as many as 1.4 million files in a single directory and we now
have half a billion files that we need to deal with in one way or another. The
code has already been modified to split the files on newer runs until multiple
subdirectories, but we''re still dealing with 10s of thousands of files
in a single directory. The users have been able to run these data sets on Lustre
systems at LLNL 3 orders of magnitude faster.

Thanks,
Mike Robbert
HPC & Networking Engineer
Colorado School of Mines

Hedges, Richard M.

2010-Sep-10 18:16 UTC

head link

[Lustre-discuss] Large directory performance

It will continue downward as the number of files in the directory increase.
Interestingly, GPFS stat performance increased as the number of files
increased.  My tests were on 128 nodes * 8 processes/node * 10 - 500 files
per process.

- Richard


On 9/10/10 11:11 AM, "Michael Robbert" <mrobbert at mines.edu>
wrote:
> We have been struggling with our Lustre performance for some time now
> especially with large directories. I recently did some informal
benchmarking
> (on a live system so I know results are not scientifically valid) and
noticed
> a huge drop in performance of reads(stat operations) past 20k files in a
> single directory. I''m using bonnie++, disabling IO testing (-s 0)
and just
> creating, reading, and deleting 40kb files in a single directory.
I''ve done
> this on for directory sizes of 2,000 to 40,000 files. Create performance is
a
> flat line of ~150 files/sec across the board. Delete performance is all
over
> the place, but no higher than 3,000 files/sec. The really interesting data
> point is read performance, which for these tests is just a stat of the file
> not reading data. Starting with the smaller directories it is relatively
> consistent at just below 2,500 files/sec, but when I jump from 20,000 files
to
> 30,000 files the performance drops to around 100 files/sec. We were
assuming
> this w
>  as somewhat expected behavior and are in the process of trying to get our
> users to change their code. Then yesterday I was browsing the Lustre
> Operations Manual and found section 33.8 that says Lustre is tested with
> directories as large as 10 million files in a single directory and still
get
> lookups at a rate of 5,000 files/sec. That leaves me wondering 2 things.
How
> can we get 5,000 files/sec for anything and why is our performance dropping
> off so suddenly at after 20k files?
> 
> Here is our setup:
> All IO servers are Dell PowerEdge 2950s. 2 8-core sockets with X5355  @
> 2.66GHz and 16Gb of RAM.
> The data is on DDN S2A 9550s with 8+2 RAID configuration connected directly
> with 4Gb Fibre channel.
> They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
> 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp
> 
> As a side note the users code is Parflow, developed at LLNL. The files are
> SILO files. We have as many as 1.4 million files in a single directory and
we
> now have half a billion files that we need to deal with in one way or
another.
> The code has already been modified to split the files on newer runs until
> multiple subdirectories, but we''re still dealing with 10s of
thousands of
> files in a single directory. The users have been able to run these data
sets
> on Lustre systems at LLNL 3 orders of magnitude faster.
> 
> Thanks,
> Mike Robbert
> HPC & Networking Engineer
> Colorado School of Mines
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://*lists.lustre.org/mailman/listinfo/lustre-discuss
> 

===================================================
Richard Hedges
Customer Support and Test - File Systems Project
Development Environment Group - Livermore Computing
Lawrence Livermore National Laboratory
7000 East Avenue, MS L-557
Livermore, CA    94551

v:    (925) 423-2699
f:    (925) 423-6961
E:    richard-hedges at llnl.gov

Michael Robbert

2010-Sep-10 18:49 UTC

head link

[Lustre-discuss] Large directory performance

Richard,
Are you talking about bonnie++ performance or Parflow performance? And
doesn''t this fly in the face of the Lustre Operation Manual that seems
to indicate that performance should be fun up to at least 10 million files in a
single directory? How do you reconcile your results with the fact that there are
users running at LLNL with up to 1.4 million files in a single directory?

Thanks,
Mike

On Sep 10, 2010, at 12:16 PM, Hedges, Richard M. wrote:
> It will continue downward as the number of files in the directory increase.
> Interestingly, GPFS stat performance increased as the number of files
> increased.  My tests were on 128 nodes * 8 processes/node * 10 - 500 files
> per process.
> 
> - Richard
> 
> 
> On 9/10/10 11:11 AM, "Michael Robbert" <mrobbert at
mines.edu> wrote:
> 
>> We have been struggling with our Lustre performance for some time now
>> especially with large directories. I recently did some informal
benchmarking
>> (on a live system so I know results are not scientifically valid) and
noticed
>> a huge drop in performance of reads(stat operations) past 20k files in
a
>> single directory. I''m using bonnie++, disabling IO testing (-s
0) and just
>> creating, reading, and deleting 40kb files in a single directory.
I''ve done
>> this on for directory sizes of 2,000 to 40,000 files. Create
performance is a
>> flat line of ~150 files/sec across the board. Delete performance is all
over
>> the place, but no higher than 3,000 files/sec. The really interesting
data
>> point is read performance, which for these tests is just a stat of the
file
>> not reading data. Starting with the smaller directories it is
relatively
>> consistent at just below 2,500 files/sec, but when I jump from 20,000
files to
>> 30,000 files the performance drops to around 100 files/sec. We were
assuming
>> this w
>> as somewhat expected behavior and are in the process of trying to get
our
>> users to change their code. Then yesterday I was browsing the Lustre
>> Operations Manual and found section 33.8 that says Lustre is tested
with
>> directories as large as 10 million files in a single directory and
still get
>> lookups at a rate of 5,000 files/sec. That leaves me wondering 2
things. How
>> can we get 5,000 files/sec for anything and why is our performance
dropping
>> off so suddenly at after 20k files?
>> 
>> Here is our setup:
>> All IO servers are Dell PowerEdge 2950s. 2 8-core sockets with X5355  @
>> 2.66GHz and 16Gb of RAM.
>> The data is on DDN S2A 9550s with 8+2 RAID configuration connected
directly
>> with 4Gb Fibre channel.
>> They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
>> 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp
>> 
>> As a side note the users code is Parflow, developed at LLNL. The files
are
>> SILO files. We have as many as 1.4 million files in a single directory
and we
>> now have half a billion files that we need to deal with in one way or
another.
>> The code has already been modified to split the files on newer runs
until
>> multiple subdirectories, but we''re still dealing with 10s of
thousands of
>> files in a single directory. The users have been able to run these data
sets
>> on Lustre systems at LLNL 3 orders of magnitude faster.
>> 
>> Thanks,
>> Mike Robbert
>> HPC & Networking Engineer
>> Colorado School of Mines
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://*lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
> 
> 
> ===================================================> 
> Richard Hedges
> Customer Support and Test - File Systems Project
> Development Environment Group - Livermore Computing
> Lawrence Livermore National Laboratory
> 7000 East Avenue, MS L-557
> Livermore, CA    94551
> 
> v:    (925) 423-2699
> f:    (925) 423-6961
> E:    richard-hedges at llnl.gov
>

Andreas Dilger

2010-Sep-10 22:26 UTC

head link

[Lustre-discuss] Large directory performance

On 2010-09-10, at 12:11, Michael Robbert wrote:> Create performance is a flat line of ~150 files/sec across the board.
Delete performance is all over the place, but no higher than 3,000 files/sec...
Then yesterday I was browsing the Lustre Operations Manual and found section
33.8 that says Lustre is tested with directories as large as 10 million files in
a single directory and still get lookups at a rate of 5,000 files/sec. That
leaves me wondering 2 things. How can we get 5,000 files/sec for anything and
why is our performance dropping off so suddenly at after 20k files?
> 
> Here is our setup:
> All IO servers are Dell PowerEdge 2950s. 2 8-core sockets with X5355  @
2.66GHz and 16Gb of RAM.
> The data is on DDN S2A 9550s with 8+2 RAID configuration connected directly
with 4Gb Fibre channel.
Are you using the DDN 9550s for the MDT?  That would be a bad configuration,
because they can only be configured with RAID-6, and would explain why you are
seeing such bad performance.  For the MDT you always want to have RAID-1+0
storage.  Potentially, for every 512-byte inode written to disk you need to
write many times that much data inside the RAID-6 array to keep the parity
correct.

For large filesystems, sites have used 12 or 24 small SAS disks (15k RPM) in
RAID-1+0 to get high IOPS performance for the MDT.
> We have as many as 1.4 million files in a single directory and we now have
half a billion files that we need to deal with in one way or another.
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

Bernd Schubert

2010-Sep-10 23:32 UTC

head link

[Lustre-discuss] Large directory performance

On Saturday, September 11, 2010, Andreas Dilger wrote:> On 2010-09-10, at 12:11, Michael Robbert wrote:
> > Create performance is a flat line of ~150 files/sec across the board.
> > Delete performance is all over the place, but no higher than 3,000
> > files/sec... Then yesterday I was browsing the Lustre Operations
Manual
> > and found section 33.8 that says Lustre is tested with directories as
> > large as 10 million files in a single directory and still get lookups
at
> > a rate of 5,000 files/sec. That leaves me wondering 2 things. How can
we
> > get 5,000 files/sec for anything and why is our performance dropping
off
> > so suddenly at after 20k files?
> > 
> > Here is our setup:
> > All IO servers are Dell PowerEdge 2950s. 2 8-core sockets with X5355 
@
> > 2.66GHz and 16Gb of RAM. The data is on DDN S2A 9550s with 8+2 RAID
> > configuration connected directly with 4Gb Fibre channel.
> 
> Are you using the DDN 9550s for the MDT?  That would be a bad
> configuration, because they can only be configured with RAID-6, and would
> explain why you are seeing such bad performance.  For the MDT you always
Unfortunately, we failed to copy the scratch MDT in a reasonable time so far. 
Copying several hundreds of million files turned out to take ages ;) But I 
guess Mike did the benchmarks for the other filesystem with an EF3010.
> > We have as many as 1.4 million files in a single directory and we now
> > have half a billion files that we need to deal with in one way or
> > another.
Mike, is there a chance you can try which rate acp reports?

http://oss.oracle.com/~mason/acp/

Also could you please send me your exact bonnie line or script? We could try 
to reproduce it on and idle test 9550 with a 6620 for metada (the 6620 is 
slower for that than the ef3010).


Thanks,
Bernd

-- 
Bernd Schubert
DataDirect Networks

Andreas Dilger

2010-Sep-10 23:38 UTC

head link

[Lustre-discuss] Large directory performance

On 2010-09-10, at 17:32, Bernd Schubert wrote:> On Saturday, September 11, 2010, Andreas Dilger wrote:
>> 
>> Are you using the DDN 9550s for the MDT?  That would be a bad
>> configuration, because they can only be configured with RAID-6, and
would
>> explain why you are seeing such bad performance.  For the MDT you
always
> 
> Unfortunately, we failed to copy the scratch MDT in a reasonable time so
far.
> Copying several hundreds of million files turned out to take ages ;) But I 
> guess Mike did the benchmarks for the other filesystem with an EF3010.
Probably for a straight transfer of the MDT, it would be MUCH faster to do a
straight "dd" of the filesystem over to the new LUN (assuming it is at
least as large as the original LUN).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

Michael Robbert

2010-Sep-11 20:41 UTC

head link

[Lustre-discuss] Large directory performance

On Sep 10, 2010, at 5:32 PM, Bernd Schubert wrote:
> On Saturday, September 11, 2010, Andreas Dilger wrote:
>> On 2010-09-10, at 12:11, Michael Robbert wrote:
>>> Create performance is a flat line of ~150 files/sec across the
board.
>>> Delete performance is all over the place, but no higher than 3,000
>>> files/sec... Then yesterday I was browsing the Lustre Operations
Manual
>>> and found section 33.8 that says Lustre is tested with directories
as
>>> large as 10 million files in a single directory and still get
lookups at
>>> a rate of 5,000 files/sec. That leaves me wondering 2 things. How
can we
>>> get 5,000 files/sec for anything and why is our performance
dropping off
>>> so suddenly at after 20k files?
>>> 
>>> Here is our setup:
>>> All IO servers are Dell PowerEdge 2950s. 2 8-core sockets with
X5355  @
>>> 2.66GHz and 16Gb of RAM. The data is on DDN S2A 9550s with 8+2 RAID
>>> configuration connected directly with 4Gb Fibre channel.
>> 
>> Are you using the DDN 9550s for the MDT?  That would be a bad
>> configuration, because they can only be configured with RAID-6, and
would
>> explain why you are seeing such bad performance.  For the MDT you
always
> 
> Unfortunately, we failed to copy the scratch MDT in a reasonable time so
far.
> Copying several hundreds of million files turned out to take ages ;) But I 
> guess Mike did the benchmarks for the other filesystem with an EF3010.
The benchmarks listed above are for our scratch filesystem, whose MDT is on the
9550. I don''t know why I didn''t mention the benchmarks that I
also ran on our home filesystem whose MDT was recently moved to the EF3010 with
RAID 1+0 on 6 SAS disks. The other 6 disks in the EF3010 are waiting for when we
can move the scratch MDT there. Anyways, the benchmarks on home were actually
worse. Create performance was about the same, but read performance was in the
low hundreds. The command line was:
./bonnie++ -d $dir -s 0 -n $size:40000:40000:1
Where $dir was a directory on the filesystem being tested and $size was the
number of files in thousands (5, 10, 20, 30)

A dd of the MDT wasn''t possible because the original LUN was nearly 5Tb
(only 35Gb used), but the new LUN is just over 1Tb.
> 
>>> We have as many as 1.4 million files in a single directory and we
now
>>> have half a billion files that we need to deal with in one way or
>>> another.
> 
> Mike, is there a chance you can try which rate acp reports?
> 
> http://oss.oracle.com/~mason/acp/
> 
> Also could you please send me your exact bonnie line or script? We could
try
> to reproduce it on and idle test 9550 with a 6620 for metada (the 6620 is 
> slower for that than the ef3010).
I have downloaded and compiled acp. I have started a copy of one of 1.6 million
file directories. After 1 hour it is still reading files from a top level
directory with only 122k files and hasn''t written anything. The only
option used on the command line was -v so I could see what it was doing.


Thanks,
Mike

Peter Grandi

2010-Sep-13 14:22 UTC

head link

[Lustre-discuss] Large directory performance

> We have been struggling with our Lustre performance for some
> time now especially with large directories.
Are you assuming that Lustre has been designed for good
performance with lots of (probably tiny) files in large
directories?
> I recently did some informal benchmarking (on a live system so
> I know results are not scientifically valid) and noticed a huge
> drop in performance of reads(stat operations) past 20k files in
> a single directory.
Is a a benchmark really needed to figure that out?
> I''m using bonnie++, disabling IO testing (-s 0) and just
> creating, reading, and deleting 40kb files in a single
> directory.
What do you think Bonnie++ is a benchmark of?
> [ ... ] The really interesting data point is read performance,
> which for these tests is just a stat of the file not reading
> data. Starting with the smaller directories it is relatively
> consistent at just below 2,500 files/sec, but when I jump from
> 20,000 files to 30,000 files the performance drops to around
> 100 files/sec.
Why is that surprising?
> [ ... ] are in the process of trying to get our users to
> change their code. [ ... ]
But as mentioned below it is being changed in a way that will
help but not a lot.
> Then yesterday I was browsing the Lustre Operations Manual
Did you read it before designing and setting up your system?

There are relevant bits of advice in 1.4.2.2 and 10.1.1-4 for
example (some of them objectionable, such as recommending RAID6
for data storage, without the necessary qualifications at the
very least).
> and found section 33.8 that says Lustre is tested with
> directories as large as 10 million files in a single directory
Why would "tested" imply "works real fast in every possible,
including really stupid, setup"?
> and still get lookups at a rate of 5,000 files/sec.
What sort of "lookups" do you think they were talking about?

On what sort of storage systems do you think you get 5,000 random
metadata operations/s?

Can you explain how to get 5,000 *random* metadata lookup/s from
disks that can do 50-100 random IOP/s each?
> That leaves me wondering 2 things. How can we get 5,000
> files/sec for anything and why is our performance dropping off
> so suddenly at after 20k files?
Why do you need to wonder?

Have you read about new amazing techniques like caching in
RAM/flash and scaling via RAID?

Have your read the extensive discussions of metadata and data
performance in the Lustre docs?
> Here is our setup: All IO servers are Dell PowerEdge 2950s. 2
> 8-core sockets with X5355 @ 2.66GHz and 16Gb of RAM. The data
> is on DDN S2A 9550s with 8+2 RAID configuration connected
> directly with 4Gb Fibre channel.
Why do you describe where the data is when you have so far talked
only about the netadata?

Do you have a good idea of the differences (and the different
workloads a described in the Lustre manual) between MDS/MDTs and
OSSes/OSTs?

ALso, if you have a highly parallel program that deals with what
look like millions of tiny files (which looks like an appalling
misdesign to me), why do you run it on a RAID3 (of all things)
storage system?

If you storing the metadata for Lustre on the same storage system
as the data *and* it is a RAID3 setup, WHY WHY WHY?

Why haven''t you hired Sun/Oracle consultants to design and
configure your metadata and data storage systems?
> They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
> 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp
Why are you running a very old version of Lustre (and on RHEL45
of all things, but that is less relevant)?

Are your running the servers in 32b or 64b mode?
> As a side note the users code is Parflow, developed at LLNL.
> The files are SILO files. We have as many as 1.4 million files
> in a single directory
Why hasn''t LLNL hired consultants who understand the differences
between file systems and DBMSes to help design ParFlow?
> and we now have half a billion files that we need to deal with
> in one way or another.
To me that means that the application is appallingly written
(there are a lot of those about).

Then perhaps your setup is entirely inappropriate for most types
of workload and even more so for metadata intensive ones, and
maybe Lustre was designed for optimal performnance on large
streaming workloads, so what looks to me an appallingly
misdesigned application works particularly badly in your case.
> The code has already been modified to split the files on newer
> runs until multiple subdirectories, but we''re still dealing
> with 10s of thousands of files in a single directory.
To me that''s still appalling. There are very good reasons why
file systems and DBMSes both exist, and they are not the same.
> The users have been able to run these data sets on Lustre
> systems at LLNL 3 orders of magnitude faster.
Do you think that LLNL have metadata storage and caches as weak
as yours?

Given how the application is "designed", would it suffer a
colossal performance drop at LLNL too on a suitably larger data
set?

Have you realized by now that Lustre performance is very, very
anisotropic in the space of possible setups and applications?

Joe Landman

2010-Sep-13 16:45 UTC

head link

[Lustre-discuss] Large directory performance

Michael Robbert wrote:> We have been struggling with our Lustre performance for some time now
> especially with large directories. I recently did some informal
> benchmarking (on a live system so I know results are not
> scientifically valid) and noticed a huge drop in performance of
> reads(stat operations) past 20k files in a single directory. I''m
> using bonnie++, disabling IO testing (-s 0) and just creating,
> reading, and deleting 40kb files in a single directory. I''ve done
> this on for directory sizes of 2,000 to 40,000 files. Create
> performance is a flat line of ~150 files/sec across the board. Delete
> performance is all over the place, but no higher than 3,000
> files/sec. The really interesting data point is read performance,
> which for these tests is just a stat of the file not reading data.
> Starting with the smaller directories it is relatively consistent at
> just below 2,500 files/sec, but when I jump from 20,000 files to
> 30,000 files the performance drops to around 100 files/sec. We were
Think small random RAID6 reads.  Performance craters when you do this.
> assuming this w as somewhat expected behavior and are in the process
> of trying to get our users to change their code. Then yesterday I was
> browsing the Lustre Operations Manual and found section 33.8 that
> says Lustre is tested with directories as large as 10 million files
> in a single directory and still get lookups at a rate of 5,000
> files/sec. That leaves me wondering 2 things. How can we get 5,000
> files/sec for anything and why is our performance dropping off so
> suddenly at after 20k files?
Change your MDT to be on a different machine.  A very fast RAID10.

I''ve seen fast SAS 15k recommended, but they aren''t the only
options.

What you want are very high random read IOPs.

> Here is our setup: All IO servers are Dell PowerEdge 2950s. 2 8-core
> sockets with X5355  @ 2.66GHz and 16Gb of RAM. The data is on DDN S2A
> 9550s with 8+2 RAID configuration connected directly with 4Gb Fibre
> channel. They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
> 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp
Hmmm... thats a RHEL5 kernel, not a RHEL4 kernel.  Are you sure you have 
4.5?
> 
> As a side note the users code is Parflow, developed at LLNL. The
> files are SILO files. We have as many as 1.4 million files in a
> single directory and we now have half a billion files that we need to
> deal with in one way or another. The code has already been modified
> to split the files on newer runs until multiple subdirectories, but
> we''re still dealing with 10s of thousands of files in a single
> directory. The users have been able to run these data sets on Lustre
> systems at LLNL 3 orders of magnitude faster.
This shouldn''t be a problem for a well designed system.

Regards,

Joe


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

Bob Ball

2010-Sep-13 17:31 UTC

head link

[Lustre-discuss] Large directory performance

Peter, can you comment on what you said here about RAID6?  Are there 
Twiki or other entries somewhere about this?

There are relevant bits of advice in 1.4.2.2 and 10.1.1-4 for
example (some of them objectionable, such as recommending RAID6
for data storage, without the necessary qualifications at the
very least).


Thanks,
bob

Bernd Schubert

2010-Sep-13 17:46 UTC

head link

[Lustre-discuss] Large directory performance

> > [ ... ] The really interesting data point is read performance,
> > which for these tests is just a stat of the file not reading
> > data. Starting with the smaller directories it is relatively
> > consistent at just below 2,500 files/sec, but when I jump from
> > 20,000 files to 30,000 files the performance drops to around
> > 100 files/sec.
> 
> Why is that surprising?
No, with dirindex 30000 files are not that much. In fact I could reproduce 
Mikes numbers also with smaller directory sizes. But I could bump it for a 
single node to consistently 30000 after increasing the LRU_SIZE. Now people 
might wonder why this matters if there is lru-auto-resize. Simple answer, 
several DDN customers including CSM run into serious issues with lru-auto-
resize enabled. Not all of those issues are resolved even in latest Lustre 
releases. However, I definitely need to work on patch to be able to 
disable/enable it on demand (so far each and every network reconnection resets 
it to default, so something like a cron script it required on clients to set 
the value one wants to have).
> 
> What sort of "lookups" do you think they were talking about?
> 
> On what sort of storage systems do you think you get 5,000 random
> metadata operations/s?
Really large directories suffer from the htree dirindex implementation 
returning random inode numbers instead of sequential inode numbers for 
readdir(). And that is rather sub-optimal for cp/tar/''ls
-l''/etc
> 
> Can you explain how to get 5,000 *random* metadata lookup/s from
> disks that can do 50-100 random IOP/s each?
> 
> > That leaves me wondering 2 things. How can we get 5,000
> > files/sec for anything and why is our performance dropping off
> > so suddenly at after 20k files?
> 
> Why do you need to wonder?
I would expect that performance drops off after in between 100K and 1 million 
files per directory, but not 20000 yet. 
> > They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
> > 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp
> 
> Why are you running a very old version of Lustre (and on RHEL45
> of all things, but that is less relevant)?
1.6.7.2-ddnX  is still maintained and 1.8 also does not provide better 
metadata performance. Tests and new systems show that 1.8.3-ddn3.2 runs rather 
stable and vanilla 1.8.4 also so far seems to be mostly fine. So we start to 
encourage people to update. However, from my personal point of view, 1.8.2 was 
a draw-back for stability compared to 1.8.1.1 and it took some time to find 
out all issues. Some bugs CSM is running sometimes into, are also not yet 
fixed in 1.8. Introducing possible and unknown new issues is mostly not an 
option for production systems.
> 
> Are your running the servers in 32b or 64b mode?
> 
> > As a side note the users code is Parflow, developed at LLNL.
> > The files are SILO files. We have as many as 1.4 million files
> > in a single directory
> 
> Why hasn''t LLNL hired consultants who understand the differences
> between file systems and DBMSes to help design ParFlow?
With all the knowledgeable people at LLNL, I have no idea how such an 
application ever could be written.
> > The code has already been modified to split the files on newer
> > runs until multiple subdirectories, but we''re still dealing
> > with 10s of thousands of files in a single directory.
> 
> To me that''s still appalling. There are very good reasons why
> file systems and DBMSes both exist, and they are not the same.
> 
> > The users have been able to run these data sets on Lustre
> > systems at LLNL 3 orders of magnitude faster.
> 
> Do you think that LLNL have metadata storage and caches as weak
> as yours?
I definitely know that LLNL was working and pushing lru resize into 1.6.5. 
That might explain why. Unfortunately, as I said before, that brought up some 
serious new issues not solved yet until now.

I also entirely agree, that the application is not suitable for a Lustre 
filesystem, even if LLNL should have found some workarounds.

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

Michael Robbert

2010-Sep-13 22:48 UTC

head link

[Lustre-discuss] Large directory performance

On Sep 11, 2010, at 2:41 PM, Michael Robbert wrote:> 
>> Mike, is there a chance you can try which rate acp reports?
>> 
>> http://oss.oracle.com/~mason/acp/
>> 
>> Also could you please send me your exact bonnie line or script? We
could try
>> to reproduce it on and idle test 9550 with a 6620 for metada (the 6620
is
>> slower for that than the ef3010).
> 
> I have downloaded and compiled acp. I have started a copy of one of 1.6
million file directories. After 1 hour it is still reading files from a top
level directory with only 122k files and hasn''t written anything. The
only option used on the command line was -v so I could see what it was doing.
> 
> What exactly is it that we''re trying to get out of acp? Yesterday one
of my "tar pipe" copies finished earlier than expected. It happened
while acp was running on another directory which I know should have nothing to
do with the other, but then I started another copy yesterday and it finished by
this morning (should have taken 2 days). At some point in this process I
realized that the write portion of acp appears to not be implemented so all it
does is read data. I am wondering if it is causing data to be cached, at a
faster rate than tar can read, and therefore helping with the speed of my
copying of data. On the other hand processes that I''ve started today
appear to be going just as slow as before (maybe a little faster 300-500 files
per minute). I''m also beginning to wonder how much of an impact the
work of other users is affecting this. If that is the case I can bring some of
it to a halt since some of it is the users with this large data as they are
attempting to clean up their old data. I would like to know how I can monitor
that. In the past I''ve seen the load average of the MDS to go up to 20
or 30. It is only at about 5 right now. How high does it have to go before
overall performance is affected? or is that even an indicator I should be
looking at?
I''m trying to read as much Lustre documentation as I can, mostly the
Lustre Operations Manual and old mailing list entries, but most of it is about
OSS/OST performance and our problem seems to only be with the MDS/MDT. Any
pointers to where I can learn more about what happens on the MDS. Especially
anything about how it caches data.

Thanks,
Mike

Maybe Matching Threads

Search for more maybe matching threads

Lustre discuss - Sep 2010 - Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

[Lustre-discuss] Large directory performance

Maybe Matching Threads