siva murugan
2008-Dec-05 05:28 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
We are trying to adopt Lustre in one of our heavily read/write-intensive infrastructures (daily writes: 8 million files, 1TB). The average size of the files written is 1KB. (I know Lustre doesn't scale well for small files, but I wanted to analyze whether adoption is feasible.)

Following are some of the tests we ran to see the difference between writing large and small files.

MDT - 1
OST - 13 (also act as nfsserver)
Clients access the Lustre filesystem via NFS (not patchless clients)

Test 1:
Number of clients - 10
Dataset size read/written - 971MB (per client)
Number of files in the dataset - 14000
Total data written - 10GB
Time taken - 1390s

Test 2:
Number of clients - 10
Dataset size read/written - 1001MB (per client)
Number of files in the dataset - 4
Total data written - 10GB
Time taken - 215s

Test 3:
Number of clients - 10
Dataset size read/written - 53MB (per client)
Number of files in the dataset - 14000
Total data written - 530MB
Time taken - 1027s

The MDT was heavily loaded during Test 3 (load average > 25). Since the files in Test 3 are small (1KB) and the number of files written is very large (14000 x 10 clients), the MDT obviously gets loaded allocating inodes, even though the total data written in Test 3 is only 530MB.

Question: Is there any parameter I can tune on the MDT to increase performance when writing a large number of small files?

Please help.
Balagopal Pillai
2008-Dec-05 11:21 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
"OST - 13 (also act as nfsserver)"

Then I am assuming that your OSS is also a Lustre client. It might be useful to search through this list to find out the potential pitfalls of mounting Lustre volumes on an OSS.

siva murugan wrote:
> We are trying to adopt Lustre in one of our heavily read/write-intensive
> infrastructures (daily writes: 8 million files, 1TB). The average size
> of the files written is 1KB.
anil kumar
2008-Dec-08 04:18 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Hi,

Most of the reported issues with running a client and an OSS on the same server relate to insufficient memory; in our case more than 50% of the memory is always free on the OSS, as we have 32GB on the MDT/OSS nodes. We don't see performance issues while the file size is large and the number of files is small; the performance issues start as the number of files increases. So it might be the round-trip time that is causing the problem, as each transaction needs to go to the MGS/MDT and then again back to the OSS.

Example:
Case 1: If we try to write a data set of 1GB with 200 files, write, delete and read are fast.
Case 2: If we try to write a data set of 1GB with 14000 files, write, delete and read are very slow.

Most of the issues we have seen reported relate to small files, with disabling debug mode suggested as a workaround to improve performance. In our case, however, disabling debug mode did not improve performance (the commands involved are sketched at the end of this message).

Please let us know if there are any other options to improve performance when writing a large number of small files.

Thanks,
Anil

On Fri, Dec 5, 2008 at 4:51 PM, Balagopal Pillai <pillai at mathstat.dal.ca> wrote:
> Then I am assuming that your OSS is also a Lustre client. It might be
> useful to search through this list to find out the potential pitfalls
> of mounting Lustre volumes on an OSS.
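For reference, "disabling debug mode" here means clearing the Lustre kernel debug mask on the servers and clients. A minimal sketch, assuming the 1.6-era interfaces (verify the exact path against your release):

  # clear the debug mask at runtime via /proc
  echo 0 > /proc/sys/lnet/debug

  # or, equivalently, via sysctl if lnet is registered there
  sysctl -w lnet.debug=0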
Brian J. Murrell
2008-Dec-08 14:17 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
On Mon, 2008-12-08 at 09:48 +0530, anil kumar wrote:
> Example:
> Case 1: If we try to write a data set of 1GB with 200 files, write,
> delete and read are fast.
> Case 2: If we try to write a data set of 1GB with 14000 files, write,
> delete and read are very slow.

Yes, this is not surprising, for various values of "slow". Lustre is known to perform much better on large files, as that is the typical HPC workload.

> Most of the issues we have seen reported relate to small files, with
> disabling debug mode suggested as a workaround to improve performance.
> In our case, however, disabling debug mode did not improve performance.

There is a bug in our BZ tracking small-file performance issues. I'm not sure it has seen much action lately, though. I don't recall the number, but you might want to subscribe to that bug.

> Please let us know if there are any other options to improve performance
> when writing a large number of small files.

If you have already reviewed the archives of this list and applied all of the various remedies for small files, and you have plenty of memory in your MDS, then there is not much else you can do, I'm afraid. We do recognize that small files are an area in which we don't perform as well as we do on large files. As/if demand for small-file performance increases, it will bubble up our list of priorities.

b.
Kevin Van Maren
2008-Dec-08 14:58 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Brian J. Murrell wrote:
> On Mon, 2008-12-08 at 09:48 +0530, anil kumar wrote:
>> Example:
>> Case 1: If we try to write a data set of 1GB with 200 files, write,
>> delete and read are fast.
>> Case 2: If we try to write a data set of 1GB with 14000 files, write,
>> delete and read are very slow.

Just to be clear, 1GB / 14000 = ~70KB/file, while the first case is 5MB/file? What does your hardware look like (interconnect, MDT and OST RAID configurations, OST/OSS count, how many clients, etc.)? Do you have quantitative results, or just "faster" and "slow"?

Kevin
Daire Byrne
2008-Dec-08 15:23 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Siva,

In the past, when we have had workloads with lots of small files where each client/job has a unique dataset, we have used disk image files on top of a Lustre filesystem to store them (a rough sketch of the recipe is at the end of this message). Each image file is stored on a single OST, so it reduces the overhead of going to the MDS every time - access becomes a seek within the image file. We have even used squashfs archives for write-once/read-often small-file workloads, which has the added benefit of saving disk space. However, if many clients need simultaneous write access to the dataset, this isn't going to work.

Another trick we used for small files was to cache them on a Lustre client which then exported them over NFS. Putting plenty of RAM in the NFS exporter meant that we could hold a lot of metadata and file data in memory. We would then bind-mount this over the desired branch of the actual Lustre filesystem tree. This somewhat defeats the purpose of Lustre, but it can be useful for the rare cases where it can't compete with NFS (like small files).

Daire

----- "siva murugan" <siva.murugan at gmail.com> wrote:
> We are trying to adopt Lustre in one of our heavily read/write-intensive
> infrastructures (daily writes: 8 million files, 1TB). The average size
> of the files written is 1KB.
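To make the image-file trick concrete, here is roughly what the recipe looks like; the paths, sizes and filesystem choice are only illustrative, and the lfs setstripe option names vary between Lustre versions:

  # Pre-create one image file per job/dataset on Lustre with a stripe
  # count of 1, so the whole image lives on a single OST, then size it
  # and put a local filesystem inside it.
  lfs setstripe -c 1 /lustre/images/job42.img
  dd if=/dev/zero of=/lustre/images/job42.img bs=1M count=4096
  mkfs.ext3 -F /lustre/images/job42.img

  # On the one client that owns the dataset, loop-mount the image and
  # write the small files into it; the MDS only ever sees the image file.
  mount -o loop /lustre/images/job42.img /mnt/job42

  # For write-once/read-often data, a compressed squashfs archive works
  # the same way and also saves space.
  mksquashfs /scratch/job42-output /lustre/images/job42.sqsh
  mount -t squashfs -o loop /lustre/images/job42.sqsh /mnt/job42-ro

The obvious caveat, as above, is that only one client can safely have the read-write image mounted at a time.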
Jeff Darcy
2008-Dec-08 16:21 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Daire Byrne wrote:
> In the past, when we have had workloads with lots of small files where
> each client/job has a unique dataset, we have used disk image files on
> top of a Lustre filesystem to store them. Each image file is stored on
> a single OST, so it reduces the overhead of going to the MDS every
> time - access becomes a seek within the image file.

We've used the same trick for read-only stuff too. In one case, using NBD to serve images that lived in a Lustre FS yielded a >12x improvement vs. using Lustre directly for an application that read ~80K (very) small files. Unfortunately, there's no equivalent solution for writes, even if the writes are only occasional.

If people wanted to brainstorm about combinations of NFS, NBD, unionfs, and other tricks that can be combined with Lustre to good effect in many-small-file cases (which aren't as uncommon in HPC as you'd think), it might be a very worthwhile exercise. I think many users and vendors hit this sooner or later, and agreeing on some recipes that could also serve as use/test cases would be to everyone's benefit.
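For anyone who wants to try the NBD variant, the setup was roughly the following; the host name, port and paths are made up for illustration, and this assumes the classic nbd-tools command-line syntax (newer releases moved to a config-file based nbd-server):

  # On a Lustre client acting as the NBD server: export a read-only
  # image file that lives on Lustre.
  nbd-server 2000 /lustre/images/dataset.img -r

  # On each compute node: attach the export and mount it read-only.
  modprobe nbd
  nbd-client nbdhost 2000 /dev/nbd0
  mount -o ro /dev/nbd0 /mnt/dataset

The many small reads then hit one large image file on the OSTs, and the per-file metadata operations stay local to the node.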
Andreas Dilger
2008-Dec-10 17:32 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
On Dec 08, 2008 15:23 +0000, Daire Byrne wrote:
> In the past, when we have had workloads with lots of small files where
> each client/job has a unique dataset, we have used disk image files on
> top of a Lustre filesystem to store them. Each image file is stored on
> a single OST, so it reduces the overhead of going to the MDS every
> time - access becomes a seek within the image file. We have even used
> squashfs archives for write-once/read-often small-file workloads, which
> has the added benefit of saving disk space. However, if many clients
> need simultaneous write access to the dataset, this isn't going to work.

Daire, this seems like a very useful trick - thanks for sharing. One could even think of this as delegating a whole sub-tree to the OST, though of course, due to the nature of such local filesystems, they cannot be used by more than one client while they are being changed.

As an FYI, Lustre supports the "immutable" attribute on files (set via "chattr +i {filename}"), so that "read-only" files can be protected from accidental modification by clients. It requires root to set and clear this flag, but it will prevent even root from modifying or deleting the file until it is cleared.

> Another trick we used for small files was to cache them on a Lustre
> client which then exported them over NFS. Putting plenty of RAM in the
> NFS exporter meant that we could hold a lot of metadata and file data
> in memory. We would then bind-mount this over the desired branch of
> the actual Lustre filesystem tree. This somewhat defeats the purpose
> of Lustre, but it can be useful for the rare cases where it can't
> compete with NFS (like small files).

Unfortunately, metadata write-caching proxies are still down the road a ways for Lustre, but it is interesting to see this as a workaround. Do you use this in a directory where lots of files are being read, or also in the case of lots of small-file writes? It would be highly strange if you could write faster into an NFS export of Lustre than to the native client.

Also, what version is the client in this case? With 1.6.6 clients and servers, the clients can grow their MDT + OST lock counts on demand, and the read cache limit is by default 3/4 of RAM, so one would expect that the native client could already cache as much as needed. The 1.8.0 OST will also have a read cache, as you know, and it would be interesting to know whether this improves small-file performance to NFS levels.
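If you want to check what a client is actually allowed to cache, the relevant client-side values can be read directly; a quick sketch, with parameter names as they appear on 1.6-era clients (fall back to the /proc paths if lctl get_param is not available in your version):

  lctl get_param llite.*.max_cached_mb        # client data cache limit, default ~3/4 of RAM
  lctl get_param ldlm.namespaces.*.lru_size   # per-target (MDC/OSC) lock LRU sizes
  cat /proc/fs/lustre/llite/*/max_cached_mb   # same value via /proc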
One of the things we have also discussed internally in the past is to allow storing small files (<= 64kB, for example) entirely on the MDT. This would allow all of the file attributes to be accessible in a single place, instead of the current requirement of doing 2+ RPCs to get all of the file attributes (MDS + OSS). It might even be possible to do whole-file readahead from the MDS - when the file is opened for read, the MDS returns the full file contents along with the attributes and a lock to the client, avoiding any other RPCs.

Having feedback from you on particular weaknesses makes it much more likely that they will be implemented in the future. Thanks for keeping in touch.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Daire Byrne
2008-Dec-11 10:13 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Andreas,

----- "Andreas Dilger" <adilger at sun.com> wrote:
> As an FYI, Lustre supports the "immutable" attribute on files (set via
> "chattr +i {filename}"), so that "read-only" files can be protected
> from accidental modification by clients. It requires root to set and
> clear this flag, but it will prevent even root from modifying or
> deleting the file until it is cleared.

Good to know - thanks.

> > Another trick we used for small files was to cache them on a Lustre
> > client which then exported them over NFS. Putting plenty of RAM in
> > the NFS exporter meant that we could hold a lot of metadata and file
> > data in memory. We would then bind-mount this over the desired branch
> > of the actual Lustre filesystem tree. This somewhat defeats the
> > purpose of Lustre, but it can be useful for the rare cases where it
> > can't compete with NFS (like small files).
>
> Unfortunately, metadata write-caching proxies are still down the road
> a ways for Lustre, but it is interesting to see this as a workaround.
> Do you use this in a directory where lots of files are being read, or
> also in the case of lots of small-file writes? It would be highly
> strange if you could write faster into an NFS export of Lustre than to
> the native client.

We only really use this for write-once/read-many types of workloads. In fact, we tend to use this trick only on user workstations, where waiting for 30,000 small files to load into Maya can be a ten-minute operation (down to around 2 minutes via an NFS cache). On the compute farm we just leave everything as native Lustre.

> Also, what version is the client in this case? With 1.6.6 clients and
> servers, the clients can grow their MDT + OST lock counts on demand,
> and the read cache limit is by default 3/4 of RAM, so one would expect
> that the native client could already cache as much as needed. The 1.8.0
> OST will also have a read cache, as you know, and it would be
> interesting to know whether this improves small-file performance to NFS
> levels.

The client was 1.6.5, which I thought had the dynamic lock count tuning enabled too. Either way, I would bump up the MDT lock count by hand to cover the likely number of small files. I am also interested to see how small-file performance is affected by OST caching. However, it is my experience that the slow small-file performance has more to do with waiting for both the MDS and OSS RPCs to return for every file, and that the read speed from disk is a small percentage of the overall time. Of course, once the disk starts seeking all over the place under heavy load, it has a much greater effect.
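In case it is useful to anyone else, "bumping the lock count by hand" just means raising the LDLM LRU size for the MDC namespace on the client. A rough sketch - the value is only illustrative, and on older clients you may have to write to /proc directly rather than use lctl set_param:

  # let the client keep far more MDC locks cached than the default
  lctl set_param ldlm.namespaces.*mdc*.lru_size=50000

  # equivalent via /proc on 1.6.x clients
  for ns in /proc/fs/lustre/ldlm/namespaces/*mdc*; do
      echo 50000 > $ns/lru_size
  done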
> One of the things we have also discussed internally in the past is to
> allow storing small files (<= 64kB, for example) entirely on the MDT.
> This would allow all of the file attributes to be accessible in a
> single place, instead of the current requirement of doing 2+ RPCs to
> get all of the file attributes (MDS + OSS).

As I said, it *seems* to me (scientific, huh?) that much of the time between files is spent waiting for both of these RPCs to return, in the same way that "ls -l" type operations are always going to be slower on Lustre than on a single NFS server. Putting small files on the MDT would probably help, but it will make sizing up the MDT a bit trickier.

> Having feedback from you on particular weaknesses makes it much more
> likely that they will be implemented in the future. Thanks for keeping
> in touch.

Here is a quick and dirty comparison of the stat() type performance of ~30,000 files using various setups:

# find /net/epsilon/tests/meshCache -printf '%kk\t|%T@|%P\n' > /dev/null

files in dir (Lustre):       41.54 seconds
files in squashfs loopback:   1.73 seconds
files in xfs loopback:        2.75 seconds
files from NFS cache:         5.37 seconds

Daire