siva murugan
2008-Dec-05 05:28 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
We are trying to adopt Lustre in one of our heavily read/write-intensive infrastructures (daily writes: 8 million files, 1TB). The average size of the files written is 1KB. (I know Lustre doesn't scale well for small files, but I wanted to analyze whether adoption is feasible.)

Following are some of the tests we ran to see the difference between writing large and small files.

MDT - 1
OST - 13 (also act as nfsserver)
Clients access the Lustre filesystem via NFS (not patchless clients)

Test 1:
Number of clients - 10
Dataset size read/written - 971MB (per client)
Number of files in the dataset - 14000
Total data written - 10GB
Time taken - 1390s

Test 2:
Number of clients - 10
Dataset size read/written - 1001MB (per client)
Number of files in the dataset - 4
Total data written - 10GB
Time taken - 215s

Test 3:
Number of clients - 10
Dataset size read/written - 53MB (per client)
Number of files in the dataset - 14000
Total data written - 530MB
Time taken - 1027s

The MDT was heavily loaded during Test 3 (load average > 25). Since the files in Test 3 are small (1KB) and the number of files written is very large (14000 x 10 clients), the MDT obviously gets loaded allocating inodes, even though the total data written in Test 3 is only 530MB.

Question: Is there any parameter I can tune on the MDT to increase performance when writing a large number of small files?

Please help.
Balagopal Pillai
2008-Dec-05 11:21 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
"OST - 13 (also act as nfsserver)"

Then I am assuming that your OSS is also a Lustre client. It might be useful to search through this list to find out the potential pitfalls of mounting Lustre volumes on an OSS.

siva murugan wrote:
> We are trying to adopt Lustre in one of our heavily read/write-intensive
> infrastructures (daily writes: 8 million files, 1TB). The average size
> of the files written is 1KB.
anil kumar
2008-Dec-08 04:18 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Hi,

Most of the reported issues with running a client and an OSS on the same server relate to insufficient memory; in our case more than 50% of the memory is always free on the OSS, as we have 32GB on the MDT/OSS nodes. We don't see performance issues while the file size is large and the number of files is small; the performance issues start as the number of files increases. So it might be the round-trip time that is causing the problem, as each transaction needs to go to the MGS/MDT and then again back to the OSS.

Example:
Case 1: If we try to write a data set of 1GB with 200 files, write, delete and read are fast.
Case 2: If we try to write a data set of 1GB with 14000 files, write, delete and read are very slow.

Most of the issues we have seen reported relate to small files, with disabling debug mode suggested as a workaround to improve performance. In our case, however, disabling debug mode did not improve performance (the commands involved are sketched at the end of this message).

Please let us know if there are any other options to improve performance when writing a large number of small files.

Thanks,
Anil

On Fri, Dec 5, 2008 at 4:51 PM, Balagopal Pillai <pillai at mathstat.dal.ca> wrote:
> Then I am assuming that your OSS is also a Lustre client. It might be
> useful to search through this list to find out the potential pitfalls
> of mounting Lustre volumes on an OSS.
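For reference, "disabling debug mode" here means clearing the Lustre kernel debug mask on the servers and clients. A minimal sketch, assuming the 1.6-era interfaces (verify the exact path against your release):

  # clear the debug mask at runtime via /proc
  echo 0 > /proc/sys/lnet/debug

  # or, equivalently, via sysctl if lnet is registered there
  sysctl -w lnet.debug=0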
Brian J. Murrell
2008-Dec-08 14:17 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
On Mon, 2008-12-08 at 09:48 +0530, anil kumar wrote:
> Example:
> Case 1: If we try to write a data set of 1GB with 200 files, write,
> delete and read are fast.
> Case 2: If we try to write a data set of 1GB with 14000 files, write,
> delete and read are very slow.

Yes, this is not surprising, for various values of "slow". Lustre is known to perform much better on large files, as that is the typical HPC workload.

> Most of the issues we have seen reported relate to small files, with
> disabling debug mode suggested as a workaround to improve performance.
> In our case, however, disabling debug mode did not improve performance.

There is a bug in our BZ tracking small-file performance issues. I'm not sure it has seen much action lately, though. I don't recall the number, but you might want to subscribe to that bug.

> Please let us know if there are any other options to improve performance
> when writing a large number of small files.

If you have already reviewed the archives of this list and applied all of the various remedies for small files, and you have plenty of memory in your MDS, then there is not much else you can do, I'm afraid. We do recognize that small files are an area in which we don't perform as well as we do on large files. As/if demand for small-file performance increases, it will bubble up our list of priorities.

b.
Kevin Van Maren
2008-Dec-08 14:58 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Brian J. Murrell wrote:
> On Mon, 2008-12-08 at 09:48 +0530, anil kumar wrote:
>> Example:
>> Case 1: If we try to write a data set of 1GB with 200 files, write,
>> delete and read are fast.
>> Case 2: If we try to write a data set of 1GB with 14000 files, write,
>> delete and read are very slow.

Just to be clear, 1GB / 14000 = ~70KB/file, while the first case is 5MB/file? What does your hardware look like (interconnect, MDT and OST RAID configurations, OST/OSS count, how many clients, etc.)? Do you have quantitative results, or just "faster" and "slow"?

Kevin
Daire Byrne
2008-Dec-08 15:23 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Siva,

In the past, when we have had workloads with lots of small files where each client/job has a unique dataset, we have used disk image files on top of a Lustre filesystem to store them (a rough sketch of the recipe is at the end of this message). Each image file is stored on a single OST, so it reduces the overhead of going to the MDS every time - access becomes a seek within the image file. We have even used squashfs archives for write-once/read-often small-file workloads, which has the added benefit of saving disk space. However, if many clients need simultaneous write access to the dataset, this isn't going to work.

Another trick we used for small files was to cache them on a Lustre client which then exported them over NFS. Putting plenty of RAM in the NFS exporter meant that we could hold a lot of metadata and file data in memory. We would then bind-mount this over the desired branch of the actual Lustre filesystem tree. This somewhat defeats the purpose of Lustre, but it can be useful for the rare cases where it can't compete with NFS (like small files).

Daire

----- "siva murugan" <siva.murugan at gmail.com> wrote:
> We are trying to adopt Lustre in one of our heavily read/write-intensive
> infrastructures (daily writes: 8 million files, 1TB). The average size
> of the files written is 1KB.
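To make the image-file trick concrete, here is roughly what the recipe looks like; the paths, sizes and filesystem choice are only illustrative, and the lfs setstripe option names vary between Lustre versions:

  # Pre-create one image file per job/dataset on Lustre with a stripe
  # count of 1, so the whole image lives on a single OST, then size it
  # and put a local filesystem inside it.
  lfs setstripe -c 1 /lustre/images/job42.img
  dd if=/dev/zero of=/lustre/images/job42.img bs=1M count=4096
  mkfs.ext3 -F /lustre/images/job42.img

  # On the one client that owns the dataset, loop-mount the image and
  # write the small files into it; the MDS only ever sees the image file.
  mount -o loop /lustre/images/job42.img /mnt/job42

  # For write-once/read-often data, a compressed squashfs archive works
  # the same way and also saves space.
  mksquashfs /scratch/job42-output /lustre/images/job42.sqsh
  mount -t squashfs -o loop /lustre/images/job42.sqsh /mnt/job42-ro

The obvious caveat, as above, is that only one client can safely have the read-write image mounted at a time.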
Jeff Darcy
2008-Dec-08 16:21 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Daire Byrne wrote:
> In the past, when we have had workloads with lots of small files where
> each client/job has a unique dataset, we have used disk image files on
> top of a Lustre filesystem to store them. Each image file is stored on
> a single OST, so it reduces the overhead of going to the MDS every
> time - access becomes a seek within the image file.

We've used the same trick for read-only stuff too. In one case, using NBD to serve images that lived in a Lustre FS yielded a >12x improvement vs. using Lustre directly for an application that read ~80K (very) small files. Unfortunately, there's no equivalent solution for writes, even if the writes are only occasional.

If people wanted to brainstorm about combinations of NFS, NBD, unionfs, and other tricks that can be combined with Lustre to good effect in many-small-file cases (which aren't as uncommon in HPC as you'd think), it might be a very worthwhile exercise. I think many users and vendors hit this sooner or later, and agreeing on some recipes that could also serve as use/test cases would be to everyone's benefit.
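For anyone who wants to try the NBD variant, the setup was roughly the following; the host name, port and paths are made up for illustration, and this assumes the classic nbd-tools command-line syntax (newer releases moved to a config-file based nbd-server):

  # On a Lustre client acting as the NBD server: export a read-only
  # image file that lives on Lustre.
  nbd-server 2000 /lustre/images/dataset.img -r

  # On each compute node: attach the export and mount it read-only.
  modprobe nbd
  nbd-client nbdhost 2000 /dev/nbd0
  mount -o ro /dev/nbd0 /mnt/dataset

The many small reads then hit one large image file on the OSTs, and the per-file metadata operations stay local to the node.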
Andreas Dilger
2008-Dec-10 17:32 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
On Dec 08, 2008 15:23 +0000, Daire Byrne wrote:
> In the past, when we have had workloads with lots of small files where
> each client/job has a unique dataset, we have used disk image files on
> top of a Lustre filesystem to store them. Each image file is stored on
> a single OST, so it reduces the overhead of going to the MDS every
> time - access becomes a seek within the image file. We have even used
> squashfs archives for write-once/read-often small-file workloads, which
> has the added benefit of saving disk space. However, if many clients
> need simultaneous write access to the dataset, this isn't going to work.

Daire, this seems like a very useful trick - thanks for sharing. One could even think of this as delegating a whole sub-tree to the OST, though of course, due to the nature of such local filesystems, they cannot be used by more than one client while they are being changed.

As an FYI, Lustre supports the "immutable" attribute on files (set via "chattr +i {filename}"), so that "read-only" files can be protected from accidental modification by clients. It requires root to set and clear this flag, but it will prevent even root from modifying or deleting the file until it is cleared.

> Another trick we used for small files was to cache them on a Lustre
> client which then exported them over NFS. Putting plenty of RAM in the
> NFS exporter meant that we could hold a lot of metadata and file data
> in memory. We would then bind-mount this over the desired branch of
> the actual Lustre filesystem tree. This somewhat defeats the purpose
> of Lustre, but it can be useful for the rare cases where it can't
> compete with NFS (like small files).

Unfortunately, metadata write-caching proxies are still down the road a ways for Lustre, but it is interesting to see this as a workaround. Do you use this in a directory where lots of files are being read, or also in the case of lots of small-file writes? It would be highly strange if you could write faster into an NFS export of Lustre than to the native client.

Also, what version is the client in this case? With 1.6.6 clients and servers, the clients can grow their MDT + OST lock counts on demand, and the read cache limit is by default 3/4 of RAM, so one would expect that the native client could already cache as much as needed. The 1.8.0 OST will also have a read cache, as you know, and it would be interesting to know whether this improves small-file performance to NFS levels.
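If you want to check what a client is actually allowed to cache, the relevant client-side values can be read directly; a quick sketch, with parameter names as they appear on 1.6-era clients (fall back to the /proc paths if lctl get_param is not available in your version):

  lctl get_param llite.*.max_cached_mb        # client data cache limit, default ~3/4 of RAM
  lctl get_param ldlm.namespaces.*.lru_size   # per-target (MDC/OSC) lock LRU sizes
  cat /proc/fs/lustre/llite/*/max_cached_mb   # same value via /proc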
One of the things we have also discussed internally in the past is to allow storing small files (<= 64kB, for example) entirely on the MDT. This would allow all of the file attributes to be accessible in a single place, instead of the current requirement of doing 2+ RPCs to get all of the file attributes (MDS + OSS). It might even be possible to do whole-file readahead from the MDS - when the file is opened for read, the MDS returns the full file contents along with the attributes and a lock to the client, avoiding any other RPCs.

Having feedback from you on particular weaknesses makes it much more likely that they will be implemented in the future. Thanks for keeping in touch.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Daire Byrne
2008-Dec-11 10:13 UTC
[Lustre-discuss] MDT overloaded when writing small files in large numbers
Andreas,

----- "Andreas Dilger" <adilger at sun.com> wrote:
> As an FYI, Lustre supports the "immutable" attribute on files (set via
> "chattr +i {filename}"), so that "read-only" files can be protected
> from accidental modification by clients. It requires root to set and
> clear this flag, but it will prevent even root from modifying or
> deleting the file until it is cleared.

Good to know - thanks.

> > Another trick we used for small files was to cache them on a Lustre
> > client which then exported them over NFS. Putting plenty of RAM in
> > the NFS exporter meant that we could hold a lot of metadata and file
> > data in memory. We would then bind-mount this over the desired branch
> > of the actual Lustre filesystem tree. This somewhat defeats the
> > purpose of Lustre, but it can be useful for the rare cases where it
> > can't compete with NFS (like small files).
>
> Unfortunately, metadata write-caching proxies are still down the road
> a ways for Lustre, but it is interesting to see this as a workaround.
> Do you use this in a directory where lots of files are being read, or
> also in the case of lots of small-file writes? It would be highly
> strange if you could write faster into an NFS export of Lustre than to
> the native client.

We only really use this for write-once/read-many types of workloads. In fact, we tend to use this trick only on user workstations, where waiting for 30,000 small files to load into Maya can be a ten-minute operation (down to around 2 minutes via an NFS cache). On the compute farm we just leave everything as native Lustre.

> Also, what version is the client in this case? With 1.6.6 clients and
> servers, the clients can grow their MDT + OST lock counts on demand,
> and the read cache limit is by default 3/4 of RAM, so one would expect
> that the native client could already cache as much as needed. The 1.8.0
> OST will also have a read cache, as you know, and it would be
> interesting to know whether this improves small-file performance to NFS
> levels.

The client was 1.6.5, which I thought had the dynamic lock count tuning enabled too. Either way, I would bump up the MDT lock count by hand to cover the likely number of small files. I am also interested to see how small-file performance is affected by OST caching. However, it is my experience that the slow small-file performance has more to do with waiting for both the MDS and OSS RPCs to return for every file, and that the read speed from disk is a small percentage of the overall time. Of course, once the disk starts seeking all over the place under heavy load, it has a much greater effect.
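In case it is useful to anyone else, "bumping the lock count by hand" just means raising the LDLM LRU size for the MDC namespace on the client. A rough sketch - the value is only illustrative, and on older clients you may have to write to /proc directly rather than use lctl set_param:

  # let the client keep far more MDC locks cached than the default
  lctl set_param ldlm.namespaces.*mdc*.lru_size=50000

  # equivalent via /proc on 1.6.x clients
  for ns in /proc/fs/lustre/ldlm/namespaces/*mdc*; do
      echo 50000 > $ns/lru_size
  done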
> One of the things we have also discussed internally in the past is to
> allow storing small files (<= 64kB, for example) entirely on the MDT.
> This would allow all of the file attributes to be accessible in a
> single place, instead of the current requirement of doing 2+ RPCs to
> get all of the file attributes (MDS + OSS).

As I said, it *seems* to me (scientific, huh?) that much of the time between files is spent waiting for both of these RPCs to return, in the same way that "ls -l" type operations are always going to be slower on Lustre than on a single NFS server. Putting small files on the MDT would probably help, but it will make sizing up the MDT a bit trickier.

> Having feedback from you on particular weaknesses makes it much more
> likely that they will be implemented in the future. Thanks for keeping
> in touch.

Here is a quick and dirty comparison of the stat() type performance of ~30,000 files using various setups:

# find /net/epsilon/tests/meshCache -printf '%kk\t|%T@|%P\n' > /dev/null

files in dir (Lustre):       41.54 seconds
files in squashfs loopback:   1.73 seconds
files in xfs loopback:        2.75 seconds
files from NFS cache:         5.37 seconds

Daire