kmehta at cs.uh.edu
2011-May-23 21:28 UTC
[Lustre-community] Poor multithreaded I/O performance
Hello,
I am running a multithreaded application that writes to a common shared file on a Lustre filesystem, and this is what I see:

If I have a single thread in my application, I get a bandwidth of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe size). However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle), I still get the same bandwidth.

Now, instead of writing to a shared file, if these threads write to separate files, the bandwidth obtained is approx. 700 MBytes/sec.

I would ideally like my multithreaded application to see similar scaling. Any ideas why the performance is limited, and any workarounds?

Thank you,
Kshitij
What is your stripe count on the file? If your default is 1, you are only writing to one of the OSTs. You can check with the lfs getstripe command; you can set the stripe count higher, and hopefully your wide-striped file with threaded writes will be faster.

Evan
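As a concrete illustration of the suggestion above, using the lfs syntax of that era (the paths are placeholders, and newer Lustre releases spell the options --stripe-count and --stripe-size):

    $ lfs getstripe /path/to/output_file             # show the stripe count and size actually in use
    $ lfs setstripe -c 8 -s 1M /path/to/output_dir   # files created here afterwards get 8 stripes of 1 MB

Striping is fixed when a file is created, so setstripe is normally applied to the directory (or to an empty file) before the benchmark writes its data.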
Kevin Van Maren
2011-May-23 21:34 UTC
[Lustre-community] Poor multithreaded I/O performance
kmehta at cs.uh.edu wrote:
> The stripe count is 48.

With only 11 OSTs?
kmehta at cs.uh.edu
2011-May-23 21:35 UTC
[Lustre-community] Poor multithreaded I/O performance
The stripe count is 48.

Just FYI, this is what my application does: a simple I/O test where threads continually write blocks of size 64 KBytes or 1 MByte (decided at compile time) until a large file of, say, 16 GBytes is created.

Thanks,
Kshitij
Wojciech Turek
2011-May-23 21:37 UTC
[Lustre-community] Poor multithreaded I/O performance
Run lfs getstripe <your_output_file> and paste the output of that command to the mailing list. A stripe count of 48 is not possible if you have at most 11 OSTs (the maximum stripe count would be 11). If your striping is correct, the bottleneck may be your client network.

regards,

Wojciech

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
kmehta at cs.uh.edu
2011-May-23 21:46 UTC
[Lustre-community] Poor multithreaded I/O performance
This is what my system documentation says: "Lustre filesystem is exported by 11 servers via Infiniband". I guess this means 11 OSTs (my apologies if it doesn't).

This is the output of the lfs getstripe command:

$ lfs getstripe my_output_1824315 --quiet --verbose
OBDS:
0: fastfs-OST0000_UUID ACTIVE
1: fastfs-OST0001_UUID ACTIVE
2: fastfs-OST0002_UUID ACTIVE
3: fastfs-OST0003_UUID ACTIVE
4: fastfs-OST0004_UUID ACTIVE
5: fastfs-OST0005_UUID ACTIVE
6: fastfs-OST0006_UUID ACTIVE
7: fastfs-OST0007_UUID ACTIVE
8: fastfs-OST0008_UUID ACTIVE
9: fastfs-OST0009_UUID ACTIVE
10: fastfs-OST000a_UUID ACTIVE
11: fastfs-OST000b_UUID ACTIVE
12: fastfs-OST000c_UUID ACTIVE
13: fastfs-OST000d_UUID ACTIVE
14: fastfs-OST000e_UUID ACTIVE
15: fastfs-OST000f_UUID ACTIVE
16: fastfs-OST0010_UUID ACTIVE
17: fastfs-OST0011_UUID ACTIVE
18: fastfs-OST0012_UUID ACTIVE
19: fastfs-OST0013_UUID ACTIVE
20: fastfs-OST0014_UUID ACTIVE
21: fastfs-OST0015_UUID ACTIVE
22: fastfs-OST0016_UUID ACTIVE
23: fastfs-OST0017_UUID ACTIVE
24: fastfs-OST0018_UUID ACTIVE
25: fastfs-OST0019_UUID ACTIVE
26: fastfs-OST001a_UUID ACTIVE
27: fastfs-OST001b_UUID ACTIVE
28: fastfs-OST001c_UUID ACTIVE
29: fastfs-OST001d_UUID ACTIVE
30: fastfs-OST001e_UUID ACTIVE
31: fastfs-OST001f_UUID ACTIVE
32: fastfs-OST0020_UUID ACTIVE
33: fastfs-OST0021_UUID ACTIVE
34: fastfs-OST0022_UUID ACTIVE
35: fastfs-OST0023_UUID ACTIVE
36: fastfs-OST0024_UUID ACTIVE
37: fastfs-OST0025_UUID ACTIVE
38: fastfs-OST0026_UUID ACTIVE
39: fastfs-OST0027_UUID ACTIVE
40: fastfs-OST0028_UUID ACTIVE
41: fastfs-OST0029_UUID ACTIVE
42: fastfs-OST002a_UUID ACTIVE
43: fastfs-OST002b_UUID ACTIVE
44: fastfs-OST002c_UUID ACTIVE
45: fastfs-OST002d_UUID ACTIVE
46: fastfs-OST002e_UUID ACTIVE
47: fastfs-OST002f_UUID ACTIVE
48: fastfs-OST0030_UUID ACTIVE
49: fastfs-OST0031_UUID ACTIVE
50: fastfs-OST0032_UUID ACTIVE
51: fastfs-OST0033_UUID ACTIVE
52: fastfs-OST0034_UUID ACTIVE
53: fastfs-OST0035_UUID ACTIVE
54: fastfs-OST0036_UUID ACTIVE
55: fastfs-OST0037_UUID ACTIVE
56: fastfs-OST0038_UUID ACTIVE
57: fastfs-OST0039_UUID ACTIVE
58: fastfs-OST003a_UUID ACTIVE
59: fastfs-OST003b_UUID ACTIVE
60: fastfs-OST003c_UUID ACTIVE
61: fastfs-OST003d_UUID ACTIVE
62: fastfs-OST003e_UUID ACTIVE
63: fastfs-OST003f_UUID ACTIVE
my_output_1824315
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x3c3839d
lmm_stripe_count:   48
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
        obdidx     objid      objid   group
             5   6096574   0x5d06be       0
            25   6216932   0x5edce4       0
             9   6428932   0x621904       0
            27   6275058   0x5fbff2       0
            19   6290046   0x5ffa7e       0
            48   6082133   0x5cce55       0
            58   6223558   0x5ef6c6       0
            40   6153492   0x5de514       0
            59   6269987   0x5fac23       0
            15   5587155   0x5540d3       0
            46   6191301   0x5e78c5       0
            26   6444958   0x62579e       0
            54   6421150   0x61fa9e       0
            34   6222465   0x5ef281       0
            55   6288603   0x5ff4db       0
            13   6360247   0x610cb7       0
             8   5921168   0x5a5990       0
            29   6144665   0x5dc299       0
            63   5799435   0x587e0b       0
            53   6356594   0x60fe72       0
             6   6214509   0x5ed36d       0
            61   6319347   0x606cf3       0
            43   6414677   0x61e155       0
            36   5790422   0x585ad6       0
            18   6222532   0x5ef2c4       0
            28   5921782   0x5a5bf6       0
             1   6361844   0x6112f4       0
            41   5746110   0x57adbe       0
            35   6043439   0x5c372f       0
            45   6122676   0x5d6cb4       0
             2   6193223   0x5e8047       0
            62   5902764   0x5a11ac       0
            56   6511354   0x635afa       0
            23   5576293   0x551665       0
            14   6258551   0x5f7f77       0
            12   6109474   0x5d3922       0
            60   6407726   0x61c62e       0
            57   6243713   0x5f4581       0
            20   6249079   0x5f5a77       0
             3   5639606   0x560db6       0
            50   5982718   0x5b49fe       0
            31   6372788   0x613db4       0
            52   6502335   0x6337bf       0
            32   4738970   0x484f9a       0
            38   5440109   0x53026d       0
            51   4683453   0x4776bd       0
            39   6391955   0x618893       0
            16   5755161   0x57d119       0
kmehta at cs.uh.edu
2011-May-23 22:04 UTC
[Lustre-community] Poor multithreaded I/O performance
So I think there are 11 servers (OSSes, not OSTs; sorry). Running 'lfs check osts' returns 64 entries, so I think the system has been configured with 64 OSTs.

- Kshitij
kmehta at cs.uh.edu
2011-May-23 22:09 UTC
[Lustre-community] Poor multithreaded I/O performance
Actually, 'lfs check servers' returns 64 entries as well, so I presume the system documentation is out of date.

Again, I am sorry the basic information was incorrect.

- Kshitij
Wojciech Turek
2011-May-23 23:52 UTC
[Lustre-community] Poor multithreaded I/O performance
OK, so it looks like you have 64 OSTs in total and your output file is striped across 48 of them. May I suggest that you limit the number of stripes; a good number to start with would be 8 stripes. For best results, also use the OST pools feature to arrange that each stripe goes to an OST owned by a different OSS.

regards,

Wojciech
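As a sketch of the OST pools idea (the pool name and OST indices below are hypothetical, the pool commands run via lctl on the MGS, and exact syntax can differ between Lustre releases):

    mgs# lctl pool_new fastfs.bench8
    mgs# lctl pool_add fastfs.bench8 fastfs-OST[0000-0007]
    client$ lfs setstripe -c 8 -p bench8 /path/to/output_dir

Choosing the eight OST indices so that no two are served by the same OSS is what spreads the eight stripes across eight different servers.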
Kevin Van Maren
2011-May-24 14:16 UTC
[Lustre-community] Poor multithreaded I/O performance
[Moved to Lustre-discuss]

"However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle)"

How exactly does your multi-threaded application write the data? Are you using pwrite to ensure non-overlapping regions, or are they all just doing unlocked write() operations on the same fd (each just transferring size/8)? If it divides the file into N pieces, and each thread does pwrite on its piece, then what each OST sees are multiple streams at wide offsets to the same object, which could impact performance.

If on the other hand the file is written sequentially, where each thread grabs the next piece to be written (with locking normally used for the current_offset value, so you know where each chunk is actually going), then you get a more sequential pattern at the OST.

If the number of threads maps to the number of OSTs (or some modulo, like in your case 6 OSTs per thread), and each thread "owns" the piece of the file that belongs to an OST (i.e.: for (offset = thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need for application locks (assuming the use of pwrite) and ensured each OST object is being written sequentially.

It's quite possible there is some bottleneck on the shared fd. So perhaps the question is not why you aren't scaling with more threads, but why the single file is not able to saturate the client, or why the file BW is not scaling with more OSTs. It is somewhat common for multiple processes (on different nodes) to write non-overlapping regions of the same file; does performance improve if each thread opens its own file descriptor?

Kevin
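To make the suggested pattern concrete, here is a minimal sketch (not the simple_io_test.c attached later in the thread) of the loop Kevin describes, assuming 8 threads, a 48-stripe file with a 1 MB stripe size, a 6 MB chunk per thread per cycle, and a 16 GB file; the path is a placeholder and error checking is omitted. Build with something like: cc -O2 -pthread -D_FILE_OFFSET_BITS=64 sketch.c -o sketch

    #define _XOPEN_SOURCE 500               /* for pwrite() */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define CHUNK    (6ULL << 20)           /* 6 MB: six consecutive 1 MB stripe units */
    #define CYCLE    (NTHREADS * CHUNK)     /* 48 MB: one full pass over a 48-stripe file */
    #define FILESIZE (16ULL << 30)          /* 16 GB of data in total */

    static const char *path = "/path/to/lustre/shared_file";   /* placeholder; pre-create with lfs setstripe -c 48 */

    static void *writer(void *arg)
    {
        long tid = (long)arg;
        char *buf = calloc(1, CHUNK);                      /* payload contents do not matter for a bandwidth test */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);     /* one descriptor per thread */
        off_t off;

        /* Thread tid owns offsets tid*6MB, tid*6MB+48MB, tid*6MB+96MB, ...
         * so each of "its" six OST objects is written sequentially by one thread only. */
        for (off = (off_t)(tid * CHUNK); off < (off_t)FILESIZE; off += (off_t)CYCLE)
            pwrite(fd, buf, (size_t)CHUNK, off);

        close(fd);
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, (void *)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because each thread writes contiguous 6 MB runs spaced 48 MB apart, every OST object receives a single sequential stream instead of interleaved 1 MB writes from all eight threads.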
kmehta at cs.uh.edu
2011-May-24 15:36 UTC
[Lustre-discuss] Poor multithreaded I/O performance
This is what my application does:

Each thread has its own file descriptor to the file. I use pwrite to ensure non-overlapping regions, as follows:

Thread 0, data_size: 1MB, offset: 0
Thread 1, data_size: 1MB, offset: 1MB
Thread 2, data_size: 1MB, offset: 2MB
Thread 3, data_size: 1MB, offset: 3MB
<repeat cycle>
Thread 0, data_size: 1MB, offset: 4MB
and so on. (This happens in parallel; I don't wait for one cycle to end before the next one begins.)

I am going to try the following:

a) Instead of a round-robin distribution of offsets, test with sequential offsets:
Thread 0, data_size: 1MB, offset: 0
Thread 0, data_size: 1MB, offset: 1MB
Thread 0, data_size: 1MB, offset: 2MB
Thread 0, data_size: 1MB, offset: 3MB
Thread 1, data_size: 1MB, offset: 4MB
and so on. (I am going to keep these as separate pwrite I/O requests instead of merging them or using writev.)

b) Map the threads to the number of OSTs using some modulo, as suggested in Kevin's email.

c) Experiment with fewer OSTs (I currently have 48).

I shall report back with my findings.

Thanks,
Kshitij
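By way of comparison, the round-robin pattern described above amounts to changing only the stride in the per-thread loop of the sketch shown after Kevin's message (again just a sketch; BLOCK would be 1 MB here, and fd, buf, tid, NTHREADS and FILESIZE are set up exactly as in that sketch):

    /* Round-robin: thread tid writes blocks tid, tid+NTHREADS, tid+2*NTHREADS, ...
     * With BLOCK equal to the 1 MB stripe size, thread tid only ever touches
     * stripe positions tid, tid+NTHREADS, ... modulo the stripe count, i.e. a
     * fixed subset of the OSTs (exactly one OST when the stripe count equals NTHREADS). */
    for (off = (off_t)(tid * BLOCK); off < (off_t)FILESIZE; off += (off_t)(NTHREADS * BLOCK))
        pwrite(fd, buf, (size_t)BLOCK, off);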
kmehta at cs.uh.edu
2011-May-26 19:02 UTC
[Lustre-discuss] Poor multithreaded I/O performance
OK, I ran the following tests:

[1]
The application spawns 8 threads. I write to Lustre using 8 OSTs. Each thread writes data in blocks of 1 MByte in a round-robin fashion, i.e.

T0 writes to offsets 0, 8MB, 16MB, etc.
T1 writes to offsets 1MB, 9MB, 17MB, etc.

The stripe size being 1 MByte, every thread ends up writing to only 1 OST. I see a bandwidth of 280 MBytes/sec, similar to the single-thread performance.

[2]
I also ran the same test such that every thread writes data in blocks of 8 MBytes for the same stripe size (thus, every thread will write to every OST). I still get similar performance, ~280 MBytes/sec, so essentially I see no difference between each thread writing to a single OST and each thread writing to all OSTs.

And as I said before, if all threads write to their own separate files, the resulting bandwidth is ~700 MBytes/sec.

I have attached my C file (simple_io_test.c) herewith. Maybe you could run it and see where the bottleneck is. Comments and instructions for compilation have been included in the file. Do let me know if you need any clarification on that.

Your help is appreciated,
Kshitij

-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_io_test.c
Type: text/x-csrc
Size: 9579 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110526/16b2680f/attachment.bin
Hi David,

I am writing to an existing directory that has supposedly been configured on 8 OSTs. I have shown below the output of lfs getstripe on the directory and on the output file generated by the program. It seems that the file is correctly striped across 8 OSTs.

In one of the previous emails, Wojciech suggested I make sure that the OSTs belong to different OSSes using the OST pools feature. Can someone suggest how I can verify that an existing directory is configured on OSTs belonging to different OSSes? (Though I have a hint that that's not the problem, since writing to separate files in the same directory does give me ~700 MBytes/sec.)

lfs getstripe ../../ss_8/ --verbose
--------------------------------------------------------------------
OBDS:
0: fastfs-OST0000_UUID ACTIVE
   [OSTs 1 through 62 likewise ACTIVE]
63: fastfs-OST003f_UUID ACTIVE
../../ss_8
stripe_count: 8 stripe_size: 0 stripe_offset: -1
--------------------------------------------------------------------

Running lfs getstripe on the 16 GByte file generated by the program shows this:

lfs getstripe ../../ss_8/kmtest.txt --verbose
--------------------------------------------------------------------
OBDS:
0: fastfs-OST0000_UUID ACTIVE
   [OSTs 1 through 62 likewise ACTIVE]
63: fastfs-OST003f_UUID ACTIVE
../../ss_8/kmtest.txt
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x3c52894
lmm_stripe_count:   8
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
        obdidx     objid      objid   group
            10   6352973   0x60f04d       0
            20   6260051   0x5f8553       0
             4   5733251   0x577b83       0
            22   6381603   0x616023       0
            17   6265103   0x5f990f       0
            45   6133999   0x5d98ef       0
            31   6383869   0x6168fd       0
            58   6234719   0x5f225f       0
--------------------------------------------------------------------

Thanks,
Kshitij

-----Original Message-----
From: David Vasil [mailto:dvasil at ddn.com]
Sent: Thursday, May 26, 2011 3:01 PM
To: kmehta at cs.uh.edu
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

Hi Kshitij,

Did you create your files with 'lfs setstripe -c <stripe count> <file>' before writing to them, or did you create a directory with a default stripe count greater than 1? It sounds like you are only striping across 1 OST. After writing your file out, perform an:

lfs getstripe <file>

Try pre-creating a more widely striped file with:

lfs setstripe -c N <file>

where N is > 1. You can create a directory where all files under the hierarchy will be striped using more OSTs in the same manner with lfs setstripe.

_____
David Vasil
DataDirect Networks
615.307.0865
dvasil at ddn.com
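One way to check the OST-to-OSS mapping from a client on Lustre releases of that era, assuming the ost_conn_uuid parameter is available (the exact parameter name can vary between versions), is to query the OSC devices:

    $ lctl get_param osc.fastfs-OST*.ost_conn_uuid

This prints, for each OST, the NID of the server it is currently connected to, so you can see how many distinct OSSes back the OSTs a given file is striped over.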
kmehta at cs.uh.edu
2011-Jun-03 00:06 UTC
[Lustre-discuss] Poor multithreaded I/O performance
Hello,
I was wondering if anyone could replicate the performance of the multithreaded application using the C file that I posted in my previous email.

Thanks,
Kshitij

> Ok I ran the following tests:
>
> [1]
> Application spawns 8 threads. I write to Lustre having 8 OSTs.
> Each thread writes data in blocks of 1 MByte in a round-robin fashion, i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
> The stripe size being 1 MByte, every thread ends up writing to only 1 OST.
>
> I see a bandwidth of 280 MBytes/sec, similar to the single-thread performance.
>
> [2]
> I also ran the same test such that every thread writes data in blocks of 8 MBytes for the same stripe size. (Thus, every thread will write to every OST.) I still get similar performance, ~280 MBytes/sec, so essentially I see no difference between each thread writing to a single OST vs. each thread writing to all OSTs.
>
> And as I said before, if all threads write to their own separate files, the resulting bandwidth is ~700 MBytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you could run it and see where the bottleneck is. Comments and instructions for compilation have been included in the file. Do let me know if you need any clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB
>> and so on. (This happens in parallel; I don't wait for one cycle to end before the next one begins.)
>>
>> I am gonna try the following:
>>
>> a) Instead of a round-robin distribution of offsets, test with sequential offsets:
>>    Thread 0, data_size: 1MB, offset: 0
>>    Thread 0, data_size: 1MB, offset: 1MB
>>    Thread 0, data_size: 1MB, offset: 2MB
>>    Thread 0, data_size: 1MB, offset: 3MB
>>
>>    Thread 1, data_size: 1MB, offset: 4MB
>>    and so on. (I am gonna keep these as separate pwrite I/O requests instead of merging them or using writev.)
>>
>> b) Map the threads to the no. of OSTs using some modulo, as suggested in the email below.
>>
>> c) Experiment with a smaller no. of OSTs (I currently have 48).
>>
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
>>> [Moved to Lustre-discuss]
>>>
>>> "However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle)"
>>>
>>> How exactly does your multi-threaded application write the data? Are you using pwrite to ensure non-overlapping regions, or are they all just doing unlocked write() operations on the same fd for each write (each just transferring size/8)? If it divides the file into N pieces, and each thread does pwrite on its piece, then what each OST sees are multiple streams at wide offsets to the same object, which could impact performance.
>>>
>>> If on the other hand the file is written sequentially, where each thread grabs the next piece to be written (locking normally used for the current_offset value, so you know where each chunk is actually going), then you get a more sequential pattern at the OST.
>>>
>>> If the number of threads maps to the number of OSTs (or some modulo, like in your case 6 OSTs per thread), and each thread "owns" the piece of the file that belongs to an OST (ie: for (offset = thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need for application locks (assuming the use of pwrite) and ensured each OST object is being written sequentially.
>>>
>>> It's quite possible there is some bottleneck on the shared fd. So perhaps the question is not why you aren't scaling with more threads, but why the single file is not able to saturate the client, or why the file BW is not scaling with more OSTs. It is somewhat common for multiple processes (on different nodes) to write non-overlapping regions of the same file; does performance improve if each thread opens its own file descriptor?
>>>
>>> Kevin
>>>
>>> Wojciech Turek wrote:
>>>> Ok, so it looks like you have in total 64 OSTs and your output file is striped across 48 of them. May I suggest that you limit the number of stripes; a good number to start with would be 8 stripes, and for best results use the OST pools feature to arrange that each stripe goes to an OST owned by a different OSS.
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>>
>>>>     Actually, 'lfs check servers' returns 64 entries as well, so I presume the system documentation is out of date.
>>>>
>>>>     Again, I am sorry the basic information had been incorrect.
>>>>
>>>>     - Kshitij
>>>>
>>>>     > Run lfs getstripe <your_output_file> and paste the output of that command to the mailing list.
>>>>     > Stripe count of 48 is not possible if you have max 11 OSTs (the max stripe count will be 11).
>>>>     > If your striping is correct, the bottleneck can be your client network.
>>>>     >
>>>>     > regards,
>>>>     >
>>>>     > Wojciech
What file sizes and segment sizes are you using for your tests?

Evan

> Hello,
> I was wondering if anyone could replicate the performance of the multithreaded application using the C file that I posted in my previous email.
>
> Thanks,
> Kshitij
kmehta at cs.uh.edu
2011-Jun-03 16:53 UTC
[Lustre-discuss] Poor multithreaded I/O performance
I ran the test with a 16 GByte file size and segment sizes of 64 KBytes, 1 MByte, and 8 MBytes.

Thanks,
Kshitij

> What file sizes and segment sizes are you using for your tests?
>
> Evan
I've been trying to test this, but not finding an obvious error... so, more questions:

How much RAM do you have on your client, and how much on the OSTs? Some of my smaller tests go much faster, but I believe that is due to cache effects. My larger test at 32GB gives pretty consistent results.

The other thing to consider: are the separate files being striped 8 ways? Because that would allow them to hit possibly all 64 OSTs, while the shared-file case will only hit 8.

Evan

> I ran the test with a 16 GByte file size and segment sizes of 64 KBytes, 1 MByte, and 8 MBytes.
>
> Thanks,
> Kshitij
kmehta at cs.uh.edu
2011-Jun-06 18:20 UTC
[Lustre-discuss] Poor multithreaded I/O performance
> are the separate files being striped 8 ways?
> Because that would allow them to hit possibly all 64 OSTs, while the
> shared-file case will only hit 8

Yes, I found out that the files are getting striped 8 ways, so we end up hitting 64 OSTs. This is what I tried next:

1. Ran a test case where 6 threads write separate files, each of size 6 GB, to a directory configured over 8 OSTs. Thus the application writes 36 GB of data in total, over 48 OSTs (6 files x 8 stripes each).

2. Ran a test case where 8 threads write a common file of size 36 GB to a directory configured over 48 OSTs.

Thus both tests ultimately write 36 GB of data over 48 OSTs. I still see a bandwidth of 240 MBps for test 2 (common file), and 740 MBps for test 1 (separate files).

Thanks,
Kshitij
I read in a research paper (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about Lustre''s ability to join files in place. Can someone point me to sample code and documentation on this? I couldnt find information in the manual. Being able to join files in place could be a potential solution to the issue I have. Thanks, Kshitij On 06/06/2011 01:20 PM, kmehta at cs.uh.edu wrote:>> are the separate files being striped 8 ways? >> Because that would allow them to hit possibly all 64 OST''s, while the >> shared file case will only hit 8 > Yes, I found out that the files are getting striped 8 ways, so we end up > hitting 64 OSTs. This is what I tried next: > > 1. Ran a test case where 6 threads write separate files, each of size 6 > GB, to a directory configured over 8 OSTs. Thus the application writes > 36GB of data in total, over 48 OSTs. > > 2. Ran a test case where 8 threads write a common file of size 36GB to a > directory configured over 48 OSTs. > > Thus both tests ultimately write 36GB of data over 48 OSTS. I still see a > b/w of 240MBps for test 2 (common file), and b/w of 740 MBps for test 1 > (separate files). > > Thanks, > Kshitij > >> I''ve been trying to test this, but not finding an obvious error... so >> more questions: >> >> How much RAM do you have on your client, and how much on the OST''s some >> of my smaller tests go much faster, but I believe that it is cache based >> effects. My larger test at 32GB gives pretty consistent results. >> >> The other thing to consider: are the separate files being striped 8 ways? >> Because that would allow them to hit possibly all 64 OST''s, while the >> shared file case will only hit 8. >> >> Evan >> >> -----Original Message----- >> From: lustre-discuss-bounces at lists.lustre.org >> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Felix, Evan >> J >> Sent: Friday, June 03, 2011 9:09 AM >> To: kmehta at cs.uh.edu >> Cc: Lustre discuss >> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance >> >> What file sizes and segment sizes are you using for your tests? >> >> Evan >> >> -----Original Message----- >> From: lustre-discuss-bounces at lists.lustre.org >> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of >> kmehta at cs.uh.edu >> Sent: Thursday, June 02, 2011 5:07 PM >> To: kmehta at cs.uh.edu >> Cc: kmehta at cs.uh.edu; Lustre discuss >> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance >> >> Hello, >> I was wondering if anyone could replicate the performance of the >> multithreaded application using the C file that I posted in my previous >> email. >> >> Thanks, >> Kshitij >> >> >>> Ok I ran the following tests: >>> >>> [1] >>> Application spawns 8 threads. I write to Lustre having 8 OSTs. >>> Each thread writes data in blocks of 1 Mbyte in a round robin fashion, >>> i.e. >>> >>> T0 writes to offsets 0, 8MB, 16MB, etc. >>> T1 writes to offsets 1MB, 9MB, 17MB, etc. >>> The stripe size being 1MByte, every thread ends up writing to only 1 >>> OST. >>> >>> I see a bandwidth of 280 Mbytes/sec, similar to the single thread >>> performance. >>> >>> [2] >>> I also ran the same test such that every thread writes data in blocks >>> of 8 Mbytes for the same stripe size. (Thus, every thread will write >>> to every OST). I still get similar performance, ~280Mbytes/sec, so >>> essentially I see no difference between each thread writing to a >>> single OST vs each thread writing to all OSTs. 
>>>
>>> And as I said before, if all threads write to their own separate file,
>>> the resulting bandwidth is ~700 MBytes/sec.
>>>
>>> I have attached my C file (simple_io_test.c) herewith. Maybe you could
>>> run it and see where the bottleneck is. Comments and instructions for
>>> compilation have been included in the file. Do let me know if you need
>>> any clarification on that.
>>>
>>> Your help is appreciated,
>>> Kshitij

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
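For reference, below is a minimal sketch of the test pattern described in this thread: NTHREADS threads share one output file, each thread opens its own file descriptor, and each issues 1 MByte pwrite() calls at non-overlapping, round-robin offsets until the file reaches its target size. This is a reconstruction from the descriptions in the thread, not the simple_io_test.c attachment; the file path, sizes, thread count, build line, and error handling are placeholders.

/*
 * Sketch of the shared-file, round-robin pwrite pattern discussed above.
 * Assumed build line: gcc -O2 -pthread -D_FILE_OFFSET_BITS=64 sketch.c -o sketch
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   8
#define BLOCK_SIZE (1ULL << 20)              /* 1 MByte per pwrite */
#define FILE_SIZE  (16ULL << 30)             /* 16 GBytes total    */

static const char *path = "/mnt/lustre/shared_file";   /* illustrative path */

static void *writer(void *arg)
{
    long tid = (long)arg;
    char *buf = malloc(BLOCK_SIZE);
    int fd = open(path, O_WRONLY);           /* each thread has its own fd */
    off_t off;

    if (fd < 0 || buf == NULL) {
        perror("open/malloc");
        return NULL;
    }
    memset(buf, 'a' + (int)tid, BLOCK_SIZE);

    /* Thread t writes blocks t, t+NTHREADS, t+2*NTHREADS, ... (round robin). */
    for (off = (off_t)tid * BLOCK_SIZE; off < (off_t)FILE_SIZE;
         off += (off_t)NTHREADS * BLOCK_SIZE)
        if (pwrite(fd, buf, BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
            perror("pwrite");

    close(fd);
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    long i;

    /* Create/truncate the shared file once before the threads start. */
    close(open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644));

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, writer, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}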
It's part of the lfs lustre tool; I have not used it myself. Try 'lfs help join'.

Evan

-----Original Message-----
From: Kshitij Mehta [mailto:kmehta at cs.uh.edu]
Sent: Thursday, June 09, 2011 10:58 AM
To: kmehta at cs.uh.edu
Cc: Felix, Evan J; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

I read in a research paper
(http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about
Lustre's ability to join files in place. Can someone point me to sample
code and documentation on this? I couldn't find information in the
manual. Being able to join files in place could be a potential solution
to the issue I have.

Thanks,
Kshitij

On 06/06/2011 01:20 PM, kmehta at cs.uh.edu wrote:
>> are the separate files being striped 8 ways?
>> Because that would allow them to hit possibly all 64 OSTs, while
>> the shared file case will only hit 8
>
> Yes, I found out that the files are getting striped 8 ways, so we end
> up hitting 64 OSTs. This is what I tried next:
>
> 1. Ran a test case where 6 threads write separate files, each of size
> 6 GB, to a directory configured over 8 OSTs. Thus the application
> writes 36 GB of data in total, over 48 OSTs.
>
> 2. Ran a test case where 8 threads write a common file of size 36 GB
> to a directory configured over 48 OSTs.
>
> Thus both tests ultimately write 36 GB of data over 48 OSTs. I still
> see a bandwidth of 240 MBps for test 2 (common file) and 740 MBps for
> test 1 (separate files).
>
> Thanks,
> Kshitij
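Kevin Van Maren's earlier suggestion in this thread was to have each thread own the portion of the file that maps to a fixed subset of OSTs, so that every OST object is written sequentially by exactly one thread. Below is a hedged sketch of that offset calculation, meant to replace the round-robin loop in the previous sketch. The stripe size, stripe count, and thread count are the numbers quoted in the discussion, but they remain assumptions about the actual setup, and the buffer passed in must hold a full chunk.

/*
 * Sketch of the "each thread owns the file region mapped to its OSTs"
 * layout: with a 1 MByte stripe size, 48 stripes and 8 threads, each
 * thread writes a 6 MByte chunk at a 48 MByte period, so each OST
 * object sees a single sequential writer. All constants are assumptions.
 */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE  (1UL << 20)                                 /* 1 MByte */
#define STRIPE_COUNT 48
#define NTHREADS     8
#define CHUNK  ((off_t)STRIPE_SIZE * (STRIPE_COUNT / NTHREADS))  /* 6 MBytes  */
#define ROW    ((off_t)STRIPE_SIZE * STRIPE_COUNT)               /* 48 MBytes */

/* buf must hold at least CHUNK bytes. */
static void write_owned_region(int fd, long tid, const char *buf,
                               off_t file_size)
{
    off_t off;

    /* Thread tid owns stripes [tid*6 .. tid*6+5] of every 48 MByte row. */
    for (off = tid * CHUNK; off < file_size; off += ROW)
        if (pwrite(fd, buf, CHUNK, off) != (ssize_t)CHUNK)
            perror("pwrite");
}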
On 2011-06-09, at 11:57 AM, Kshitij Mehta wrote:
> I read in a research paper
> (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about
> Lustre's ability to join files in place. Can someone point me to sample
> code and documentation on this? I couldn't find information in the
> manual. Being able to join files in place could be a potential solution
> to the issue I have.

That feature was mostly experimental, and has been disabled in newer
versions of Lustre.
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
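For completeness, the case that reached roughly 700 MBytes/sec in the measurements quoted above was one file per thread rather than a shared file. Under the same assumptions as the first sketch, only the open() call in the writer changes; the path below is illustrative.

/*
 * Per-thread-file variant (the fast case in the thread): each thread
 * opens its own output file, and offsets then simply advance from 0 in
 * BLOCK_SIZE steps within that private file.
 */
#include <fcntl.h>
#include <stdio.h>

static int open_private_file(long tid)
{
    char name[256];

    snprintf(name, sizeof(name), "/mnt/lustre/out_file.%ld", tid);
    return open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}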