kmehta at cs.uh.edu
2011-May-23 21:28 UTC
[Lustre-community] Poor multithreaded I/O performance
Hello,
I am running a multithreaded application that writes to a common shared file on a Lustre filesystem, and this is what I see:

If I have a single thread in my application, I get a bandwidth of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe size). However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle), I still get the same bandwidth.

Now, instead of writing to a shared file, if these threads write to separate files, the bandwidth obtained is approx. 700 MBytes/sec.

I would ideally like my multithreaded application to see similar scaling. Any ideas why the performance is limited, and any workarounds?

Thank you,
Kshitij
What is your stripe count on the file? If your default is 1, you are only writing to one of the OSTs. You can check with the lfs getstripe command; you can set the stripe count higher, and hopefully your wide-striped file with threaded writes will be faster.

Evan
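As a concrete illustration of the suggestion above, using the lfs syntax of that era (the paths are placeholders, and newer Lustre releases spell the options --stripe-count and --stripe-size):

    $ lfs getstripe /path/to/output_file             # show the stripe count and size actually in use
    $ lfs setstripe -c 8 -s 1M /path/to/output_dir   # files created here afterwards get 8 stripes of 1 MB

Striping is fixed when a file is created, so setstripe is normally applied to the directory (or to an empty file) before the benchmark writes its data.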
Kevin Van Maren
2011-May-23 21:34 UTC
[Lustre-community] Poor multithreaded I/O performance
kmehta at cs.uh.edu wrote:
> The stripe count is 48.

With only 11 OSTs?
kmehta at cs.uh.edu
2011-May-23 21:35 UTC
[Lustre-community] Poor multithreaded I/O performance
The stripe count is 48.

Just FYI, this is what my application does: a simple I/O test where threads continually write blocks of size 64 KBytes or 1 MByte (decided at compile time) until a large file of, say, 16 GBytes is created.

Thanks,
Kshitij
Wojciech Turek
2011-May-23 21:37 UTC
[Lustre-community] Poor multithreaded I/O performance
Run lfs getstripe <your_output_file> and paste the output of that command to the mailing list. A stripe count of 48 is not possible if you have at most 11 OSTs (the maximum stripe count would be 11). If your striping is correct, the bottleneck may be your client network.

regards,

Wojciech

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
kmehta at cs.uh.edu
2011-May-23 21:46 UTC
[Lustre-community] Poor multithreaded I/O performance
This is what my system documentation says: "Lustre filesystem is exported by 11 servers via Infiniband". I guess this means 11 OSTs (my apologies if it doesn't).

This is the output of the lfs getstripe command:

$ lfs getstripe my_output_1824315 --quiet --verbose
OBDS:
0: fastfs-OST0000_UUID ACTIVE
1: fastfs-OST0001_UUID ACTIVE
2: fastfs-OST0002_UUID ACTIVE
3: fastfs-OST0003_UUID ACTIVE
4: fastfs-OST0004_UUID ACTIVE
5: fastfs-OST0005_UUID ACTIVE
6: fastfs-OST0006_UUID ACTIVE
7: fastfs-OST0007_UUID ACTIVE
8: fastfs-OST0008_UUID ACTIVE
9: fastfs-OST0009_UUID ACTIVE
10: fastfs-OST000a_UUID ACTIVE
11: fastfs-OST000b_UUID ACTIVE
12: fastfs-OST000c_UUID ACTIVE
13: fastfs-OST000d_UUID ACTIVE
14: fastfs-OST000e_UUID ACTIVE
15: fastfs-OST000f_UUID ACTIVE
16: fastfs-OST0010_UUID ACTIVE
17: fastfs-OST0011_UUID ACTIVE
18: fastfs-OST0012_UUID ACTIVE
19: fastfs-OST0013_UUID ACTIVE
20: fastfs-OST0014_UUID ACTIVE
21: fastfs-OST0015_UUID ACTIVE
22: fastfs-OST0016_UUID ACTIVE
23: fastfs-OST0017_UUID ACTIVE
24: fastfs-OST0018_UUID ACTIVE
25: fastfs-OST0019_UUID ACTIVE
26: fastfs-OST001a_UUID ACTIVE
27: fastfs-OST001b_UUID ACTIVE
28: fastfs-OST001c_UUID ACTIVE
29: fastfs-OST001d_UUID ACTIVE
30: fastfs-OST001e_UUID ACTIVE
31: fastfs-OST001f_UUID ACTIVE
32: fastfs-OST0020_UUID ACTIVE
33: fastfs-OST0021_UUID ACTIVE
34: fastfs-OST0022_UUID ACTIVE
35: fastfs-OST0023_UUID ACTIVE
36: fastfs-OST0024_UUID ACTIVE
37: fastfs-OST0025_UUID ACTIVE
38: fastfs-OST0026_UUID ACTIVE
39: fastfs-OST0027_UUID ACTIVE
40: fastfs-OST0028_UUID ACTIVE
41: fastfs-OST0029_UUID ACTIVE
42: fastfs-OST002a_UUID ACTIVE
43: fastfs-OST002b_UUID ACTIVE
44: fastfs-OST002c_UUID ACTIVE
45: fastfs-OST002d_UUID ACTIVE
46: fastfs-OST002e_UUID ACTIVE
47: fastfs-OST002f_UUID ACTIVE
48: fastfs-OST0030_UUID ACTIVE
49: fastfs-OST0031_UUID ACTIVE
50: fastfs-OST0032_UUID ACTIVE
51: fastfs-OST0033_UUID ACTIVE
52: fastfs-OST0034_UUID ACTIVE
53: fastfs-OST0035_UUID ACTIVE
54: fastfs-OST0036_UUID ACTIVE
55: fastfs-OST0037_UUID ACTIVE
56: fastfs-OST0038_UUID ACTIVE
57: fastfs-OST0039_UUID ACTIVE
58: fastfs-OST003a_UUID ACTIVE
59: fastfs-OST003b_UUID ACTIVE
60: fastfs-OST003c_UUID ACTIVE
61: fastfs-OST003d_UUID ACTIVE
62: fastfs-OST003e_UUID ACTIVE
63: fastfs-OST003f_UUID ACTIVE
my_output_1824315
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x3c3839d
lmm_stripe_count:   48
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
        obdidx     objid      objid   group
             5   6096574   0x5d06be       0
            25   6216932   0x5edce4       0
             9   6428932   0x621904       0
            27   6275058   0x5fbff2       0
            19   6290046   0x5ffa7e       0
            48   6082133   0x5cce55       0
            58   6223558   0x5ef6c6       0
            40   6153492   0x5de514       0
            59   6269987   0x5fac23       0
            15   5587155   0x5540d3       0
            46   6191301   0x5e78c5       0
            26   6444958   0x62579e       0
            54   6421150   0x61fa9e       0
            34   6222465   0x5ef281       0
            55   6288603   0x5ff4db       0
            13   6360247   0x610cb7       0
             8   5921168   0x5a5990       0
            29   6144665   0x5dc299       0
            63   5799435   0x587e0b       0
            53   6356594   0x60fe72       0
             6   6214509   0x5ed36d       0
            61   6319347   0x606cf3       0
            43   6414677   0x61e155       0
            36   5790422   0x585ad6       0
            18   6222532   0x5ef2c4       0
            28   5921782   0x5a5bf6       0
             1   6361844   0x6112f4       0
            41   5746110   0x57adbe       0
            35   6043439   0x5c372f       0
            45   6122676   0x5d6cb4       0
             2   6193223   0x5e8047       0
            62   5902764   0x5a11ac       0
            56   6511354   0x635afa       0
            23   5576293   0x551665       0
            14   6258551   0x5f7f77       0
            12   6109474   0x5d3922       0
            60   6407726   0x61c62e       0
            57   6243713   0x5f4581       0
            20   6249079   0x5f5a77       0
             3   5639606   0x560db6       0
            50   5982718   0x5b49fe       0
            31   6372788   0x613db4       0
            52   6502335   0x6337bf       0
            32   4738970   0x484f9a       0
            38   5440109   0x53026d       0
            51   4683453   0x4776bd       0
            39   6391955   0x618893       0
            16   5755161   0x57d119       0
kmehta at cs.uh.edu
2011-May-23 22:04 UTC
[Lustre-community] Poor multithreaded I/O performance
So I think there are 11 servers (OSSes, not OSTs; sorry). Running 'lfs check osts' returns 64 entries, so I think the system has been configured with 64 OSTs.

- Kshitij
kmehta at cs.uh.edu
2011-May-23 22:09 UTC
[Lustre-community] Poor multithreaded I/O performance
Actually, 'lfs check servers' returns 64 entries as well, so I presume the system documentation is out of date.

Again, I am sorry the basic information was incorrect.

- Kshitij
Wojciech Turek
2011-May-23 23:52 UTC
[Lustre-community] Poor multithreaded I/O performance
OK, so it looks like you have 64 OSTs in total and your output file is striped across 48 of them. May I suggest that you limit the number of stripes; a good number to start with would be 8 stripes. For best results, also use the OST pools feature to arrange that each stripe goes to an OST owned by a different OSS.

regards,

Wojciech
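As a sketch of the OST pools idea (the pool name and OST indices below are hypothetical, the pool commands run via lctl on the MGS, and exact syntax can differ between Lustre releases):

    mgs# lctl pool_new fastfs.bench8
    mgs# lctl pool_add fastfs.bench8 fastfs-OST[0000-0007]
    client$ lfs setstripe -c 8 -p bench8 /path/to/output_dir

Choosing the eight OST indices so that no two are served by the same OSS is what spreads the eight stripes across eight different servers.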
Kevin Van Maren
2011-May-24 14:16 UTC
[Lustre-community] Poor multithreaded I/O performance
[Moved to Lustre-discuss]

"However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle)"

How exactly does your multi-threaded application write the data? Are you using pwrite to ensure non-overlapping regions, or are they all just doing unlocked write() operations on the same fd (each just transferring size/8)? If it divides the file into N pieces, and each thread does pwrite on its piece, then what each OST sees are multiple streams at wide offsets to the same object, which could impact performance.

If on the other hand the file is written sequentially, where each thread grabs the next piece to be written (with locking normally used for the current_offset value, so you know where each chunk is actually going), then you get a more sequential pattern at the OST.

If the number of threads maps to the number of OSTs (or some modulo, like in your case 6 OSTs per thread), and each thread "owns" the piece of the file that belongs to an OST (i.e.: for (offset = thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need for application locks (assuming the use of pwrite) and ensured each OST object is being written sequentially.

It's quite possible there is some bottleneck on the shared fd. So perhaps the question is not why you aren't scaling with more threads, but why the single file is not able to saturate the client, or why the file BW is not scaling with more OSTs. It is somewhat common for multiple processes (on different nodes) to write non-overlapping regions of the same file; does performance improve if each thread opens its own file descriptor?

Kevin
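To make the suggested pattern concrete, here is a minimal sketch (not the simple_io_test.c attached later in the thread) of the loop Kevin describes, assuming 8 threads, a 48-stripe file with a 1 MB stripe size, a 6 MB chunk per thread per cycle, and a 16 GB file; the path is a placeholder and error checking is omitted. Build with something like: cc -O2 -pthread -D_FILE_OFFSET_BITS=64 sketch.c -o sketch

    #define _XOPEN_SOURCE 500               /* for pwrite() */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define CHUNK    (6ULL << 20)           /* 6 MB: six consecutive 1 MB stripe units */
    #define CYCLE    (NTHREADS * CHUNK)     /* 48 MB: one full pass over a 48-stripe file */
    #define FILESIZE (16ULL << 30)          /* 16 GB of data in total */

    static const char *path = "/path/to/lustre/shared_file";   /* placeholder; pre-create with lfs setstripe -c 48 */

    static void *writer(void *arg)
    {
        long tid = (long)arg;
        char *buf = calloc(1, CHUNK);                      /* payload contents do not matter for a bandwidth test */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);     /* one descriptor per thread */
        off_t off;

        /* Thread tid owns offsets tid*6MB, tid*6MB+48MB, tid*6MB+96MB, ...
         * so each of "its" six OST objects is written sequentially by one thread only. */
        for (off = (off_t)(tid * CHUNK); off < (off_t)FILESIZE; off += (off_t)CYCLE)
            pwrite(fd, buf, (size_t)CHUNK, off);

        close(fd);
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, (void *)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because each thread writes contiguous 6 MB runs spaced 48 MB apart, every OST object receives a single sequential stream instead of interleaved 1 MB writes from all eight threads.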
kmehta at cs.uh.edu
2011-May-24 15:36 UTC
[Lustre-discuss] Poor multithreaded I/O performance
This is what my application does:

Each thread has its own file descriptor to the file. I use pwrite to ensure non-overlapping regions, as follows:

Thread 0, data_size: 1MB, offset: 0
Thread 1, data_size: 1MB, offset: 1MB
Thread 2, data_size: 1MB, offset: 2MB
Thread 3, data_size: 1MB, offset: 3MB
<repeat cycle>
Thread 0, data_size: 1MB, offset: 4MB
and so on. (This happens in parallel; I don't wait for one cycle to end before the next one begins.)

I am going to try the following:

a) Instead of a round-robin distribution of offsets, test with sequential offsets:
Thread 0, data_size: 1MB, offset: 0
Thread 0, data_size: 1MB, offset: 1MB
Thread 0, data_size: 1MB, offset: 2MB
Thread 0, data_size: 1MB, offset: 3MB
Thread 1, data_size: 1MB, offset: 4MB
and so on. (I am going to keep these as separate pwrite I/O requests instead of merging them or using writev.)

b) Map the threads to the number of OSTs using some modulo, as suggested in Kevin's email.

c) Experiment with fewer OSTs (I currently have 48).

I shall report back with my findings.

Thanks,
Kshitij
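By way of comparison, the round-robin pattern described above amounts to changing only the stride in the per-thread loop of the sketch shown after Kevin's message (again just a sketch; BLOCK would be 1 MB here, and fd, buf, tid, NTHREADS and FILESIZE are set up exactly as in that sketch):

    /* Round-robin: thread tid writes blocks tid, tid+NTHREADS, tid+2*NTHREADS, ...
     * With BLOCK equal to the 1 MB stripe size, thread tid only ever touches
     * stripe positions tid, tid+NTHREADS, ... modulo the stripe count, i.e. a
     * fixed subset of the OSTs (exactly one OST when the stripe count equals NTHREADS). */
    for (off = (off_t)(tid * BLOCK); off < (off_t)FILESIZE; off += (off_t)(NTHREADS * BLOCK))
        pwrite(fd, buf, (size_t)BLOCK, off);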
kmehta at cs.uh.edu
2011-May-26 19:02 UTC
[Lustre-discuss] Poor multithreaded I/O performance
OK, I ran the following tests:

[1]
The application spawns 8 threads. I write to Lustre using 8 OSTs. Each thread writes data in blocks of 1 MByte in a round-robin fashion, i.e.

T0 writes to offsets 0, 8MB, 16MB, etc.
T1 writes to offsets 1MB, 9MB, 17MB, etc.

The stripe size being 1 MByte, every thread ends up writing to only 1 OST. I see a bandwidth of 280 MBytes/sec, similar to the single-thread performance.

[2]
I also ran the same test such that every thread writes data in blocks of 8 MBytes for the same stripe size (thus, every thread will write to every OST). I still get similar performance, ~280 MBytes/sec, so essentially I see no difference between each thread writing to a single OST and each thread writing to all OSTs.

And as I said before, if all threads write to their own separate files, the resulting bandwidth is ~700 MBytes/sec.

I have attached my C file (simple_io_test.c) herewith. Maybe you could run it and see where the bottleneck is. Comments and instructions for compilation have been included in the file. Do let me know if you need any clarification on that.

Your help is appreciated,
Kshitij

-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_io_test.c
Type: text/x-csrc
Size: 9579 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110526/16b2680f/attachment.bin
Hi David,

I am writing to an existing directory that has supposedly been configured on 8 OSTs. I have shown below the output of lfs getstripe on the directory and on the output file generated by the program. It seems that the file is correctly striped across 8 OSTs.

In one of the previous emails, Wojciech suggested I make sure that the OSTs belong to different OSSes using the OST pools feature. Can someone suggest how I can verify that an existing directory is configured on OSTs belonging to different OSSes? (Though I have a hint that that's not the problem, since writing to separate files in the same directory does give me ~700 MBytes/sec.)

lfs getstripe ../../ss_8/ --verbose
--------------------------------------------------------------------
OBDS:
0: fastfs-OST0000_UUID ACTIVE
   [OSTs 1 through 62 likewise ACTIVE]
63: fastfs-OST003f_UUID ACTIVE
../../ss_8
stripe_count: 8 stripe_size: 0 stripe_offset: -1
--------------------------------------------------------------------

Running lfs getstripe on the 16 GByte file generated by the program shows this:

lfs getstripe ../../ss_8/kmtest.txt --verbose
--------------------------------------------------------------------
OBDS:
0: fastfs-OST0000_UUID ACTIVE
   [OSTs 1 through 62 likewise ACTIVE]
63: fastfs-OST003f_UUID ACTIVE
../../ss_8/kmtest.txt
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x3c52894
lmm_stripe_count:   8
lmm_stripe_size:    1048576
lmm_stripe_pattern: 1
        obdidx     objid      objid   group
            10   6352973   0x60f04d       0
            20   6260051   0x5f8553       0
             4   5733251   0x577b83       0
            22   6381603   0x616023       0
            17   6265103   0x5f990f       0
            45   6133999   0x5d98ef       0
            31   6383869   0x6168fd       0
            58   6234719   0x5f225f       0
--------------------------------------------------------------------

Thanks,
Kshitij

-----Original Message-----
From: David Vasil [mailto:dvasil at ddn.com]
Sent: Thursday, May 26, 2011 3:01 PM
To: kmehta at cs.uh.edu
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

Hi Kshitij,

Did you create your files with 'lfs setstripe -c <stripe count> <file>' before writing to them, or did you create a directory with a default stripe count greater than 1? It sounds like you are only striping across 1 OST. After writing your file out, perform an:

lfs getstripe <file>

Try pre-creating a more widely striped file with:

lfs setstripe -c N <file>

where N is > 1. You can create a directory where all files under the hierarchy will be striped using more OSTs in the same manner with lfs setstripe.

_____
David Vasil
DataDirect Networks
615.307.0865
dvasil at ddn.com
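One way to check the OST-to-OSS mapping from a client on Lustre releases of that era, assuming the ost_conn_uuid parameter is available (the exact parameter name can vary between versions), is to query the OSC devices:

    $ lctl get_param osc.fastfs-OST*.ost_conn_uuid

This prints, for each OST, the NID of the server it is currently connected to, so you can see how many distinct OSSes back the OSTs a given file is striped over.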
kmehta at cs.uh.edu
2011-Jun-03 00:06 UTC
[Lustre-discuss] Poor multithreaded I/O performance
Hello,
I was wondering if anyone could replicate the performance of the multithreaded application using the C file that I posted in my previous email.

Thanks,
Kshitij

> Ok I ran the following tests:
>
> [1]
> Application spawns 8 threads. I write to Lustre having 8 OSTs.
> Each thread writes data in blocks of 1 MByte in a round-robin fashion, i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
> The stripe size being 1 MByte, every thread ends up writing to only 1 OST.
>
> I see a bandwidth of 280 MBytes/sec, similar to the single-thread performance.
>
> [2]
> I also ran the same test such that every thread writes data in blocks of 8 MBytes for the same stripe size. (Thus, every thread will write to every OST.) I still get similar performance, ~280 MBytes/sec, so essentially I see no difference between each thread writing to a single OST vs. each thread writing to all OSTs.
>
> And as I said before, if all threads write to their own separate files, the resulting bandwidth is ~700 MBytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you could run it and see where the bottleneck is. Comments and instructions for compilation have been included in the file. Do let me know if you need any clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB
>> and so on. (This happens in parallel; I don't wait for one cycle to end before the next one begins.)
>>
>> I am gonna try the following:
>>
>> a) Instead of a round-robin distribution of offsets, test with sequential offsets:
>>    Thread 0, data_size: 1MB, offset: 0
>>    Thread 0, data_size: 1MB, offset: 1MB
>>    Thread 0, data_size: 1MB, offset: 2MB
>>    Thread 0, data_size: 1MB, offset: 3MB
>>
>>    Thread 1, data_size: 1MB, offset: 4MB
>>    and so on. (I am gonna keep these as separate pwrite I/O requests instead of merging them or using writev.)
>>
>> b) Map the threads to the no. of OSTs using some modulo, as suggested in the email below.
>>
>> c) Experiment with a smaller no. of OSTs (I currently have 48).
>>
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
>>> [Moved to Lustre-discuss]
>>>
>>> "However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle)"
>>>
>>> How exactly does your multi-threaded application write the data? Are you using pwrite to ensure non-overlapping regions, or are they all just doing unlocked write() operations on the same fd for each write (each just transferring size/8)? If it divides the file into N pieces, and each thread does pwrite on its piece, then what each OST sees are multiple streams at wide offsets to the same object, which could impact performance.
>>>
>>> If on the other hand the file is written sequentially, where each thread grabs the next piece to be written (locking normally used for the current_offset value, so you know where each chunk is actually going), then you get a more sequential pattern at the OST.
>>>
>>> If the number of threads maps to the number of OSTs (or some modulo, like in your case 6 OSTs per thread), and each thread "owns" the piece of the file that belongs to an OST (ie: for (offset = thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need for application locks (assuming the use of pwrite) and ensured each OST object is being written sequentially.
>>>
>>> It's quite possible there is some bottleneck on the shared fd. So perhaps the question is not why you aren't scaling with more threads, but why the single file is not able to saturate the client, or why the file BW is not scaling with more OSTs. It is somewhat common for multiple processes (on different nodes) to write non-overlapping regions of the same file; does performance improve if each thread opens its own file descriptor?
>>>
>>> Kevin
>>>
>>> Wojciech Turek wrote:
>>>> Ok, so it looks like you have in total 64 OSTs and your output file is striped across 48 of them. May I suggest that you limit the number of stripes; a good number to start with would be 8 stripes, and for best results use the OST pools feature to arrange that each stripe goes to an OST owned by a different OSS.
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>>
>>>>     Actually, 'lfs check servers' returns 64 entries as well, so I presume the system documentation is out of date.
>>>>
>>>>     Again, I am sorry the basic information had been incorrect.
>>>>
>>>>     - Kshitij
>>>>
>>>>     > Run lfs getstripe <your_output_file> and paste the output of that command to the mailing list.
>>>>     > Stripe count of 48 is not possible if you have max 11 OSTs (the max stripe count will be 11).
>>>>     > If your striping is correct, the bottleneck can be your client network.
>>>>     >
>>>>     > regards,
>>>>     >
>>>>     > Wojciech
What file sizes and segment sizes are you using for your tests?

Evan

> Hello,
> I was wondering if anyone could replicate the performance of the multithreaded application using the C file that I posted in my previous email.
>
> Thanks,
> Kshitij
kmehta at cs.uh.edu
2011-Jun-03 16:53 UTC
[Lustre-discuss] Poor multithreaded I/O performance
I ran the test with a 16 GByte file size and segment sizes of 64 KBytes, 1 MByte, and 8 MBytes.

Thanks,
Kshitij

> What file sizes and segment sizes are you using for your tests?
>
> Evan
I've been trying to test this, but not finding an obvious error... so, more questions:

How much RAM do you have on your client, and how much on the OSTs? Some of my smaller tests go much faster, but I believe that is due to cache effects. My larger test at 32GB gives pretty consistent results.

The other thing to consider: are the separate files being striped 8 ways? Because that would allow them to hit possibly all 64 OSTs, while the shared-file case will only hit 8.

Evan

> I ran the test with a 16 GByte file size and segment sizes of 64 KBytes, 1 MByte, and 8 MBytes.
>
> Thanks,
> Kshitij
kmehta at cs.uh.edu
2011-Jun-06 18:20 UTC
[Lustre-discuss] Poor multithreaded I/O performance
> are the separate files being striped 8 ways?
> Because that would allow them to hit possibly all 64 OSTs, while the
> shared-file case will only hit 8

Yes, I found out that the files are getting striped 8 ways, so we end up hitting 64 OSTs. This is what I tried next:

1. Ran a test case where 6 threads write separate files, each of size 6 GB, to a directory configured over 8 OSTs. Thus the application writes 36 GB of data in total, over 48 OSTs (6 files x 8 stripes each).

2. Ran a test case where 8 threads write a common file of size 36 GB to a directory configured over 48 OSTs.

Thus both tests ultimately write 36 GB of data over 48 OSTs. I still see a bandwidth of 240 MBps for test 2 (common file), and 740 MBps for test 1 (separate files).

Thanks,
Kshitij
I read in a research paper (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about Lustre''s ability to join files in place. Can someone point me to sample code and documentation on this? I couldnt find information in the manual. Being able to join files in place could be a potential solution to the issue I have. Thanks, Kshitij On 06/06/2011 01:20 PM, kmehta at cs.uh.edu wrote:>> are the separate files being striped 8 ways? >> Because that would allow them to hit possibly all 64 OST''s, while the >> shared file case will only hit 8 > Yes, I found out that the files are getting striped 8 ways, so we end up > hitting 64 OSTs. This is what I tried next: > > 1. Ran a test case where 6 threads write separate files, each of size 6 > GB, to a directory configured over 8 OSTs. Thus the application writes > 36GB of data in total, over 48 OSTs. > > 2. Ran a test case where 8 threads write a common file of size 36GB to a > directory configured over 48 OSTs. > > Thus both tests ultimately write 36GB of data over 48 OSTS. I still see a > b/w of 240MBps for test 2 (common file), and b/w of 740 MBps for test 1 > (separate files). > > Thanks, > Kshitij > >> I''ve been trying to test this, but not finding an obvious error... so >> more questions: >> >> How much RAM do you have on your client, and how much on the OST''s some >> of my smaller tests go much faster, but I believe that it is cache based >> effects. My larger test at 32GB gives pretty consistent results. >> >> The other thing to consider: are the separate files being striped 8 ways? >> Because that would allow them to hit possibly all 64 OST''s, while the >> shared file case will only hit 8. >> >> Evan >> >> -----Original Message----- >> From: lustre-discuss-bounces at lists.lustre.org >> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Felix, Evan >> J >> Sent: Friday, June 03, 2011 9:09 AM >> To: kmehta at cs.uh.edu >> Cc: Lustre discuss >> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance >> >> What file sizes and segment sizes are you using for your tests? >> >> Evan >> >> -----Original Message----- >> From: lustre-discuss-bounces at lists.lustre.org >> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of >> kmehta at cs.uh.edu >> Sent: Thursday, June 02, 2011 5:07 PM >> To: kmehta at cs.uh.edu >> Cc: kmehta at cs.uh.edu; Lustre discuss >> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance >> >> Hello, >> I was wondering if anyone could replicate the performance of the >> multithreaded application using the C file that I posted in my previous >> email. >> >> Thanks, >> Kshitij >> >> >>> Ok I ran the following tests: >>> >>> [1] >>> Application spawns 8 threads. I write to Lustre having 8 OSTs. >>> Each thread writes data in blocks of 1 Mbyte in a round robin fashion, >>> i.e. >>> >>> T0 writes to offsets 0, 8MB, 16MB, etc. >>> T1 writes to offsets 1MB, 9MB, 17MB, etc. >>> The stripe size being 1MByte, every thread ends up writing to only 1 >>> OST. >>> >>> I see a bandwidth of 280 Mbytes/sec, similar to the single thread >>> performance. >>> >>> [2] >>> I also ran the same test such that every thread writes data in blocks >>> of 8 Mbytes for the same stripe size. (Thus, every thread will write >>> to every OST). I still get similar performance, ~280Mbytes/sec, so >>> essentially I see no difference between each thread writing to a >>> single OST vs each thread writing to all OSTs. 
>>>
>>> And as I said before, if all threads write to their own separate file,
>>> the resulting bandwidth is ~700 MBytes/sec.
>>>
>>> I have attached my C file (simple_io_test.c) herewith. Maybe you could
>>> run it and see where the bottleneck is. Comments and instructions for
>>> compilation have been included in the file. Do let me know if you need
>>> any clarification on that.
>>>
>>> Your help is appreciated,
>>> Kshitij

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
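For reference, below is a minimal sketch of the test pattern described in this thread: NTHREADS threads share one output file, each thread opens its own file descriptor, and each issues 1 MByte pwrite() calls at non-overlapping, round-robin offsets until the file reaches its target size. This is a reconstruction from the descriptions in the thread, not the simple_io_test.c attachment; the file path, sizes, thread count, build line, and error handling are placeholders.

/*
 * Sketch of the shared-file, round-robin pwrite pattern discussed above.
 * Assumed build line: gcc -O2 -pthread -D_FILE_OFFSET_BITS=64 sketch.c -o sketch
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   8
#define BLOCK_SIZE (1ULL << 20)              /* 1 MByte per pwrite */
#define FILE_SIZE  (16ULL << 30)             /* 16 GBytes total    */

static const char *path = "/mnt/lustre/shared_file";   /* illustrative path */

static void *writer(void *arg)
{
    long tid = (long)arg;
    char *buf = malloc(BLOCK_SIZE);
    int fd = open(path, O_WRONLY);           /* each thread has its own fd */
    off_t off;

    if (fd < 0 || buf == NULL) {
        perror("open/malloc");
        return NULL;
    }
    memset(buf, 'a' + (int)tid, BLOCK_SIZE);

    /* Thread t writes blocks t, t+NTHREADS, t+2*NTHREADS, ... (round robin). */
    for (off = (off_t)tid * BLOCK_SIZE; off < (off_t)FILE_SIZE;
         off += (off_t)NTHREADS * BLOCK_SIZE)
        if (pwrite(fd, buf, BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
            perror("pwrite");

    close(fd);
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    long i;

    /* Create/truncate the shared file once before the threads start. */
    close(open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644));

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, writer, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}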
It's part of the lfs lustre tool; I have not used it myself. Try 'lfs help join'.

Evan

-----Original Message-----
From: Kshitij Mehta [mailto:kmehta at cs.uh.edu]
Sent: Thursday, June 09, 2011 10:58 AM
To: kmehta at cs.uh.edu
Cc: Felix, Evan J; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

I read in a research paper
(http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about
Lustre's ability to join files in place. Can someone point me to sample
code and documentation on this? I couldn't find information in the
manual. Being able to join files in place could be a potential solution
to the issue I have.

Thanks,
Kshitij

On 06/06/2011 01:20 PM, kmehta at cs.uh.edu wrote:
>> are the separate files being striped 8 ways?
>> Because that would allow them to hit possibly all 64 OSTs, while
>> the shared file case will only hit 8
>
> Yes, I found out that the files are getting striped 8 ways, so we end
> up hitting 64 OSTs. This is what I tried next:
>
> 1. Ran a test case where 6 threads write separate files, each of size
> 6 GB, to a directory configured over 8 OSTs. Thus the application
> writes 36 GB of data in total, over 48 OSTs.
>
> 2. Ran a test case where 8 threads write a common file of size 36 GB
> to a directory configured over 48 OSTs.
>
> Thus both tests ultimately write 36 GB of data over 48 OSTs. I still
> see a bandwidth of 240 MBps for test 2 (common file) and 740 MBps for
> test 1 (separate files).
>
> Thanks,
> Kshitij
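Kevin Van Maren's earlier suggestion in this thread was to have each thread own the portion of the file that maps to a fixed subset of OSTs, so that every OST object is written sequentially by exactly one thread. Below is a hedged sketch of that offset calculation, meant to replace the round-robin loop in the previous sketch. The stripe size, stripe count, and thread count are the numbers quoted in the discussion, but they remain assumptions about the actual setup, and the buffer passed in must hold a full chunk.

/*
 * Sketch of the "each thread owns the file region mapped to its OSTs"
 * layout: with a 1 MByte stripe size, 48 stripes and 8 threads, each
 * thread writes a 6 MByte chunk at a 48 MByte period, so each OST
 * object sees a single sequential writer. All constants are assumptions.
 */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE  (1UL << 20)                                 /* 1 MByte */
#define STRIPE_COUNT 48
#define NTHREADS     8
#define CHUNK  ((off_t)STRIPE_SIZE * (STRIPE_COUNT / NTHREADS))  /* 6 MBytes  */
#define ROW    ((off_t)STRIPE_SIZE * STRIPE_COUNT)               /* 48 MBytes */

/* buf must hold at least CHUNK bytes. */
static void write_owned_region(int fd, long tid, const char *buf,
                               off_t file_size)
{
    off_t off;

    /* Thread tid owns stripes [tid*6 .. tid*6+5] of every 48 MByte row. */
    for (off = tid * CHUNK; off < file_size; off += ROW)
        if (pwrite(fd, buf, CHUNK, off) != (ssize_t)CHUNK)
            perror("pwrite");
}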
On 2011-06-09, at 11:57 AM, Kshitij Mehta wrote:
> I read in a research paper
> (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about
> Lustre's ability to join files in place. Can someone point me to sample
> code and documentation on this? I couldn't find information in the
> manual. Being able to join files in place could be a potential solution
> to the issue I have.

That feature was mostly experimental, and has been disabled in newer
versions of Lustre.
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
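For completeness, the case that reached roughly 700 MBytes/sec in the measurements quoted above was one file per thread rather than a shared file. Under the same assumptions as the first sketch, only the open() call in the writer changes; the path below is illustrative.

/*
 * Per-thread-file variant (the fast case in the thread): each thread
 * opens its own output file, and offsets then simply advance from 0 in
 * BLOCK_SIZE steps within that private file.
 */
#include <fcntl.h>
#include <stdio.h>

static int open_private_file(long tid)
{
    char name[256];

    snprintf(name, sizeof(name), "/mnt/lustre/out_file.%ld", tid);
    return open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}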