Hi,

The I/O performance of CNL (as measured with IOR) seems quite different
for a shared file compared with the same workload on separate files.

Here are some numbers from a smaller file system on an XT system at ORNL.
All files are striped across 72 OSTs. I deliberately use a block size of
8512m.

1. sample tests with separate files
# aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -F -o iortes
Max Write: 9978.18 MiB/sec (10462.88 MB/sec)
Max Read:  5612.78 MiB/sec (5885.43 MB/sec)

2. sample shared-file performance
# aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -o iortes
Max Write: 6817.31 MiB/sec (7148.47 MB/sec)
Max Read:  5591.98 MiB/sec (5863.62 MB/sec)

In addition, using my experimental MPI-IO library, I noticed that enabling
direct I/O can have varying effects on I/O under CNL.

3. sample separate-file performance with direct I/O
export MPIO_DIRECT_WRITE=true; export MPIO_DIRECT_READ=true; aprun -n 32 -N 1 ~/benchmarks/IOR-2.10.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -F -k -o lustre:iortest
Max Write: 9353.66 MiB/sec (9808.03 MB/sec)
Max Read:  8269.28 MiB/sec (8670.97 MB/sec)

4. sample shared-file performance with direct I/O
# export MPIO_DIRECT_WRITE=true; export MPIO_DIRECT_READ=true; aprun -n 32 -N 1 ~/benchmarks/IOR-2.10.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -k -o lustre:iortes
Max Write: 9484.11 MiB/sec (9944.81 MB/sec)
Max Read:  7929.63 MiB/sec (8314.81 MB/sec)

It seems direct I/O helps parallel read performance quite a bit, but not
writes. The shared-file mode appears to benefit more from direct writes.

While it is understandable that the client cache can play a big role here,
I am not sure why it would help the shared-file mode so much more. Can
anybody offer an explanation for the comparison between reads and writes,
and likewise for shared-file versus separate files?

Also, let me know if my description is unclear.

--
Weikuan Yu <+> 1-865-574-7990
http://ft.ornl.gov/~wyu/

P.S.: What is shown are the good numbers from several runs, so you may
consider them consistent results.
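P.P.S.: For anyone who wants to reproduce the striping, something like the
following sets a directory's default to 72 stripes. This is only a sketch:
the directory path is illustrative, and it assumes the positional setstripe
arguments (stripe_size in bytes, starting OST index, stripe_count), with 0
and -1 keeping the defaults.

lfs setstripe /lustre/scratch/ior_dir 0 -1 72
lfs getstripe /lustre/scratch/ior_dir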
My misunderstanding, perhaps, was that a Lustre FS had a maximum lfs
stripe-count of 160. Is that not a constant of the LFS, but just some
local configuration? Could you be more specific about the actual lfs
stripe-count of the file or files you wrote?

MLB

Weikuan Yu wrote:
> Here are some numbers from a smaller file system on an XT system at ORNL.
> All files are striped across 72 OSTs. I deliberately use a block size of
> 8512m.
Marty Barnaby wrote:
> My misunderstanding, perhaps, was that a Lustre FS had a maximum lfs
> stripe-count of 160. Is that not a constant of the LFS, but just some
> local configuration? Could you be more specific about the actual lfs
> stripe-count of the file or files you wrote?

You're right about the maximum stripe count; 72 was simply a local choice
for my testing. The stripe count can have an effect, but probably only a
small one, on the relative comparison between runs with and without
direct I/O.

--Weikuan
Hi,

Weikuan Yu wrote:
> Here are some numbers from a smaller file system on an XT system at ORNL.
> All files are striped across 72 OSTs. I deliberately use a block size of
> 8512m.
>
> 1. sample tests with separate files
> # aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -F -o iortes
> Max Write: 9978.18 MiB/sec (10462.88 MB/sec)
> Max Read:  5612.78 MiB/sec (5885.43 MB/sec)
>
> 2. sample shared-file performance
> # aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -o iortes
> Max Write: 6817.31 MiB/sec (7148.47 MB/sec)
> Max Read:  5591.98 MiB/sec (5863.62 MB/sec)

What is the stripe_size of this test? 4M? If it is 4M, then the transfer
size (64m) is larger than the stripe size. We have seen this situation
before: it seems that each client ends up holding too large an extent lock
on each write (because of Lustre's down-forward extent lock policy), which
can block other clients' writes and so hurt the parallelism of the whole
system. Maybe you could try decreasing the transfer size to the stripe
size, or increasing the stripe size to 64M, and see how it goes?

Thanks
WangDi
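P.S. For concreteness, either of these would line the two sizes up. This
is only a sketch: the directory path is illustrative, and the setstripe
line assumes the positional arguments (stripe_size in bytes, starting OST
index, stripe_count).

# keep the current striping, and shrink the IOR transfer size to match (e.g. 4m):
aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 4m -d 1 -i 2 -w -r -g -o iortes

# or raise the directory's default stripe size to 64M (67108864 bytes)
# before creating the file, so it matches -t 64m:
lfs setstripe /lustre/scratch/ior_dir 67108864 -1 72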
WangDi wrote:
> Maybe you could try decreasing the transfer size to the stripe size, or
> increasing the stripe size to 64M, and see how it goes?

Yes, the difference between the shared file and separate files has been
seen before, but I have never seen an explanation for it on CNL. BTW, this
performance difference between shared and separate files stays the same
regardless of the transfer size.

Would anybody like to offer an explanation for the direct I/O behavior as
well?

--Weikuan
I had tried the direct I/O last year and it didn't seem to be working at
the time, so I gave up and haven't been back there again.

For the file-per-processor vs. shared, I made many different benchmark
trials, but never really head-to-head. My efforts were all with our
redstorm:/scratch_grande:

/home/mlbarna> lfs getstripe -v /scratch_grande | grep ACTIVE | wc -l
320
/home/mlbarna> lfs getstripe -v /scratch_grande | grep -v ACTIVE
OBDS:
/scratch_grande/
default stripe_count: 4 stripe_size: 2097152 stripe_offset: -1
/scratch_grande/test.sh
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x4e92503
lmm_stripe_count:   4
lmm_stripe_size:    2097152
lmm_stripe_pattern: 1
        obdidx      objid      objid      group
           281    2777792   0x2a62c0        0
           282    2780317   0x2a6c9d        0
           283    2778125   0x2a640d        0
           284    2778316   0x2a64cc        0

My one-file-per-processor mode was executed with a NetCDF benchmark code
someone had put together. I can't remember the final numbers or processor
count, but at the time we were interested in actual, scientific-computing
usage patterns, so we had only an 80-400 KB range of blocksizes per
processor, which will never demonstrate a maximal byte-rate with a huge
Lustre FS. The one point I do know is that performance was always highest
when the directory the files were written into was lfs setstripe'd with
the values 0 -1 1. I found no improvement in adjusting the stripe_size
from the default 2 MB, but for large processor-count runs a stripe_count
of 1 was patently fastest.

My maximal MPI-IO collective writing to a shared file, again benchmarked
with its own simple program, wrote into a directory defined with the lfs
setstripe settings 0 -1 160. I found my apex of 26 GB/s running on only
160 processors with a per-processor blocksize of 20 MB.

To clarify my use of blocksize: the NetCDF trials are something like
running IOR with '-b 100m -t 80k', and for the MPI-IO collective I'd have
'-b 100m -t 20m'. The limit set by the -b option is not important; one
would want it to be as large as the available memory allows.

Both benchmarking codes I employed differed somewhat from the approach in
IOR. They each simply malloced a single buffer of the specified blocksize
and, after the file or files were opened, iterated on a barriered loop,
appending the same buffer for 'n' many rotations. Usually, the timer is
stopped as soon as the loop is exited, before the file closings.

I recently completed some modifications to my own IOR to execute more like
this. I moved the loop for repetitions inside the file open and close, and
adjusted the offset to be continuous, so every blocksize of transfers
appends to the end of the still-open file; then I sum up the product of
the blocksize and the repetitions for the total written to the file. I
have this basically working for POSIX single-shared-file, and also
PnetCDF.

MLB

Weikuan Yu wrote:
> Yes, the difference between the shared file and separate files has been
> seen before, but I have never seen an explanation for it on CNL.
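P.S. For clarity, here is a rough sketch (not my actual code) of the
barriered, appending collective-write loop I described above; the file
name, block size, and repetition count are only placeholders:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int blocksize = 20 * 1024 * 1024;   /* 20 MB per process per rotation */
    const int reps = 10;                      /* 'n' rotations of the same buffer */
    int rank, nprocs, i;
    char *buf;
    MPI_File fh;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(blocksize);                  /* one buffer, reused every rotation */
    memset(buf, rank, blocksize);

    MPI_File_open(MPI_COMM_WORLD, "shared_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        /* continuous offsets: rotation i appends right after everything
           written by all ranks in rotations 0 .. i-1 */
        MPI_Offset off = ((MPI_Offset)i * nprocs + rank) * blocksize;
        MPI_File_write_at_all(fh, off, buf, blocksize, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    t1 = MPI_Wtime();                         /* timer stops before the close */
    MPI_File_close(&fh);

    if (rank == 0)
        printf("aggregate write: %.2f MiB/sec\n",
               (double)blocksize * nprocs * reps / (1024.0 * 1024.0) / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}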
Thanks for the information. There are choices to be made about the stripe
count, depending on the targeted access pattern.

Is redstorm running under CNL or Catamount?

--Weikuan

Marty Barnaby wrote:
> For the file-per-processor vs. shared, I made many different benchmark
> trials, but never really head-to-head. My efforts were all with our
> redstorm:/scratch_grande:
The compute nodes for Redstorm are Catamount. Here at SNL we've
traditionally had a preoccupation with lightweight kernels. I don't have
anything to do with those decisions, or the discussion in general, but
this may finally give way now that we can get gigabytes of memory for
each processor.

MLB

Weikuan Yu wrote:
> Is redstorm running under CNL or Catamount?