Hello,

as a project for college I'm doing a behavioral comparison between Lustre and CXFS when dealing with simple strided files using POSIX semantics. In one of the tests, each participating process reads 16 chunks of 32 MB each from a common, strided file using the following code:

------------------------------------------------------------------------------------------
int myfile = open("thefile", O_RDONLY);

MPI_Barrier(MPI_COMM_WORLD);  // the barriers are only there to help measure time

off_t distance = (numtasks-1)*p.buffersize;
off_t offset   = rank*p.buffersize;

int j;
lseek(myfile, offset, SEEK_SET);
for (j = 0; j < p.buffercount; j++) {
    read(myfile, buffers[j], p.buffersize);  // buffers are aligned to the page size
    lseek(myfile, distance, SEEK_CUR);
}

MPI_Barrier(MPI_COMM_WORLD);

close(myfile);
------------------------------------------------------------------------------------------

I'm facing the following problem: when this code is run in parallel, the read operations on certain processes take more and more time to complete. I attached a graphical trace of this for a run with only 2 processes. As you can see, the read operations on process 0 stay more or less constant at about 0.12 seconds each, while on process 1 they grow to as much as 39 seconds!

If I run the program with only one process, the time stays at ~0.12 seconds per read operation. The problem doesn't appear if the O_DIRECT flag is used.

Can somebody explain to me why this is happening? Since I'm very new to Lustre, I may be making some silly mistake, so be nice to me ;)

I'm using Lustre 1.6.5.1 on SLES 10 Patchlevel 1, kernel 2.6.16.54-0.2.5_lustre.1.6.5.1.

Thanks!

Alvaro Aguilera.

[Attachment: lustre.png -- graphical trace of the per-read times for the 2-process run]
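One incidental point about the loop above, separate from the slowdown discussed below: POSIX allows read() to return fewer bytes than requested, so benchmarks usually retry until the whole chunk has arrived. A minimal sketch of such a wrapper, reusing the file descriptor and buffer names from the post (the helper itself is hypothetical and not part of the original test):

------------------------------------------------------------------------------------------
#include <errno.h>
#include <unistd.h>

/* Read exactly `count` bytes into `buf`; return 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    while (count > 0) {
        ssize_t n = read(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal, just retry */
            return -1;               /* real I/O error */
        }
        if (n == 0)
            return -1;               /* unexpected end of file */
        p += n;
        count -= (size_t)n;
    }
    return 0;
}
------------------------------------------------------------------------------------------

With that, the call in the loop would become read_full(myfile, buffers[j], p.buffersize), so each timing covers a full chunk rather than a possibly partial read() call.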
On Thu, 2009-08-20 at 23:52 +0200, Alvaro Aguilera wrote:
> I'm facing the following problem: when this code is run in parallel,
> the read operations on certain processes take more and more time to
> complete. I attached a graphical trace of this for a run with only 2
> processes.

Just a (perhaps silly) question, but does the striping of the file (or of the directory the file is being created in) match your I/O pattern? That is, ideally, each thread/rank/process (whatever you want to call them) should be doing I/O in its own stripe.

$ man lfs

if none of this is meaningful.

b.
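For concreteness, the suggestion boils down to commands along these lines, using the option spelling of the 1.6-era lfs (file and directory names are placeholders; check man lfs on your system, since the flags have changed between releases):

$ lfs getstripe thefile                 # show stripe size, stripe count and OST placement
$ lfs setstripe -s 32m -c 1 newfile     # create a file with a 32 MB stripe size on one OST
$ lfs setstripe -s 32m -c -1 /some/dir  # default layout for files created later in that directory, striped over all OSTs

Striping is fixed when a file is created, so changing the layout of an existing file means recreating it (or copying it into a file created with the desired layout).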
Thanks for pointing that out. I was using the default striping, which in my case is a 1 MB stripe size on one OST.

However, if I change the stripe size to 32 MB (the size of the buffers being written/read), the function that writes the file using O_DIRECT stops working. Its code is very similar to the one posted above, and the problem is that the write() call gets stuck while writing the first buffer. Is there any trick to using O_DIRECT on Lustre? I've aligned the buffers using posix_memalign(), and every offset and count seems to be a multiple of the page size (4 KB).

On Fri, Aug 21, 2009 at 12:04 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> Just a (perhaps silly) question, but does the striping of the file (or
> of the directory the file is being created in) match your I/O pattern?
> That is, ideally, each thread/rank/process (whatever you want to call
> them) should be doing I/O in its own stripe.
>
> $ man lfs
>
> if none of this is meaningful.
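For reference, the usual O_DIRECT ground rules on Linux are that the buffer address, the file offset and the transfer size all have to be aligned; for Lustre clients of that generation, page-size alignment is what is asked for, which matches what is described above. Below is a minimal, self-contained sketch of that kind of direct write, meant as a sanity check rather than a fix for the hang (file name and sizes are placeholders; the file is assumed to have been created with the desired striping beforehand):

------------------------------------------------------------------------------------------
#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t bufsize = 32UL * 1024 * 1024;   /* 32 MB chunk, as in the test */
    void *buf;

    /* the buffer must be aligned; page size (4 KB here) is the usual requirement */
    if (posix_memalign(&buf, 4096, bufsize) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, bufsize);

    int fd = open("thefile", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* offset and count must also be multiples of the page size */
    off_t offset = 0;                            /* rank * bufsize in the real test */
    if (pwrite(fd, buf, bufsize, offset) != (ssize_t)bufsize)
        perror("pwrite");

    close(fd);
    free(buf);
    return 0;
}
------------------------------------------------------------------------------------------

Shrinking bufsize to a single page is a quick way to tell an alignment mistake from a problem related to the 32 MB stripe size: if even a one-page direct write hangs, alignment is not the culprit.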
Hello!

Any chance you can use a more modern release like 1.8.1? A number of bugs have been fixed since 1.6.5, including some read-ahead logic problems that could impede read performance.

Bye,
    Oleg

On Aug 20, 2009, at 10:38 PM, Alvaro Aguilera wrote:
> Thanks for pointing that out. I was using the default striping, which
> in my case is a 1 MB stripe size on one OST.
>
> However, if I change the stripe size to 32 MB (the size of the buffers
> being written/read), the function that writes the file using O_DIRECT
> stops working. Its code is very similar to the one posted above, and
> the problem is that the write() call gets stuck while writing the first
> buffer. Is there any trick to using O_DIRECT on Lustre? I've aligned
> the buffers using posix_memalign(), and every offset and count seems to
> be a multiple of the page size (4 KB).
Hello,

You may want to look at bug 17197 and try applying this patch to your Lustre source:
https://bugzilla.lustre.org/attachment.cgi?id=25062
Or you can wait for 1.8.2.

Thanks
Wangdi

Alvaro Aguilera wrote:
> I'm facing the following problem: when this code is run in parallel,
> the read operations on certain processes take more and more time to
> complete. The read operations on process 0 stay more or less constant
> at about 0.12 seconds each, while on process 1 they grow to as much as
> 39 seconds!
>
> If I run the program with only one process, the time stays at ~0.12
> seconds per read operation. The problem doesn't appear if the O_DIRECT
> flag is used.
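For anyone following along later: applying an attachment like that to the Lustre source tree is the usual patch(1) routine. The file name below is only a placeholder, and the -p level depends on how the diff was generated, so do a dry run first:

$ cd lustre-1.6.5.1
$ patch -p1 --dry-run < /tmp/bug17197-attachment-25062.patch   # check that it applies cleanly
$ patch -p1 < /tmp/bug17197-attachment-25062.patch

after which the client modules have to be rebuilt and reinstalled.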
No, for the time being I'm stuck with this version...

Regards,
Alvaro.

On Fri, Aug 21, 2009 at 4:57 AM, Oleg Drokin <Oleg.Drokin at sun.com> wrote:
> Any chance you can use a more modern release like 1.8.1? A number of
> bugs have been fixed since 1.6.5, including some read-ahead logic
> problems that could impede read performance.
Thanks for the hint, but unfortunately I can't make any updates to the cluster...

Do you think both of the problems I experienced are bugs in Lustre, and are they resolved in current versions?

Thanks.
Alvaro.

On Fri, Aug 21, 2009 at 6:32 AM, di wang <di.wang at sun.com> wrote:
> You may want to look at bug 17197 and try applying this patch to your
> Lustre source: https://bugzilla.lustre.org/attachment.cgi?id=25062
> Or you can wait for 1.8.2.
Alvaro Aguilera wrote:
> Thanks for the hint, but unfortunately I can't make any updates to the
> cluster...
>
> Do you think both of the problems I experienced are bugs in Lustre, and
> are they resolved in current versions?

These should be Lustre bugs. Do the 2 processes run on different nodes or on the same node?

Thanks
WangDi
They run on different physical nodes and access the OST via 4x InfiniBand.

On Fri, Aug 21, 2009 at 3:15 PM, di wang <di.wang at sun.com> wrote:
> These should be Lustre bugs. Do the 2 processes run on different nodes
> or on the same node?
Hello,

Alvaro Aguilera wrote:
> They run on different physical nodes and access the OST via 4x InfiniBand.

I have never heard of such problems when the processes run on different nodes. Client memory, perhaps? Can you post the read-ahead stats (before and after the test) here, obtained with

lctl get_param llite.*.read_ahead_stats

But there are indeed a lot of fixes for strided reads since 1.6.5; they are included in the patch I pointed to earlier and can probably fix your problem.

Thanks
WangDi
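A convenient way to collect exactly those numbers is to snapshot the statistics around the run on the client node and diff them (the test binary name is only a placeholder):

$ lctl get_param llite.*.read_ahead_stats > ra_before.txt
$ mpirun -np 2 ./strided_read_test
$ lctl get_param llite.*.read_ahead_stats > ra_after.txt
$ diff -u ra_before.txt ra_after.txt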
Hi,

here is the requested information.

Before the test:

llite.fastfs-ffff810102a6a400.read_ahead_stats
snapshot_time:                  1251851453.382275 (secs.usecs)
pending issued pages:           0
hits                            7301235
misses                          10546
readpage not consecutive        14369
miss inside window              1
failed grab_cache_page          6285314
failed lock match               0
read but discarded              98955
zero length file                0
zero size window                3495
read-ahead to EOF               172
hit max r-a issue               783042
wrong page from grab_cache_page 0

After:

llite.fastfs-ffff810102a6a400.read_ahead_stats
snapshot_time:                  1251851620.183964 (secs.usecs)
pending issued pages:           0
hits                            7506005
misses                          330064
readpage not consecutive        14432
miss inside window              319450
failed grab_cache_page          6322954
failed lock match               17294
read but discarded              98955
zero length file                0
zero size window                3495
read-ahead to EOF               192
hit max r-a issue               837908
wrong page from grab_cache_page 0

There seem to be a lot of misses, as well as a locking problem, don't you think? Btw., in this test 4 processes read 512 MB each from a 2 GB file.

Regards,
Alvaro.

On Fri, Aug 21, 2009 at 3:38 PM, di wang <di.wang at sun.com> wrote:
> I have never heard of such problems when the processes run on different
> nodes. Client memory, perhaps? Can you post the read-ahead stats (before
> and after the test) here, obtained with
>
> lctl get_param llite.*.read_ahead_stats
Hello,

During the test, miss_inside_window grew by about 319,000 while hits grew by only about 205,000, i.e. roughly 3 to 2, which is indeed far too high. It probably means that a lot of pages are read in by read-ahead but evicted again before they are actually accessed. So the patch in bug 17197 should fix this problem; it will be included in 1.8.2.

Thanks
WangDi

Alvaro Aguilera wrote:
> hits                            7506005
> misses                          330064
> miss inside window              319450
> failed lock match               17294
>
> There seem to be a lot of misses, as well as a locking problem, don't
> you think? Btw., in this test 4 processes read 512 MB each from a 2 GB
> file.
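For a client that cannot be patched or upgraded right away, one stop-gap that is sometimes tried when read-ahead thrashes on strided access is to shrink or disable the client read-ahead window and let the 32 MB application reads do the work themselves. A sketch, assuming the llite read-ahead tunable shipped with 1.6 (verify the parameter name under /proc/fs/lustre/llite/ on your client before relying on it):

$ lctl get_param llite.*.max_read_ahead_mb          # note the current window size in MB
$ lctl set_param llite.*.max_read_ahead_mb=0        # 0 disables client read-ahead
  (run the strided test)
$ lctl set_param llite.*.max_read_ahead_mb=<old>    # restore the value noted above

Since the problem also disappears with O_DIRECT, which bypasses the page cache and read-ahead entirely, this probes the same hypothesis without changing the application.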