Juan Piernas Canovas
2007-Oct-01 22:56 UTC
[Lustre-discuss] Performance problems with Lustre 1.6.1
Hi all,

I have set up a small Lustre file system with 1 MDS and 8 OSS/OST. The particularity of our system is that every OSS is also a client of the file system (there are 8 clients altogether).

The file system has a 1 GB file striped across all the OSTs. On every OST, there is a process which reads the file chunks stored locally, i.e., in its own OST (since the processes have the striping information of the file, each one knows which portions of the file are stored in its OST).

The problem that I have is that, when the stripe size is 1MB (which means that there are 1024 chunks in total, or 128 chunks per OST), it takes more than 400 seconds to read the file, and the network traffic is very high. However, if the stripe size is 128 MB (8 chunks altogether, one per OST), it takes only around 100 seconds to read the file, and the network traffic is 1/10th of the previous one. Note that, in both cases, the data I/O operations are local and the processes read the same amount of data.

Could this be a problem with the lock mechanism and the caching on the clients? If so, I have seen that the ldlm can be disabled, but how? (The processes read from disjoint parts of the file, so they do not really need the ldlm service.)

Thanks in advance,
Juan.
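The chunk counts quoted in the message above can be checked with a short sketch. It assumes plain round-robin striping, where stripe number n of the file lives on the object at index n modulo the stripe count. The function name `local_chunks` and the parameter `stripe_index` (the position of an OST within this file's layout, not the OST's global number) are illustrative and not part of any Lustre API.

```python
MiB = 1024 * 1024


def local_chunks(file_size, stripe_size, stripe_count, stripe_index):
    """Return (offset, length) pairs for the chunks that round-robin
    striping would place on the object at stripe_index (assumed layout)."""
    chunks = []
    offset = stripe_index * stripe_size
    while offset < file_size:
        # The last chunk may be shorter if the file size is not a multiple
        # of the stripe size.
        length = min(stripe_size, file_size - offset)
        chunks.append((offset, length))
        # The next chunk on the same object is one full round of stripes away.
        offset += stripe_size * stripe_count
    return chunks


file_size = 1024 * MiB  # the 1 GB file from the message
for stripe_mib in (1, 128):
    per_ost = [local_chunks(file_size, stripe_mib * MiB, 8, i) for i in range(8)]
    total = sum(len(chunks) for chunks in per_ost)
    print(f"{stripe_mib} MiB stripes: {total} chunks in total, "
          f"{len(per_ost[0])} per OST")
# 1 MiB stripes: 1024 chunks in total, 128 per OST
# 128 MiB stripes: 8 chunks in total, 1 per OST
```

With 1 MiB stripes, consecutive local chunks are 8 MiB apart, so anything read sequentially past the end of a local chunk belongs to another OST.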
Kilian CAVALOTTI
2007-Oct-01 23:01 UTC
[Lustre-discuss] Performance problems with Lustre 1.6.1
Hi Juan,

On Monday 01 October 2007 03:56:33 pm Juan Piernas Canovas wrote:
> I have set up a small Lustre file system with 1 MDS and 8 OSS/OST.
> The particularity of our system is that every OSS is also a client of
> the file system (there are 8 clients altogether).

I'm not sure if that's related, but I recall reading in the Lustre manual that running a client and an OST on the same machine could lead to a whole range of unexpected results, including deadlocks:
http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-26-1.html#wp1072362

Cheers,
--
Kilian
Juan Piernas Canovas
2007-Oct-01 23:08 UTC
[Lustre-discuss] Performance problems with Lustre 1.6.1
Hi Kilian,

Thanks for your reply. Yes, I have also read that a deadlock can occur, but I have not had that problem so far.

I forgot to mention that if I use 8 independent files, one per OST, and every process reads the file in its OST, the performance is even better. Therefore, I assume (maybe wrongly) that this is a consistency/synchronization problem that appears when several processes access the same file, even when all of them are reading non-overlapping portions.

Regards,
Juan.

Kilian CAVALOTTI wrote:
> Hi Juan,
>
> On Monday 01 October 2007 03:56:33 pm Juan Piernas Canovas wrote:
>
>> I have set up a small Lustre file system with 1 MDS and 8 OSS/OST.
>> The particularity of our system is that every OSS is also a client of
>> the file system (there are 8 clients altogether).
>
> I'm not sure if that's related, but I recall reading in the Lustre
> manual that running a client and an OST on the same machine could lead
> to a whole range of unexpected results, including deadlocks:
> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-26-1.html#wp1072362
>
> Cheers,
Andreas Dilger
2007-Oct-02 06:01 UTC
[Lustre-discuss] Performance problems with Lustre 1.6.1
On Oct 01, 2007 15:56 -0700, Juan Piernas Canovas wrote:
> I have set up a small Lustre file system with 1 MDS and 8 OSS/OST. The
> particularity of our system is that every OSS is also a client of the
> file system (there are 8 clients altogether).
>
> The file system has a 1 GB file striped across all the OSTs. On every
> OST, there is a process which reads the file chunks stored locally,
> i.e., in its own OST (since the processes have the striping information
> of the file, each one knows which portions of the file are stored in its
> OST).
>
> The problem that I have is that, when the stripe size is 1MB (which means
> that there are 1024 chunks in total, or 128 chunks per OST), it takes
> more than 400 seconds to read the file, and the network traffic is very
> high. However, if the stripe size is 128 MB (8 chunks altogether, one
> per OST), it takes only around 100 seconds to read the file, and the
> network traffic is 1/10th of the previous one. Note that, in both cases,
> the data I/O operations are local and the processes read the same
> amount of data.

It sounds like the readahead is reading the "unused" parts of the file on the other OSTs. Are you also reading data from disk in 1MB chunks, or in smaller chunks? You should read at the stripe size for best performance in this test.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Juan Piernas Canovas
2007-Oct-10 00:13 UTC
[Lustre-discuss] Performance problems with Lustre 1.6.1
Andreas Dilger wrote:
> On Oct 01, 2007 15:56 -0700, Juan Piernas Canovas wrote:
>
>> I have set up a small Lustre file system with 1 MDS and 8 OSS/OST. The
>> particularity of our system is that every OSS is also a client of the
>> file system (there are 8 clients altogether).
>>
>> The file system has a 1 GB file striped across all the OSTs. On every
>> OST, there is a process which reads the file chunks stored locally,
>> i.e., in its own OST (since the processes have the striping information
>> of the file, each one knows which portions of the file are stored in its
>> OST).
>>
>> The problem that I have is that, when the stripe size is 1MB (which means
>> that there are 1024 chunks in total, or 128 chunks per OST), it takes
>> more than 400 seconds to read the file, and the network traffic is very
>> high. However, if the stripe size is 128 MB (8 chunks altogether, one
>> per OST), it takes only around 100 seconds to read the file, and the
>> network traffic is 1/10th of the previous one. Note that, in both cases,
>> the data I/O operations are local and the processes read the same
>> amount of data.
>
> It sounds like the readahead is reading the "unused" parts of the file
> on the other OSTs. Are you also reading data from disk in 1MB chunks,
> or in smaller chunks? You should read at the stripe size for best
> performance in this test.

Hi Andreas,

Thank you. You are right. One problem was that the readahead made a process on an OST read chunks from other OSTs. That explains the network traffic. The other problem was the size of the I/O requests, which was too "small" (16 KB).

The interesting point is that the (small) request size was the same in both configurations (1 MB and 128 MB stripe sizes) but, even then, the times were very different.

Regards,
Juan.

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
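Why the same 16 KB request size behaves so differently under the two stripe sizes can be illustrated with a toy model. The sketch below only counts how many bytes of a byte range fall in stripes owned by other OSTs, assuming round-robin striping. The 40 MiB window is an arbitrary illustrative value, and the model is not a description of how Lustre's readahead actually sizes or places its windows; `nonlocal_bytes` and its parameters are hypothetical names.

```python
MiB = 1024 * 1024


def nonlocal_bytes(start, length, stripe_size, stripe_count, stripe_index):
    """Count the bytes of [start, start + length) that lie in stripes
    NOT owned by stripe_index, assuming round-robin striping."""
    remote = 0
    pos = start
    end = start + length
    while pos < end:
        stripe_no = pos // stripe_size
        stripe_end = (stripe_no + 1) * stripe_size
        # Portion of the range that falls inside the current stripe.
        span = min(end, stripe_end) - pos
        if stripe_no % stripe_count != stripe_index:
            remote += span
        pos += span
    return remote


window = 40 * MiB  # hypothetical readahead window, for illustration only
for stripe_mib in (1, 128):
    remote = nonlocal_bytes(0, window, stripe_mib * MiB, 8, 0)
    print(f"{stripe_mib} MiB stripes: {remote // MiB} of "
          f"{window // MiB} MiB would come from other OSTs")
# 1 MiB stripes: 35 of 40 MiB would come from other OSTs
# 128 MiB stripes: 0 of 40 MiB would come from other OSTs
```

Under this model, a sequential window that starts inside a 128 MB stripe only spills into a remote stripe near the stripe's end, while with 1 MB stripes seven out of every eight megabytes in the window are remote.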
Andreas Dilger
2007-Oct-10 23:25 UTC
[Lustre-discuss] Performance problems with Lustre 1.6.1
On Oct 09, 2007 17:13 -0700, Juan Piernas Canovas wrote:
> Andreas Dilger wrote:
> >On Oct 01, 2007 15:56 -0700, Juan Piernas Canovas wrote:
> >>The problem that I have is that, when the stripe size is 1MB (which means
> >>that there are 1024 chunks in total, or 128 chunks per OST), it takes
> >>more than 400 seconds to read the file, and the network traffic is very
> >>high. However, if the stripe size is 128 MB (8 chunks altogether, one
> >>per OST), it takes only around 100 seconds to read the file, and the
> >>network traffic is 1/10th of the previous one. Note that, in both cases,
> >>the data I/O operations are local and the processes read the same
> >>amount of data.
> >
> >It sounds like the readahead is reading the "unused" parts of the file
> >on the other OSTs. Are you also reading data from disk in 1MB chunks,
> >or in smaller chunks? You should read at the stripe size for best
> >performance in this test.
>
> Thank you. You are right. One problem was that the readahead made a
> process on an OST read chunks from other OSTs. That explains the network
> traffic. The other problem was the size of the I/O requests, which was
> too "small" (16 KB). The interesting point is that the (small) request
> size was the same in both configurations (1 MB and 128 MB stripe sizes)
> but, even then, the times were very different.

In the 128MB stripe case, the readahead is mostly reading within the "local" stripe, so the overhead is minimal (some overflow into the next stripe but not so much). In the 1MB stripe case, the majority of the readahead will be in irrelevant parts of the filesystem and will slow down the overall performance.

If you read with read size == stripe size you will get "random" read heuristics (i.e. no readahead) and performance should be good. Of course, Lustre _should_ implement smarter strided readahead, but it doesn't yet (patches welcome :-).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
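A minimal sketch of the "read size == stripe size" suggestion, assuming round-robin striping and a process that already knows its stripe index within the file's layout. The names `pread_full`, `read_local_stripes` and `stripe_index` are illustrative and not part of Lustre; `os.pread` is Python's standard wrapper around pread(2) on Unix. Whether a given client version actually treats this pattern as "random" reads is as described in the message above, not something the code can verify.

```python
import os


def pread_full(fd, length, offset):
    """Read up to `length` bytes at `offset`, retrying on short reads."""
    buf = bytearray()
    while len(buf) < length:
        piece = os.pread(fd, length - len(buf), offset + len(buf))
        if not piece:  # end of file
            break
        buf += piece
    return bytes(buf)


def read_local_stripes(path, stripe_size, stripe_count, stripe_index):
    """Yield (offset, data) for each stripe assumed to be local, asking
    for one whole stripe per request instead of many small reads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        file_size = os.fstat(fd).st_size
        offset = stripe_index * stripe_size
        while offset < file_size:
            yield offset, pread_full(fd, stripe_size, offset)
            # Skip over the stripes that belong to the other OSTs.
            offset += stripe_size * stripe_count
    finally:
        os.close(fd)
```

A process on the node holding stripe index 3 of a file with 1 MiB stripes over 8 OSTs could then iterate over `read_local_stripes(path, 1024 * 1024, 8, 3)`, where `path` is a placeholder for the file's location on the mounted file system.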