Moving this discussion to lustre-devel...

OST object placement is a hard problem with conflicting requirements
including...

1. Even server space balance
2. Even server load balance
3. Minimal network congestion
4. Scalable ultra-wide file layout descriptor
5. Scalable placement algorithm

Implementing a placement algorithm with a centralized server clearly
isn't scalable and will have to be reworked for CMD. A starting point
might be to explore how to ensure CROW goes some way to satisfy
requirements 1-3 above.

BTW, I've long believed that it's a mistake not to give Lustre any
inkling that all the creates done by a FPP parallel application are
somehow related - e.g. via a cluster-wide job identifier. Surely
file-per-process placement is very close to shared file placement
(minus extent locking conflicts :)? I recognize that fixing this
still leaves the problem of how to get the best F/S utilization when
different applications share a cluster - but I don't think they are
necessarily the same problem, and trying to address them both with
the same solution seems wrong.

Cheers,
Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 05 March 2009 9:46 PM
> To: Nathaniel Rutman
> Cc: lustre-tech-leads at sun.com
> Subject: Re: create performance
>
> On Mar 05, 2009 09:39 -0800, Nathaniel Rutman wrote:
> > Alex Zhuravlav wrote:
> >>>>>>> Nathaniel Rutman (NR) writes:
> >> NR> What about preallocating objects per client, on the clients?
> >> NR> Client still needs to get a namespace entry from MDT, but could
> >> NR> then hold a write layout lock and do its own round-robin
> >> NR> allocation. For clients with subtree locks this could avoid any
> >> NR> need to talk to the MDT and wouldn't need the writeback cache.
> >>
> >> I thought "avoid any need to talk to MDT" implies "writeback cache"
> >
> > Hmm, well, maybe you consider this a limited version of writeback
> > cache? It would be kind of a notification of "here is the
> > layout/objects of my new file, with my new fid." Fid ranges and
> > object numbers would be granted to clients for their own use, and
> > the MDT would only have to do the namespace entry, asynchronously.
> > I suppose there are recovery issues we have to worry about then.
> >
> > What I was really trying to get at was to avoid the two-step process
> > of client -> MDT -> OST stripe allocation, which includes an extra
> > network hop in some precreation starvation cases, and always
> > includes some (a little?) CPU on the MDT:
> > 1. Clients get object grants for every OST.
> > 2. Clients assign objects to new files and send in reqs to the MDT,
> >    which just records the objects in the LOV EA.
> > 3. The MDT batches up the assigned objects and sends them to the
> >    OSTs for the orphan cleanup llog.
>
> The main problem with having many clients do precreation themselves is
> that this will invariably cause load imbalance on the OSTs, which will
> cause long-term file IO performance problems (much in excess of the
> performance problems hit during precreate).
>
> Cray recently filed a bug on the read performance of files being
> noticeably hurt by QOS object allocation due to space imbalance, even
> though the MDS is trying to balance across OSTs locally, but it is
> using random numbers to do this and is not selecting OSTs evenly.
>
> In a file-per-process checkpoint (say 100 processes/files on 100 OSTs)
> the MDS round-robin will allocate 1 object per OST evenly across all
> OSTs (excluding the case where an OSC is out of preallocated objects).
> If clients are doing the allocation (or in the past when the MDS did
> "random" OST selection) then the chance of all 100 clients allocating
> on 100 different OSTs is vanishingly small. Instead it is likely that
> some OSTs will have no objects used, and some will have 2 or 3 or 4,
> and the aggregate write performance will FOREVER be 50% or 33% or 25%
> of the MDS-round-robin allocated objects for that set of files. That
> is far worse than waiting 1s for the MDS to allocate the objects.
>
> IMHO, if we are doing WBC on the client, then there is no _requirement_
> that the client has to allocate objects for the files at all, and any
> write data could just be in the client page cache. Until the new file
> is visible on the MDS to another client, nobody can even try to access
> the data. Once the WBC cache is flushed to the MDS then objects can
> be allocated by the MDS evenly (granting an exclusive layout lock to
> the client in the process) until the cached client data is either
> flushed to disk or at least protected by extent locks and can be
> partially flushed as needed.
>
> Note that I don't totally object to WBC clients doing object
> allocations if they are creating a large number of files, in essence
> becoming an MDS that is tracking the load on the OSTs and balancing
> object creation appropriately. What I object to is the more common
> case where each client is creating a single file for a large FPP
> checkpoint, and the clients all selecting the OSTs separately.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
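The imbalance Andreas describes is easy to reproduce with a quick
balls-in-bins simulation (a standalone sketch, not Lustre code; the
constants and trial count are arbitrary):

```python
# Sketch: compare random vs round-robin OST selection for a
# file-per-process checkpoint (100 files on 100 OSTs). Aggregate
# write bandwidth is limited by the most heavily loaded OST.
import random

N_OSTS = 100
N_FILES = 100
TRIALS = 1000

def max_load_random(rng):
    # Place each file on a uniformly random OST; return the load on
    # the busiest OST.
    load = [0] * N_OSTS
    for _ in range(N_FILES):
        load[rng.randrange(N_OSTS)] += 1
    return max(load)

rng = random.Random(42)
avg_worst = sum(max_load_random(rng) for _ in range(TRIALS)) / TRIALS

# Round-robin places exactly one object per OST (max load 1, relative
# bandwidth 1.0). Random selection typically puts 3-5 objects on the
# busiest OST, cutting aggregate bandwidth to 1/max_load of that.
print("random: average max objects on one OST = %.1f" % avg_worst)
print("random: relative aggregate bandwidth ~= %.2f" % (1.0 / avg_worst))
```

The simulation shows the "50% or 33% or 25%" figures are not a corner
case but the typical outcome of uncoordinated random placement.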
On Mar 06, 2009 13:25 +0000, Eric Barton wrote:
> OST object placement is a hard problem with conflicting requirements
> including...
>
> 1. Even server space balance
> 2. Even server load balance
> 3. Minimal network congestion
> 4. Scalable ultra-wide file layout descriptor
> 5. Scalable placement algorithm
>
> Implementing a placement algorithm with a centralized server clearly
> isn't scalable and will have to be reworked for CMD. A starting
> point might be to explore how to ensure CROW goes some way to
> satisfy requirements 1-3 above.

While CROW can help avoid latency for precreating objects (which can
avoid some of the object allocation imbalances hit today when OSTs are
slow precreating objects), it doesn't really fundamentally help to
balance space and performance of the OSTs. With any filesystem with
more than a handful of OSTs there shouldn't be any reason why the OSTs
precreating can't keep up with the MDS create rate. Johann and I were
discussing this problem and I suspect it is only a defect in the object
precreation code and not a fundamental problem in the design.

I definitely agree that for CMD we will have distributed object
allocation, but so far it isn't clear whether having more than the
MDSes and/or WBC clients doing the allocation will improve the
situation or make it worse.

> BTW, I've long believed that it's a mistake not to give Lustre any
> inkling that all the creates done by a FPP parallel application are
> somehow related - e.g. via a cluster-wide job identifier. Surely
> file-per-process placement is very close to shared file placement
> (minus extent locking conflicts :)?

Yes, I agree. In theory it should be possible to extract this kind of
information from the client processes themselves, either by examining
the process environment (some MPI job launchers store the MPI rank
there for pre-launch shell scripts) or by comparing the filenames being
created by the clients. Any file-per-process job will invariably create
filenames with the rank in the filename.

> I recognize that fixing this still leaves the problem of how to get
> the best F/S utilization when different applications share a cluster
> - but I don't think they are necessarily the same problem and trying
> to address them both with the same solution seems wrong.
>
> Cheers,
> Eric
>
> > -----Original Message-----
> > From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> > Sent: 05 March 2009 9:46 PM
> > To: Nathaniel Rutman
> > Cc: lustre-tech-leads at sun.com
> > Subject: Re: create performance
> >
> > [earlier message quoted in full above]

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
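The filename heuristic Andreas mentions can be sketched in a few lines
(a hypothetical illustration with invented names, not anything in
Lustre): a burst of creates whose names differ only in a numeric rank
field can be collapsed to a single job key.

```python
# Sketch (hypothetical, not Lustre code): group a burst of creates
# whose names differ only in a numeric field, as a file-per-process
# checkpoint like "ckpt.0042" would produce.
import re
from collections import defaultdict

def job_key(name):
    # Replace every digit run with a placeholder; files from the same
    # FPP job then collapse to the same key.
    return re.sub(r"\d+", "#", name)

def group_creates(filenames):
    groups = defaultdict(list)
    for name in filenames:
        groups[job_key(name)].append(name)
    return groups

creates = ["ckpt.0000", "ckpt.0001", "ckpt.0002", "restart.log"]
groups = group_creates(creates)
print(groups["ckpt.#"])  # the three checkpoint files group together
```

A real implementation would also need a time window and a minimum group
size before treating the creates as related, to avoid false positives.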
On Mon, Mar 09, 2009 at 12:05:34AM -0600, 'Andreas Dilger' wrote:
> On Mar 06, 2009 13:25 +0000, Eric Barton wrote:
> > OST object placement is a hard problem with conflicting requirements
> > including...
> >
> > 1. Even server space balance
> > 2. Even server load balance
> > 3. Minimal network congestion
> > 4. Scalable ultra-wide file layout descriptor
> > 5. Scalable placement algorithm
> >
> > Implementing a placement algorithm with a centralized server clearly
> > isn't scalable and will have to be reworked for CMD. A starting
> > point might be to explore how to ensure CROW goes some way to
> > satisfy requirements 1-3 above.

CROW should satisfy #4 easily because it would allow us to have the
same OST-side FID for all stripes of a file, which combined with a
compression of the stripe configuration of the file (the ordered list
of OSTs) should result in a fixed-size FID for all files. (For compat,
small FIDs can be expanded when talking to old clients.)

CROW should be mostly orthogonal to #1-3 and #5 though, except that a
good compression technique for the stripe configuration might make it
easier to get even server space and load balance. Imagine an algorithm
that takes a list of OSTs, a stripe count, and an index as inputs and
quickly outputs an ordered list of <stripe-count> OSTs, such that for
each index value you get a pseudo-random permutation of a
pseudo-randomly picked combination of <stripe-count> OSTs. Then we
could monotonically increment that index as a way to generate the next
new file's placement.

For this use, an LFSR would be a perfect way to get pseudo-randomness
(we don't need cryptographic strength for this purpose). The index
becomes a seed for the LFSR. We might need two indexes, actually, one
for the combination of OSTs and one for the permutation thereof. With
a pseudo-random distribution of combinations and permutations we ought
to get a fair distribution of data and load.

> While CROW can help avoid latency for precreating objects (which can
> avoid some of the object allocation imbalances hit today when OSTs
> are slow precreating objects), it doesn't really fundamentally help
> to balance space and performance of the OSTs. With any filesystem
> with more than a handful of OSTs there shouldn't be any reason why
> the OSTs precreating can't keep up with the MDS create rate. Johann
> and I were discussing this problem and I suspect it is only a defect
> in the object precreation code and not a fundamental problem in the
> design.
>
> I definitely agree that for CMD we will have distributed object
> allocation, but so far it isn't clear whether having more than the
> MDSes and/or WBC clients doing the allocation will improve the
> situation or make it worse.

We really should use CROW for these reasons:

- CROW enables fixed-size FIDs no matter how large the stripe count
- no need to destroy unused pre-created files on MDS reboot

> > BTW, I've long believed that it's a mistake not to give Lustre any
> > inkling that all the creates done by a FPP parallel application are
> > somehow related - e.g. via a cluster-wide job identifier. Surely
> > file-per-process placement is very close to shared file placement
> > (minus extent locking conflicts :)?
>
> Yes, I agree. In theory it should be possible to extract this kind
> of information from the client processes themselves, either by
> examining the process environment (some MPI job launchers store the
> MPI rank there for pre-launch shell scripts) or by comparing the
> filenames being created by the clients. Any file-per-process job
> will invariably create filenames with the rank in the filename.

Sounds like a good idea, and configurable via regexes (ick, I know).

Even better would be a way to associate a cluster job ID with a set of
processes. This could be done via Linux keyrings, say.

Nico
--
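The LFSR-driven placement Nico proposes might look something like the
following sketch (illustrative only: the 16-bit taps value and the
rejection loop for distinct OSTs are assumptions, not an actual Lustre
algorithm):

```python
# Sketch of LFSR-based placement: a Galois LFSR, seeded with the file
# index, drives selection of <stripe-count> distinct OSTs. 0xB400 is
# the standard tap mask for a maximal-length 16-bit Galois LFSR.

def lfsr16(state):
    # One step of a 16-bit maximal-length Galois LFSR.
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

def pick_osts(index, n_osts, stripe_count):
    state = (index & 0xFFFF) or 1   # LFSR state must be non-zero
    chosen = []
    while len(chosen) < stripe_count:
        state = lfsr16(state)
        ost = state % n_osts
        if ost not in chosen:        # combination: distinct OSTs
            chosen.append(ost)       # order gives the permutation
    return chosen

# Incrementing the index yields a new pseudo-random placement per file.
print(pick_osts(1, 100, 4))
print(pick_osts(2, 100, 4))
```

Note this exhibits exactly the locally non-uniform behaviour Andreas
objects to below: consecutive indexes give no guarantee that the OSTs
chosen for one file avoid those chosen for the previous one.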
On Jun 02, 2009 14:38 -0500, Nicolas Williams wrote:
> On Mon, Mar 09, 2009 at 12:05:34AM -0600, 'Andreas Dilger' wrote:
> > On Mar 06, 2009 13:25 +0000, Eric Barton wrote:
> > > OST object placement is a hard problem with conflicting
> > > requirements including...
> > >
> > > 1. Even server space balance
> > > 2. Even server load balance
> > > 3. Minimal network congestion
> > > 4. Scalable ultra-wide file layout descriptor
> > > 5. Scalable placement algorithm
> > >
> > > Implementing a placement algorithm with a centralized server
> > > clearly isn't scalable and will have to be reworked for CMD. A
> > > starting point might be to explore how to ensure CROW goes some
> > > way to satisfy requirements 1-3 above.
>
> CROW should satisfy #4 easily because it would allow us to have the
> same OST-side FID for all stripes of a file, which combined with a
> compression of the stripe configuration of the file (the ordered list
> of OSTs) should result in a fixed-size FID for all files. (For
> compat, small FIDs can be expanded when talking to old clients.)

CROW itself isn't required for wide striping. It is possible to
allocate FID sequences to OSTs in a manner that will allow widely
striped files to be specified in a compact manner. The main problem
with widely-striped files is that they add overhead to file IO
operations, because the client might potentially have to get hundreds
or thousands of locks per file.

> CROW should be mostly orthogonal to #1-3 and #5 though, except that a
> good compression technique for the stripe configuration might make it
> easier to get even server space and load balance. Imagine an
> algorithm that takes a list of OSTs, a stripe count, and an index as
> inputs and quickly outputs an ordered list of <stripe-count> OSTs,
> such that for each index value you get a pseudo-random permutation of
> a pseudo-randomly picked combination of <stripe-count> OSTs. Then we
> could monotonically increment that index as a way to generate the
> next new file's placement.
>
> For this use, an LFSR would be a perfect way to get pseudo-randomness
> (we don't need cryptographic strength for this purpose). The index
> becomes a seed for the LFSR. We might need two indexes, actually, one
> for the combination of OSTs and one for the permutation thereof.
> With a pseudo-random distribution of combinations and permutations we
> ought to get a fair distribution of data and load.

In our previous testing, any kind of random OST selection is
sub-optimal compared to round-robin. The problem is that RNG/PRNG OST
selection, while uniform on average, is definitely non-uniform locally,
and this results in non-uniform OST selection and clients competing for
OSS/OST resources. For example, if 100 MPI clients are creating 100
files on 100 OSTs, then on average there would be 1 file/OST, but
typically some OSTs will have 2 or 3 files, while others are idle.
This will result in IO being 2-3x slower on those OSTs, and often
result in the entire IO being 2-3x slower.

While we do something similar to this for the case of unbalanced OSTs,
we want to move to a round-robin scheme even in the case of unbalanced
OSTs. This would use a "freespace accumulator" similar to a Bresenham
line algorithm, so that OSTs which are below the average freespace will
be skipped until their "accumulated freespace" is temporarily above
average.

> > > BTW, I've long believed that it's a mistake not to give Lustre
> > > any inkling that all the creates done by a FPP parallel
> > > application are somehow related - e.g. via a cluster-wide job
> > > identifier. Surely file-per-process placement is very close to
> > > shared file placement (minus extent locking conflicts :)?
> >
> > Yes, I agree. In theory it should be possible to extract this kind
> > of information from the client processes themselves, either by
> > examining the process environment (some MPI job launchers store the
> > MPI rank there for pre-launch shell scripts) or by comparing the
> > filenames being created by the clients. Any file-per-process job
> > will invariably create filenames with the rank in the filename.
>
> Sounds like a good idea, and configurable via regexes (ick, I know).
>
> Even better would be a way to associate a cluster job ID with a set
> of processes. This could be done via Linux keyrings, say.

This is probably easiest to start with MPI-IO ADIO ioctls directly to
Lustre. Once we know it helps we can look at other mechanisms to get
this information from applications that don't use MPI-IO.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
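The "freespace accumulator" Andreas describes might be sketched as
follows (invented names and a toy free-space vector, not the actual
Lustre allocator): each OST earns credit proportional to its free space
on every pass of the round-robin walk, and is only allocated to when
its credit reaches the average, so emptier OSTs are skipped
occasionally while the walk stays deterministic and even.

```python
# Sketch (invented names, not the actual Lustre allocator) of a
# Bresenham-style weighted round-robin over unbalanced OSTs.

def weighted_round_robin(free_space, n_objects):
    n = len(free_space)
    avg = sum(free_space) / n
    credit = [0.0] * n
    placements = []
    i = 0
    while len(placements) < n_objects:
        credit[i] += free_space[i]
        if credit[i] >= avg:          # enough credit: allocate here
            credit[i] -= avg
            placements.append(i)
        i = (i + 1) % n               # keep walking round-robin
    return placements

# OST 2 has half the free space of the others, so it receives roughly
# half as many objects while the rest stay evenly loaded.
free = [100, 100, 50, 100]
p = weighted_round_robin(free, 14)
print([p.count(i) for i in range(4)])  # -> [4, 4, 2, 4]
```

Like Bresenham's line algorithm, the error term (here, accumulated
free-space credit) decides when to "step" on an OST, giving long-run
proportional placement without any randomness.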