Oracle BZ-4424 (continued in WC LU-80) adds support for larger OST stripe counts via increased EXT4 EA sizes. Some problems with this are:
1) increased MDT storage and network load for transmitting the object list
2) a relatively low new limit (1350, up from 160)

We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then use the same (or a calculable) object identifier (or FID) on all these OSTs.

Our version of wide striping does not involve increasing the EA size at all, but instead uses a new stripe pattern. (This will not be understandable by older Lustre versions, which will generate an error locally; alternatively, we could convert into the BZ-4424 form if the layout fits in that format.) A bitmap will identify which OSTs hold a stripe of this file. The bitmap should fit into the current ext4 EA size limit, giving us ~32k stripes.

Some OSTs may be down at file creation time, or new OSTs may be added later; hence there will likely be holes in the bitmap (but relatively few). A start index will still be used, but stripe order will be strictly round-robin (we will wrap around). In other words, the stripe sequence will always be in linear OST order, starting from start_index, possibly skipping some holes, wrapping around to start_index-1.

Wide-stripe objects do not need a special sequence number (fid_seq); the MDT knows the file was created as wide-striped and marks it as such (LOV_PATTERN_BITMAP). There are two options for OST object identification: common object ID and FID-on-OST.

Common Object ID
The MDT tracks a special range of OST object IDs ("wide stripe objectid" = WSO) that are used on all OSTs. The MDT assigns the next available WSO to the file, and this objectid is used on all the OSTs. The OSTs must never use these objects for regular striped files. A special precreation group for these objects is probably necessary, as well as orphan cleanup (the MDT should purge "hole" objects that aren't allocated from a particular OST). The MDT should track the last assigned WSO; this will be the starting point for new wide-striped files after recovery. Objects cannot be migrated from one OST to another, since this would result in out-of-order access. Similarly, stripes can never be added to holes.

FID-on-OST
Use a mapping of the MDT FID to uniquely determine an OST object. The clients and MDT add the OST number into the MDT FID (probably just reserve one sequence per OST). (This allows the objects to potentially migrate to different OSTs.) The OSTs then internally must map the FID to a local object id. Note this allows OST-local precreation pools, getting the MDT out of the precreate/orphan-cleanup business and potentially improving create speeds, and also facilitates "create on write" semantics. The FID can be assigned during the first access to the OST object.

The big problem here is that FID->OBJID (or better, FID->inode id) translation is absent from the OSTs today. See http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the current state of this?) There is also some work in this direction in the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled for Lustre 2.4).
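For concreteness, a minimal sketch of what such a bitmap layout EA might look like (this is not existing Lustre code; the struct name, field names, and sizes are illustrative assumptions). A 4 KB bitmap gives 32768 bit positions, which is where the ~32k stripe figure comes from:

    #include <stdint.h>

    #define LOV_PATTERN_BITMAP      0x400   /* hypothetical new pattern flag */
    #define WS_BITMAP_BYTES         4096    /* 4 KB -> 32768 possible OST indices */

    /* Hypothetical on-disk layout EA for a bitmap-striped file. */
    struct lov_mds_md_bitmap {
            uint32_t lmm_magic;                     /* identifies this layout format */
            uint32_t lmm_pattern;                   /* LOV_PATTERN_BITMAP */
            uint64_t lmm_object_id;                 /* common WSO used on every OST */
            uint32_t lmm_stripe_size;               /* bytes per stripe */
            uint32_t lmm_start_index;               /* OST index holding stripe 0 */
            uint8_t  lmm_bitmap[WS_BITMAP_BYTES];   /* bit i set => OST i holds a stripe */
    };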
There are a few questions here, probably the first of which is "is it worthwhile to spend effort on this, or is BZ4424 good enough?" Then there is the question of object identification, where FID-on-OST is more flexible, but also significantly more work (and risk). Also, I thought I understood from the EOFS Summit that WC also has a separate FID-on-OST project (separate from Orion, that is) -- can someone tell me the state of that?
On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
> Some OSTs may be down at file creation time, or new OSTs added later;
> hence there will likely be holes in the bitmap (but relatively few).
> Start index will still be used, but stripe order will be strictly
> round-robin (we will wrap around). In other words, the stripe
> sequence will always be in linear OST order, starting from
> start_index, maybe skipping some holes, wrapping around to
> start_index-1.

It didn't occur to me when we spoke at EOFS, but you'd need to store the number of OSTs in the system when the mapping was created if you allow it to wrap around -- otherwise, adding OSTs later would cause existing files to lose track of the objects after the wrap point.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
On Oct 3, 2011, at 5:17 PM, David Dillow wrote:
> On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
> [...]
>
> It didn't occur to me when we spoke at EOFS, but you'd need to store the
> number of OSTs in the system when the mapping was created if you allow
> it to wrap around -- otherwise, adding OSTs later would cause existing
> files to lose track of the objects after the wrap point.

That's done inherently in the bitmap, where everything beyond the current number of OSTs is marked as a hole. (So actually, there will typically be one giant hole at the end of every bitmap, and then maybe some singletons for deactivated OSTs.)
On Tue, 2011-10-04 at 10:44 -0700, Nathan Rutman wrote:
> On Oct 3, 2011, at 5:17 PM, David Dillow wrote:
> > [...]
> > It didn't occur to me when we spoke at EOFS, but you'd need to store the
> > number of OSTs in the system when the mapping was created if you allow
> > it to wrap around -- otherwise, adding OSTs later would cause existing
> > files to lose track of the objects after the wrap point.
>
> That's done inherently in the bitmap, where everything beyond the
> current number of OSTs is marked as a hole. (So actually, there will
> typically be one giant hole at the end of every bitmap, and then maybe
> some singletons for deactivated OSTs.)

Perhaps I'm misunderstanding something, then.

I understood you to say that we would have a linear OST order that starts from the start_index. So bitmap position 0 would be start_index, position 1 would be start_index + 1, and so on. If those bits are on, then there is an object for this file on those OSTs.

Am I on the same page so far?

Now, above you mention wrapping around to start_index - 1; I take this to mean that at some point, we'd say bitmap position N is no longer OST start_index + N, but would be OST 0. Bitmap position N + 1 would be OST 1, etc. This scheme may allow for a more compact bitmap when our file consists of OSTs at the extreme ends of the ones available, but you have to store the maximum OST number when creating the file to avoid having the bitmap wrap point shift when you add new OSTs.

Or perhaps I just misunderstood what you meant by wrapping? Did you mean bitmap position 0 is always OST 0, and the OST indicated by start_index will hold the first object, and each set bit in turn indicates the next OST/object, and if we run out of bits in the bitmap before we hit stripe_count, we'll start checking again at bitmap position/OST 0?
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
Hello, Nathan

On 10/03/2011 01:15 PM, Nathan Rutman wrote:
> [...]
> Our version of wide striping does not involve increasing the EA size at
> all, but instead uses a new stripe pattern. A bitmap will identify which
> OSTs hold a stripe of this file. The bitmap should fit into the current
> ext4 EA size limit, giving us ~32k stripes.
> [...]
> Wide-stripe objects do not need a special sequence number (fid_seq);
> the MDT knows the file was created as wide-striped and marks it as such
> (LOV_PATTERN_BITMAP). There are two options for OST object
> identification: common object ID and FID-on-OST.

Actually, we also discussed using a real object (IAM or another index format) to store the stripe pattern, instead of using an EA. Of course it would use more space, but it would give us the potential to explore the stripe pattern.

> FID-on-OST
> Use a mapping of the MDT FID to uniquely determine an OST object. The
> clients and MDT add the OST number into the MDT FID (probably just
> reserve one sequence per OST). (This allows the objects to potentially
> migrate to different OSTs.) The OSTs then internally must map the FID
> to a local object id. Note this allows OST-local precreation pools,
> getting the MDT out of the precreate/orphan-cleanup business and
> potentially improving create speeds, and also facilitates "create on
> write" semantics. The FID can be assigned during the first access to
> the OST object.

I am not sure I follow your idea here. You mean the OST needs to internally map the MDT FID (with the OST number added in) to an object id (or inode ino)? So there is no real OST FID?
But you also said "The FID can be assigned during the first access to the OST object." Could you please explain more here?

> The big problem here is that FID->OBJID (or better, FID->inode id)
> translation is absent from the OSTs today. See
> http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the
> current state of this?) There is also some work in this direction in
> the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled
> for Lustre 2.4).
>
> There are a few questions here, probably the first of which is "is it
> worthwhile to spend effort on this, or is BZ4424 good enough?" Then
> there is the question of object identification, where FID-on-OST is
> more flexible, but also significantly more work (and risk). Also, I
> thought I understood from the EOFS Summit that WC also has a separate
> FID-on-OST project (separate from Orion, that is) -- can someone tell
> me the state of that?

FID-on-OST is actually part of DNE (distributed namespace) phase I. It basically follows the current fid client/server infrastructure.

1. The MDT is the fid client, which requests fids from the OST and allocates fids for objects during pre-creation.
2. The OST is the fid server, which allocates FIDs to MDTs and requests super fid sequences from the fid control server (root MDT).
3. Similar to the MDT FID, there will be an OI to map FIDs to objects inside the OST.

The code will be released with DNE sometime next year.

Thanks
WangDi
Hi All,

> FID-on-OST is actually part of DNE (distributed namespace) phase I. It
> basically follows the current fid client/server infrastructure.
>
> 1. The MDT is the fid client, which requests fids from the OST and
> allocates fids for objects during pre-creation.
> 2. The OST is the fid server, which allocates FIDs to MDTs and requests
> super fid sequences from the fid control server (root MDT).
> 3. Similar to the MDT FID, there will be an OI to map FIDs to objects
> inside the OST.
>
> The code will be released with DNE sometime next year.

I think we don't need special FIDs for OST objects, unless we want to migrate an object between different data containers across the cluster. I don't think that's a priority for now. So we can simplify FID management for the OST.

Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}. In that case the OST doesn't need to allocate any FIDs, and the MDT can reuse the current allocation scheme. In fact we don't need to assign a FID to the OST object at file creation time (i.e. when creating the LSM), but we do need a guaranteed free OST object to exist when a client tries to access that object. For that, the OST can preallocate a pool and report its size to the MDT; the MDT knows it uses some objects from that pool, but not which object id is assigned to which file. To avoid OST confusion, the client sends the MDT FID to the OST when it needs access to an OST object. The OST looks in the OI database and checks whether that FID is already assigned to something. If assigned, the IO will return an inode; otherwise the OST grabs any free object from the pool and assigns it to that FID. That's all.

Orphan cleanup doesn't need to change in that case -- the MDT sends the last allocated objid, and the OST will kill the unallocated objects and return the last index to the MDT. The open-unlink case needs to change to put a fid in the LLOG record, and the OST needs to change to handle a FID as an object index.
--------------------------------------------
Alexey Lyashkov
alexey_lyashkov at xyratex.com
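To make the lookup-or-bind step described above concrete, here is a small self-contained toy model (plain C, not Lustre code; all names, types, and sizes are illustrative assumptions): the OST holds a pool of precreated anonymous object ids and binds one to an MDT FID only on first access, recording the binding in an OI-like table.

    #include <stdint.h>
    #include <stdio.h>

    #define POOL_SIZE 8

    struct fid { uint64_t seq; uint32_t oid; };

    struct oi_entry { struct fid fid; uint64_t objid; int used; };

    static struct oi_entry oi[POOL_SIZE];   /* toy FID -> objid index (the "OI") */
    static uint64_t pool[POOL_SIZE] = { 101, 102, 103, 104, 105, 106, 107, 108 };
    static int pool_next;                   /* next unassigned precreated object */

    static int fid_eq(const struct fid *a, const struct fid *b)
    {
            return a->seq == b->seq && a->oid == b->oid;
    }

    /* Return the local object id bound to @f, binding a fresh one from the
     * precreated pool on first access; -1 means the pool is exhausted and
     * the MDT would have to precreate more. */
    static int64_t ost_objid_by_fid(const struct fid *f)
    {
            for (int i = 0; i < POOL_SIZE; i++)
                    if (oi[i].used && fid_eq(&oi[i].fid, f))
                            return (int64_t)oi[i].objid;    /* already assigned */

            if (pool_next >= POOL_SIZE)
                    return -1;

            oi[pool_next] = (struct oi_entry){ *f, pool[pool_next], 1 };
            return (int64_t)oi[pool_next++].objid;
    }

    int main(void)
    {
            struct fid f = { .seq = 0x200000001ULL, .oid = 42 };

            /* First access assigns an object from the pool; later accesses
             * find the same binding in the OI. */
            printf("first access:  objid %lld\n", (long long)ost_objid_by_fid(&f));
            printf("second access: objid %lld\n", (long long)ost_objid_by_fid(&f));
            return 0;
    }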
On Oct 4, 2011, at 2:16 PM, David Dillow wrote:
> [...]
> Perhaps I'm misunderstanding something, then.
>
> I understood you to say that we would have a linear OST order that
> starts from the start_index. So bitmap position 0 would be start_index,
> position 1 would be start_index + 1, and so on. If those bits are on,
> then there is an object for this file on those OSTs.

Sorry if I'm being unclear.

start_index is just an offset into the bitmap. That's the OST where the first stripe will be. The next stripe will be on the next OST index (unless it is a hole). When we get to the big hole at the end of the used OSTs, those OST index locations are all skipped (since they are holes), and the next stripe will be at OST index 0, then 1, etc., up to start_index-1 (again, unless holes).

> Am I on the same page so far?
>
> Now, above you mention wrapping around to start_index - 1; I take this
> to mean that at some point, we'd say bitmap position N is no longer OST
> start_index + N, but would be OST 0. Bitmap position N + 1 would be OST
> 1, etc. This scheme may allow for a more compact bitmap when our file
> consists of OSTs at the extreme ends of the ones available, but you have
> to store the maximum OST number when creating the file to avoid having
> the bitmap wrap point shift when you add new OSTs.
>
> Or perhaps I just misunderstood what you meant by wrapping? Did you mean
> bitmap position 0 is always OST 0, and the OST indicated by start_index
> will hold the first object, and each set bit in turn indicates the next
> OST/object, and if we run out of bits in the bitmap before we hit
> stripe_count, we'll start checking again at bitmap position/OST 0?
On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
> Sorry if I'm being unclear.
>
> start_index is just an offset into the bitmap. That's the OST where the
> first stripe will be. The next stripe will be on the next OST index
> (unless it is a hole). When we get to the big hole at the end of the used
> OSTs, those OST index locations are all skipped (since they are holes),
> and the next stripe will be at OST index 0, then 1, etc., up to
> start_index-1 (again, unless holes).

Ok, so bitmap position 0 is always OST 0; thanks for clearing up my misunderstanding.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
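To pin down the semantics settled in this exchange, here is a minimal illustrative helper (not Lustre code; names are assumptions) that maps a logical stripe number to an OST index under the proposed bitmap pattern: bit i of the bitmap corresponds to OST index i, stripe 0 lives on start_index, and later stripes follow increasing OST index order, skipping holes and wrapping to index 0.

    #include <stdint.h>

    static int test_bit(const uint8_t *bm, unsigned int i)
    {
            return (bm[i >> 3] >> (i & 7)) & 1;
    }

    /* Return the OST index holding stripe @stripe_no, or -1 if the bitmap
     * has fewer set bits than stripe_no + 1. */
    int wide_stripe_to_ost(const uint8_t *bitmap, unsigned int nbits,
                           unsigned int start_index, unsigned int stripe_no)
    {
            unsigned int seen = 0;

            for (unsigned int n = 0; n < nbits; n++) {
                    unsigned int ost = (start_index + n) % nbits;

                    if (!test_bit(bitmap, ost))
                            continue;               /* hole: no object on this OST */
                    if (seen++ == stripe_no)
                            return (int)ost;
            }
            return -1;
    }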
On Oct 4, 2011, at 5:25 PM, wangdi wrote:
> Hello, Nathan
>
> On 10/03/2011 01:15 PM, Nathan Rutman wrote:
>> [...]
>> A bitmap will identify which OSTs hold a stripe of this file. The
>> bitmap should fit into the current ext4 EA size limit, giving us
>> ~32k stripes.
>> [...]
>
> Actually, we also discussed using a real object (IAM or another index
> format) to store the stripe pattern, instead of using an EA. Of course
> it would use more space, but it would give us the potential to explore
> the stripe pattern.

One of the main (the only?) benefits of our design (over the current BZ4424 widestriping) is that it does not need any more space than the old MDT stripe pattern. No additional storage, and no additional network traffic to transmit the pattern.

>> FID-on-OST
>> Use a mapping of the MDT FID to uniquely determine an OST object. The
>> clients and MDT add the OST number into the MDT FID (probably just
>> reserve one sequence per OST). (This allows the objects to potentially
>> migrate to different OSTs.) The OSTs then internally must map the FID
>> to a local object id.
>> Note this allows OST-local precreation pools, getting the MDT out of
>> the precreate/orphan-cleanup business and potentially improving create
>> speeds, and also facilitates "create on write" semantics. The FID can
>> be assigned during the first access to the OST object.
>
> I am not sure I follow your idea here. You mean the OST needs to
> internally map the MDT FID (with the OST number added in) to an object
> id (or inode ino)?

Yes.

> So there is no real OST FID?

I suppose -- this is just a mapping of the MDT fid to the local OST object id, via a local lookup on the OST. There would be something like the OI to do this mapping.

> But you also said "The FID can be assigned during the first access to
> the OST object." Could you please explain more here?

Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write. This is not integral to the design, just a side effect.

> FID-on-OST is actually part of DNE (distributed namespace) phase I. It
> basically follows the current fid client/server infrastructure.
>
> 1. The MDT is the fid client, which requests fids from the OST and
> allocates fids for objects during pre-creation.
> 2. The OST is the fid server, which allocates FIDs to MDTs and requests
> super fid sequences from the fid control server (root MDT).
> 3. Similar to the MDT FID, there will be an OI to map FIDs to objects
> inside the OST.

To integrate with this, we would need to have a reserved sequence on each OST that the MDT could assign FIDs from -- the MDT would need to use the same Object ID on all OSTs. For DNE, there would need to be a reserved sequence per OST per MDT.
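As a sketch of the "reserved sequence per OST" idea above (the sequence base, helper, and values here are hypothetical, not existing Lustre code; only the three-field FID shape mirrors Lustre's lu_fid), the OST index could be folded into the FID sequence while the object id part stays the same on every OST:

    #include <stdint.h>

    /* Mirrors the shape of Lustre's lu_fid (sequence, object id, version). */
    struct lu_fid {
            uint64_t f_seq;
            uint32_t f_oid;
            uint32_t f_ver;
    };

    /* Hypothetical base of a range of sequences reserved for wide-striped
     * objects, one sequence per OST index. */
    #define WIDE_STRIPE_SEQ_BASE    0x200000000ULL

    /* Build the FID a client or MDT would use for the stripe on @ost_idx:
     * same object id everywhere, OST index encoded in the sequence. */
    static inline struct lu_fid wide_ost_fid(uint32_t ost_idx,
                                             const struct lu_fid *mdt_fid)
    {
            struct lu_fid fid = {
                    .f_seq = WIDE_STRIPE_SEQ_BASE + ost_idx,
                    .f_oid = mdt_fid->f_oid,        /* common object id */
                    .f_ver = 0,
            };
            return fid;
    }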
Shadow,

Your comment describes create-on-write (CROW), which is vulnerable to orphan creation by clients which have been evicted from the MDS but are not actually dead, unless further safeguards are implemented such as capabilities or server-cluster-wide client eviction.

I also think that the decision to use FIDs in the way you suggest has architectural implications which would benefit from further discussion. The original idea was that a FID would be all you need to identify any object (including its target) and that using them uniformly in this way could help simplify the code and enable further development - e.g. to allow unified targets which mix namespace and data objects to better support small/sparse files.

Making the FID just a unique identifier which requires a target index to specify a specific object doesn't have to be inconsistent with uniform usage for data and metadata, but it has further knock-on implications which must be acknowledged and debated explicitly before we go further. We really must be confident we've thought through all the consequences of our architectural decisions before we invest development effort in them. It's just too expensive to reverse a bad decision otherwise.

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Alexey Lyashkov
> Sent: 05 October 2011 10:29 AM
> To: wangdi
> Cc: Alexander Boyko; Lustre Development Mailing List; Artem Blagodarenko; Nathan Rutman
> Subject: Re: [Lustre-devel] Wide striping
>
> [...]
On Oct 5, 2011, at 2:28 AM, Alexey Lyashkov wrote:
> [...]
> Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}.
> In that case the OST doesn't need to allocate any FIDs, and the MDT can reuse
> the current allocation scheme. In fact we don't need to assign a FID to the OST
> object at file creation time (i.e. when creating the LSM), but we do need a
> guaranteed free OST object to exist when a client tries to access that object.
> For that, the OST can preallocate a pool and report its size to the MDT; the
> MDT knows it uses some objects from that pool, but not which object id is
> assigned to which file. To avoid OST confusion, the client sends the MDT FID
> to the OST when it needs access to an OST object. The OST looks in the OI
> database and checks whether that FID is already assigned to something. If
> assigned, the IO will return an inode; otherwise the OST grabs any free object
> from the pool and assigns it to that FID.
> [...]

What Shadow is saying here (correct me if I'm wrong) is that full-blown FIDs on OSTs are really needed; just a way to map the MDT fid to the local object id. (The other general class of solution being to reserve a specific range of common ost object ids, and do no mapping.) Both of these are significantly less complicated than the DNE FID-on-OST description.

As I was hinting at before, perhaps there's not a very strong case to be made for doing anything other than using the "just make it bigger" solution of BZ4424. I was trying to gauge the interest of the community in an intermediate solution.
On Oct 5, 2011, at 11:18 AM, Nathan Rutman wrote:
> [...]
> What Shadow is saying here (correct me if I'm wrong) is that full-blown
> FIDs on OSTs are really needed;

s/are/aren't/ :(

> just a way to map the MDT fid to the local object id. (The other general
> class of solution being to reserve a specific range of common ost object
> ids, and do no mapping.) Both of these are significantly less complicated
> than the DNE FID-on-OST description.
>
> As I was hinting at before, perhaps there's not a very strong case to be
> made for doing anything other than using the "just make it bigger"
> solution of BZ4424. I was trying to gauge the interest of the community
> in an intermediate solution.
On Oct 5, 2011, at 11:02 AM, Eric Barton wrote:
> Shadow,
>
> Your comment describes create-on-write (CROW), which is vulnerable
> to orphan creation by clients which have been evicted from the MDS
> but are not actually dead, unless further safeguards are implemented
> such as capabilities or server-cluster-wide client eviction.

create-on-write isn't really an integral part of this design, just a side thought. Let's leave it out of the discussion for now.

> I also think that the decision to use FIDs in the way you suggest
> has architectural implications which would benefit from further
> discussion. The original idea was that a FID would be all you need
> to identify any object (including its target) and that using them
> uniformly in this way could help simplify the code and enable further
> development - e.g. to allow unified targets which mix namespace and
> data objects to better support small/sparse files.
>
> Making the FID just a unique identifier which requires a target index
> to specify a specific object doesn't have to be inconsistent with
> uniform usage for data and metadata, but it has further knock-on
> implications which must be acknowledged and debated explicitly
> before we go further. We really must be confident we've thought
> through all the consequences of our architectural decisions before
> we invest development effort in them. It's just too expensive to
> reverse a bad decision otherwise.

That's what we're trying to do now :)

The issue as I see it is that we're thinking about a feature that could be useful today, and is implementable today, except for the fact that there are some longer-term plans that might conflict. Our wide striping could be implemented on top of WC's future FID-on-OST plans -- but that would require that code to exist. So then we have to decide whether waiting is the best option, or whether a more minimal change (probably the "common object ID" from my original arch email) could land first, and then DNE FID-on-OST could change it later.

> Cheers,
> Eric
On 10/05/2011 09:06 AM, Nathan Rutman wrote:
>> [...]
>> But you also said "The FID can be assigned during the first access to
>> the OST object." Could you please explain more here?
>
> Since the FID -> Objid mapping is performed locally, it doesn't need
> to be assigned until the first write. This is not integral to the
> design, just a side effect.

Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may add some limits in the future (what if we decide to store some small-file data on the MDT directly?). It also adds another difference between the MDT and OST; probably it conflicts with the efforts to unify the MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I miss the point of your suggestion.

Thanks
Wangdi
>> FID-on-OST is actually part of DNE (distributed namespace) phase I. It basically follows the current FID client/server infrastructure.
>>
>> 1. The MDT is the fid client, which requests fids from the OST and allocates fids for the objects during pre-creation.
>> 2. The OST is the fid server, which will allocate FIDs to MDTs and request super fid sequences from the fid control server (root MDT).
>> 3. Similar to MDT FIDs, there will be an OI to map a FID to an object inside the OST.
>
> To integrate with this, we would need to have a reserved sequence on each OST that the MDT could assign FIDs from -- the MDT would need to use the same Object ID on all OSTs. For DNE, there would need to be a reserved sequence per OST per MDT.
>
>> The code will be released with DNE sometime next year.
>>
>> Thanks
>> WangDi
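To make the reserved-sequence idea above concrete, here is a minimal sketch of composing a widestripe object FID from a per-OST reserved sequence plus a common object ID, and of an OST-side OI-style lookup that resolves it to a local object. All of the names (struct ws_fid, WS_SEQ_BASE, oi_lookup, the sequence layout) are hypothetical illustrations, not existing Lustre symbols; the real FID format and OI interfaces differ.

/*
 * Hypothetical sketch: a widestripe object FID built from a per-OST
 * reserved sequence plus a common wide-stripe object ID (WSO), and an
 * OI-style table on the OST mapping that FID to a local object number.
 * None of these names are real Lustre symbols.
 */
#include <stdint.h>
#include <stdio.h>

struct ws_fid {
	uint64_t f_seq;	/* reserved sequence, one per OST (per MDT for DNE) */
	uint32_t f_oid;	/* common wide-stripe object ID, same on every OST */
	uint32_t f_ver;	/* unused here */
};

/* assumed base of the reserved sequence range for widestripe objects */
#define WS_SEQ_BASE	0x200000000ULL

/* MDT/client side: derive the object FID for the stripe on a given OST */
static struct ws_fid ws_object_fid(uint32_t ost_idx, uint32_t wso)
{
	struct ws_fid fid = {
		.f_seq = WS_SEQ_BASE + ost_idx,
		.f_oid = wso,
		.f_ver = 0,
	};
	return fid;
}

/* OST side: toy OI table mapping FID -> local inode/object number */
struct oi_entry {
	struct ws_fid fid;
	uint64_t local_objid;	/* could be assigned lazily, on first write */
};

static uint64_t oi_lookup(const struct oi_entry *oi, int n,
			  const struct ws_fid *fid)
{
	for (int i = 0; i < n; i++)
		if (oi[i].fid.f_seq == fid->f_seq &&
		    oi[i].fid.f_oid == fid->f_oid)
			return oi[i].local_objid;
	return 0;	/* not yet mapped: could trigger create-on-write */
}

int main(void)
{
	struct ws_fid fid = ws_object_fid(/* ost_idx */ 7, /* wso */ 42);
	struct oi_entry oi[] = { { fid, 123456 } };

	printf("seq=%#llx oid=%u -> local objid %llu\n",
	       (unsigned long long)fid.f_seq, fid.f_oid,
	       (unsigned long long)oi_lookup(oi, 1, &fid));
	return 0;
}

In this reading, the lazy-assignment point discussed above would simply mean the OI entry for a FID is created on the first write rather than at precreate time.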
Hello!

On Oct 5, 2011, at 3:44 PM, wangdi wrote:
>> Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write. This is not integral to the design, just a side effect.
>
> Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may add some limits in the future (what if we decide to store some small file data on the MDT directly?), and it also adds another difference between MDT and OST -- it probably conflicts with the effort to unify MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I have missed the point of your suggestion.

Another problem I see here is similar to create-on-write. Say we delete a file: do we purge this mapping table too, and then recreate an orphan object when a stale client comes? Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups? Or do we purge only really old objects from it -- what triggers that, and what failure scenarios are there for this process? How do we recover from disasters that happen to this table?

Bye,
Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.
On Wed, 2011-10-05 at 19:31 -0400, Oleg Drokin wrote:
> Hello!
>
> On Oct 5, 2011, at 3:44 PM, wangdi wrote:
>>> Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write. This is not integral to the design, just a side effect.
>>
>> Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may add some limits in the future (what if we decide to store some small file data on the MDT directly?), and it also adds another difference between MDT and OST -- it probably conflicts with the effort to unify MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I have missed the point of your suggestion.
>
> Another problem I see here is similar to create-on-write. Say we delete a file: do we purge this mapping table too, and then recreate an orphan object when a stale client comes? Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups? Or do we purge only really old objects from it -- what triggers that, and what failure scenarios are there for this process? How do we recover from disasters that happen to this table?

Wouldn't the online lfsck work being done for OpenSFS catch and correct these types of problems? It seems like it could provide a base for purging/compacting the table as well, but that's obviously going to be a complicated endeavor....

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
On 2011-10-05, at 9:33 AM, David Dillow <dillowda at ornl.gov> wrote:
> On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
>> Sorry if I'm being unclear.
>>
>> start_index is just an offset into the bitmap. That's the OST where the first stripe will be. The next stripe will be on the next OST index (unless it is a hole). When we get to the big hole at the end of the used OSTs, those OST index locations are all skipped (since they are holes), and the next stripe will be at OST index 0, then 1, etc., up to start_index-1 (again, unless there are holes).
>
> Ok, so bitmap position 0 is always OST 0; thanks for clearing up my misunderstanding.

But this means that the table always needs to be as large as the maximum OST number. If the bitmap started at the starting OST index, it would only need to be as large as the number of stripes.

That said, the limitation of not being able to migrate objects with this layout is a major one. The ability to do online object migration is just arriving with the layout lock (from HSM), so I expect it to be useful to many users.

Cheers, Andreas
On Oct 5, 2011, at 6:51 PM, Andreas Dilger wrote:
> On 2011-10-05, at 9:33 AM, David Dillow <dillowda at ornl.gov> wrote:
>> On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
>>> Sorry if I'm being unclear.
>>>
>>> start_index is just an offset into the bitmap. That's the OST where the first stripe will be. The next stripe will be on the next OST index (unless it is a hole). When we get to the big hole at the end of the used OSTs, those OST index locations are all skipped (since they are holes), and the next stripe will be at OST index 0, then 1, etc., up to start_index-1 (again, unless there are holes).
>>
>> Ok, so bitmap position 0 is always OST 0; thanks for clearing up my misunderstanding.
>
> But this means that the table always needs to be as large as the maximum OST number. If the bitmap started at the starting OST index, it would only need to be as large as the number of stripes.

Yes, the table is as large as the maximum possible OST number. 32,000 stripes fit in a bitmap in the current (non-extended) EA size. If you started at the starting OST index, you would need to record the last OST number as well. Either way, I don't see it as a problem.

> That said, the limitation of not being able to migrate objects with this layout is a major one. The ability to do online object migration is just arriving with the layout lock (from HSM), so I expect it to be useful to many users.

Well, that's why we added the complication of embedding the OST index into the object FIDs that the clients would ask for. Then you could migrate that object to a new OST -- but really only for exceptional cases. General migration, e.g. for space rebalancing, would result in a bunch of additional overhead to figure out where all the stripes moved to. So I agree -- this is a weakness of the bitmap design, which really implies a fixed ordering.
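For reference, the round-robin addressing being discussed here (bitmap position 0 is always OST 0; stripes are assigned in linear OST order starting at start_index, skipping holes and wrapping around) can be illustrated with a short sketch. This is not Lustre code -- the bitmap representation and function names are invented for the example -- but it also shows why a map sized to the maximum OST number, 32,768 bits (4 KiB), lines up with the ~32k-stripe figure quoted for the current EA limit.

/*
 * Illustration only: map a logical stripe number to an OST index for
 * the proposed LOV_PATTERN_BITMAP layout.  Bit i of the map means
 * "OST i holds a stripe of this file".  Stripes are assigned in
 * linear OST order starting at start_idx, skipping holes (clear bits)
 * and wrapping around to start_idx - 1.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_OST	32768	/* 32768 bits == 4 KiB bitmap */

static int ost_in_map(const uint8_t *map, uint32_t idx)
{
	return (map[idx / 8] >> (idx % 8)) & 1;
}

/* Return the OST index holding stripe number 'stripe', or -1 if the
 * bitmap has fewer set bits than stripe + 1. */
static int stripe_to_ost(const uint8_t *map, uint32_t start_idx,
			 uint32_t stripe)
{
	uint32_t seen = 0;

	for (uint32_t n = 0; n < MAX_OST; n++) {
		uint32_t idx = (start_idx + n) % MAX_OST;

		if (!ost_in_map(map, idx))
			continue;	/* hole: skip it */
		if (seen++ == stripe)
			return idx;
	}
	return -1;
}

int main(void)
{
	uint8_t map[MAX_OST / 8] = { 0 };

	/* OSTs 0..9 hold stripes, except OST 4 which is a hole */
	for (uint32_t i = 0; i < 10; i++)
		if (i != 4)
			map[i / 8] |= 1 << (i % 8);

	/* start at OST 6: stripes land on OSTs 6,7,8,9,0,1,2,3,5 */
	for (uint32_t s = 0; s < 9; s++)
		printf("stripe %u -> OST %d\n", s, stripe_to_ost(map, 6, s));
	return 0;
}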
Hello!

On Oct 5, 2011, at 7:56 PM, David Dillow wrote:
>> Another problem I see here is similar to create-on-write. Say we delete a file: do we purge this mapping table too, and then recreate an orphan object when a stale client comes? Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups? Or do we purge only really old objects from it -- what triggers that, and what failure scenarios are there for this process? How do we recover from disasters that happen to this table?
>
> Wouldn't the online lfsck work being done for OpenSFS catch and correct these types of problems?

It probably would, once the online lfsck is implemented.

Bye,
Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.
On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
> ... snip...
> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>
> Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern. (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit into the current ext4 EA size limit, giving us ~32k stripes.
>
> Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few).

1) There will be holes when OST pools are used: if the file can be written only to the set of OSTs from a specific OST pool, and if by virtue of the configuration the OSTs in the pool do not form a contiguous set, then there will be holes in the OST bitmap even if all OSTs are online.

2) "Relatively few holes [in bitmap]" -- did you consider compressing the bitmap? Like BBC or WAH, described at en.wikipedia.org/wiki/Bitmap_index#Compression? Reportedly you can do bitwise operations without decompression. This way you can go up in the number of stripes (well, 32k is a big number). But it may also help control RPC size -- you can represent wide striping with a few integers effectively describing contiguous blocks of OSTs and the holes between them, so the size of the descriptor is a function of the number of blocks and holes and, to a lesser extent, of the number of stripes.

More: it is possible to have two bitmaps:

0000000111111111000000111111 - one describing general "blocks" of OSTs = ((beg1,end1),(beg2,end2))
0000000000000010100100100000 - the other describing "corrections" -- drop two OSTs, add two OSTs; here 4 bits, compressed to X bytes
0000000111111101100100011111 - the OST map, computed on the client as the bitwise XOR of the uncompressed maps (1) and (2)

Each of the two maps is compressed for transfer, so together they should not take much space.

3) If the metadata file format is going to be changed, is this the right time to reserve descriptors for a few replicas of the file data?

In that case we need to store the number of replicas, and a layout descriptor for each replica. Each replica may have a different number of stripes, so you could have a widely striped file replica on SAS disks (or in flash) and replicate it to slower disk storage with one or "a few" stripes for further tape archival. I assume that after the initial writes a file has more or less "stable" content. Replicas can be on different media types, like flash/SAS/SATA or fast/cheap disks -- effectively hierarchical storage. I'm thinking about "lazy" replication like you implemented to replicate data to another file system, but in this case the replication is within the same Lustre file system. The client becomes aware of multiple replicas and can choose which file replica to use (e.g. when some OSTs are down). It eliminates the OST as a single point of failure.

Alex.
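The two-bitmap scheme proposed above is, once the maps are uncompressed, just a bitwise XOR; the following small illustration reproduces the 28-bit example from the message. The helper names are invented and the compression step (BBC/WAH or similar) is deliberately left out, since only the combination is being shown.

/*
 * Illustration of the two-bitmap idea: a "blocks" map describing
 * contiguous runs of OSTs, XORed with a "corrections" map that drops
 * some OSTs and adds others, yields the final OST map.  Compression
 * of the two maps is omitted; only the combination is shown.  Bit 0
 * corresponds to the leftmost character of the strings, matching the
 * example in the email.
 */
#include <stdint.h>
#include <stdio.h>

#define NBITS 28

static uint32_t parse_bits(const char *s)
{
	uint32_t v = 0;

	for (int i = 0; i < NBITS; i++)
		if (s[i] == '1')
			v |= 1u << i;
	return v;
}

static void print_bits(const char *label, uint32_t v)
{
	char buf[NBITS + 1];

	for (int i = 0; i < NBITS; i++)
		buf[i] = (v >> i) & 1 ? '1' : '0';
	buf[NBITS] = '\0';
	printf("%-13s %s\n", label, buf);
}

int main(void)
{
	uint32_t blocks = parse_bits("0000000111111111000000111111");
	uint32_t corr   = parse_bits("0000000000000010100100100000");
	uint32_t ostmap = blocks ^ corr;	/* drop two OSTs, add two */

	print_bits("blocks:", blocks);
	print_bits("corrections:", corr);
	print_bits("OST map:", ostmap);
	/* expected OST map: 0000000111111101100100011111 */
	return 0;
}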
On 2011-10-20, at 10:24 AM, Alex Kulyavtsev wrote:
> On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
>> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>>
>> Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern. (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit into the current ext4 EA size limit, giving us ~32k stripes.
>>
>> Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few).
>
> 1) There will be holes when OST pools are used: if the file can be written only to the set of OSTs from a specific OST pool, and if by virtue of the configuration the OSTs in the pool do not form a contiguous set, then there will be holes in the OST bitmap even if all OSTs are online.

Since the membership in a pool can change after a file is allocated, there cannot be anything in the layout that depends on the current membership of the pool. In this regard, the layout of a file that is allocated in the pool should be identical to a non-pool file, with the exception that it saves the pool name in which the file was created. That allows future operations (migration, replication, etc.) to take the originally requested pool of the user into account.

> 2) "Relatively few holes [in bitmap]" -- did you consider compressing the bitmap? Like BBC or WAH, described at en.wikipedia.org/wiki/Bitmap_index#Compression? Reportedly you can do bitwise operations without decompression. This way you can go up in the number of stripes (well, 32k is a big number). But it may also help control RPC size -- you can represent wide striping with a few integers effectively describing contiguous blocks of OSTs and the holes between them, so the size of the descriptor is a function of the number of blocks and holes and, to a lesser extent, of the number of stripes.

I think that having some kind of bitmap compression seems reasonable, and it extends the number of stripes that can fit into a single layout for most cases. Originally I was thinking that in addition to saving the starting index of the bitmap, we could also save the index at which the bitmap wraps back to 0 (i.e. bit N = (start_idx + N) % wrap_idx), but if there is bitmap compression then the run of zeroes between the starting index and the (lower) ending index could be stored efficiently as well.

> More: it is possible to have two bitmaps:
>
> 0000000111111111000000111111 - one describing general "blocks" of OSTs = ((beg1,end1),(beg2,end2))
> 0000000000000010100100100000 - the other describing "corrections" -- drop two OSTs, add two OSTs; here 4 bits, compressed to X bytes
> 0000000111111101100100011111 - the OST map, computed on the client as the bitwise XOR of the uncompressed maps (1) and (2)
>
> Each of the two maps is compressed for transfer, so together they should not take much space.

Originally, I was thinking that we don't need to do boolean operations on the compressed bitmaps, but then I recall an idea I had many, many years ago about clients sending the "desired" (AND "available") OSC bitmap to the MDS.
When the MDS is allocating objects on the OSTs, it can AND the client bitmap with its allocation bitmap ("pool" bitmap AND "available objects" bitmap) to get the subset of OSTs where objects can be allocated. If we can do operations directly on the compressed bitmaps, not only does that save space, it also saves cycles doing the operations.

> 3) If the metadata file format is going to be changed, is this the right time to reserve descriptors for a few replicas of the file data?
>
> In that case we need to store the number of replicas, and a layout descriptor for each replica. Each replica may have a different number of stripes, so you could have a widely striped file replica on SAS disks (or in flash) and replicate it to slower disk storage with one or "a few" stripes for further tape archival.

Right. I've always thought that the different replicas of the file would have completely independent layouts, to allow what you suggest. The striping of a file would be completely different for nearline storage and archival storage (different OST counts at each layer vs. tape drives).

> I assume that after the initial writes a file has more or less "stable" content. Replicas can be on different media types, like flash/SAS/SATA or fast/cheap disks -- effectively hierarchical storage. I'm thinking about "lazy" replication like you implemented to replicate data to another file system, but in this case the replication is within the same Lustre file system. The client becomes aware of multiple replicas and can choose which file replica to use (e.g. when some OSTs are down). It eliminates the OST as a single point of failure.

Yes, my initial goal is to have background file replication as opposed to real-time replication. The main reason is that the complexity of the implementation is lower. In fact, once we have decided on a new layout format for RAID-1+0 files, background replication and internal file migration can largely be implemented with the HSM code.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
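The MDS-side use of bitmaps described above (AND of the client's "desired" map with the pool map and the "available objects" map) can be sketched as follows. The helpers and sizes are invented for illustration, and plain word-array bitmaps are used, whereas the point in the message is that the same AND could be done directly on compressed maps to save both space and cycles.

/*
 * Sketch of the allocation idea: the MDS ANDs the client's "desired"
 * OSC bitmap with its own pool bitmap and its "OSTs with available
 * precreated objects" bitmap to obtain the set of OSTs that objects
 * may be allocated on.  Word-array bitmaps are used here purely for
 * illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_OST		1024
#define BITS_PER_WORD	64
#define NWORDS		(MAX_OST / BITS_PER_WORD)

static void bitmap_and3(uint64_t *out, const uint64_t *a,
			const uint64_t *b, const uint64_t *c)
{
	for (int i = 0; i < NWORDS; i++)
		out[i] = a[i] & b[i] & c[i];
}

static void bitmap_set(uint64_t *map, unsigned idx)
{
	map[idx / BITS_PER_WORD] |= 1ULL << (idx % BITS_PER_WORD);
}

int main(void)
{
	uint64_t desired[NWORDS] = { 0 }, pool[NWORDS] = { 0 };
	uint64_t avail[NWORDS] = { 0 }, cand[NWORDS];

	/* client wants OSTs 0..15, the pool contains the even OSTs,
	 * and OST 6 has no precreated objects at the moment */
	for (unsigned i = 0; i < 16; i++)
		bitmap_set(desired, i);
	for (unsigned i = 0; i < MAX_OST; i += 2)
		bitmap_set(pool, i);
	for (unsigned i = 0; i < MAX_OST; i++)
		if (i != 6)
			bitmap_set(avail, i);

	bitmap_and3(cand, desired, pool, avail);

	printf("allocation candidates:");
	for (unsigned i = 0; i < MAX_OST; i++)
		if (cand[i / BITS_PER_WORD] >> (i % BITS_PER_WORD) & 1)
			printf(" %u", i);
	printf("\n");	/* expected: 0 2 4 8 10 12 14 */
	return 0;
}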
On Oct 20, 2011, at 2:08 PM, Nathan Rutman wrote:
> On Oct 20, 2011, at 11:45 AM, Andreas Dilger wrote:
>> On 2011-10-20, at 10:24 AM, Alex Kulyavtsev wrote:
>>> On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
>>>> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>>>
>>> 1) There will be holes when OST pools are used: if the file can be written only to the set of OSTs from a specific OST pool, and if by virtue of the configuration the OSTs in the pool do not form a contiguous set, then there will be holes in the OST bitmap even if all OSTs are online.
>>
>> Since the membership in a pool can change after a file is allocated, there cannot be anything in the layout that depends on the current membership of the pool. In this regard, the layout of a file that is allocated in the pool should be identical to a non-pool file, with the exception that it saves the pool name in which the file was created. That allows future operations (migration, replication, etc.) to take the originally requested pool of the user into account.
>
> Yes, exactly like current striping works -- the pool name is recorded, but it is only informational: the actual striping is explicitly recorded.

Sorry for not being clear; I agree the file is laid out at creation time. I'm just trying to make the point that pool configuration is another source of holes in the bitmap, in addition to OSTs being down. Suppose a user purchased eight OSTs each year for three years, and allocated four OSTs to pool1 and four to pool2. The OST numbering gets mixed, and OSTs are assigned as follows:

1111 0000 1111 0000 1111 0000 - pool1
0000 1111 0000 1111 0000 1111 - pool2

All OSTs are up, and the file was striped across all OSTs in pool1. Thus the file layout looks like:

1111 0000 1111 0000 1111 0000

The file has holes in its OST layout because of the pool configuration.

>> 2) "Relatively few holes [in bitmap]" -- did you consider compressing the bitmap? Like BBC or WAH, described at en.wikipedia.org/wiki/Bitmap_index#Compression? Reportedly you can do bitwise operations without decompression. This way you can go up in the number of stripes (well, 32k is a big number). But it may also help control RPC size -- you can represent wide striping with a few integers effectively describing contiguous blocks of OSTs and the holes between them, so the size of the descriptor is a function of the number of blocks and holes and, to a lesser extent, of the number of stripes.
>
> I think that having some kind of bitmap compression seems reasonable, and it extends the number of stripes that can fit into a single layout for most cases. Originally I was thinking that in addition to saving the starting index of the bitmap, we could also save the index at which the bitmap wraps back to 0 (i.e. bit N = (start_idx + N) % wrap_idx), but if there is bitmap compression then the run of zeroes between the starting index and the (lower) ending index could be stored efficiently as well.

> I don't think there's any point in compressing this. 32,000 stripes fit in the old EA limit, and there are going to be plenty of other limits hit before we start using 32,000 OSTs. And even then, we can use the larger EA size.
> So perhaps we turn the question around and ask, "how many stripes do you want to support"?

Frankly, we do not use wide striping at this point, and 32k is a "large number." Having said that, if you have a flash OST on each compute node and/or have replication and can use local disk on compute nodes for opportunistic storage ("local file replicas"), then the number of OSTs in the cluster is O(compute nodes), and that can be a "large number" too.

Best regards,
Alex.