Hello,

Here is a first draft for comments of the Lustre HSM HLD. It is intended to be a support for further analysis and comments from CFS/Sun.

The document covers the main parts of the HSM features, but some elements are still lacking. The policy management and the space manager will be described later.

Let us know your comments and ideas about it.

Regards,
Aurelien Degremont
CEA

[Attachment: hld_hsm.pdf (application/pdf, 159329 bytes) - http://lists.lustre.org/pipermail/lustre-devel/attachments/20080207/dd334e91/attachment-0004.pdf]
All,

I'm new to this list, so I'll start with apologies. My Lustre background is also limited; a situation I hope to fix. As part of the Solaris Software Archiving group, I was asked to review the HSM HLD by my management. That review was sent to Peter Bojanic. He suggested I get involved in the community discussion. This is a posting of my original response, based on a copy of the HLD which seems to be the one posted. I've made a couple of minor corrections.

Page 1, 1, Define coordinator (space coordinator?), define agent, (condense Part II intro, page 14) (for me, MDT, MGS and OST)
Page 8, 3.8, "use" not "used" in second sentence
Page 9, 3.8.2 et al., "precised" (maybe, explicit or precise)
Page 9, 3.8.4, Lustre ID "if" no path
Page 10, 4.1, 1) When archived? (probably in Space Manager portion) SAM-QFS archives well ahead of space need.
         4) External object reference must be unusable, until 5.
         4.2, 2) Implies only one copy per "version"...bad idea
Page 12, 5.3, Last Sentence, This enables, not This ables
         6.1, 100,000 migrations make current migration list operations problematic (let's say we want to move the last migration to be the next migration).
Page 13, Lustre object mtime may not be good enough. There are several mechanisms (like touch) to manipulate mtime, which makes it unusable as a last-written time.
Page 15, a variant on 1.5, ask for/return last valid byte offset (perhaps within a range).
Page 19, Special Path, does this boil down to invisible I/O?
Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed in any order.
Page 25, 1, Punch - becomes "sparse" not "spare". I think this spec needs to be more consistent with its use of data range. It is confusing as laid out.
Page 26, 3.2, space will be exhausted, or space will be low, not space will be missing.
Page 28, protection of Lustre extended attributes?

Issues:
The Space Manager is likely the most important piece. There is no detail on it. This is where archive and other policy is enforced.
The described HSM seems to follow the "copy out when space needed, then purge" model. This function (a Space Manager function) is contrary to SAM, and a shortfall of many HSMs.
File/object association is an important component of SAM. For example, if I access a file in a source tree, I'm likely to access the others as well.
The purge (3.2, Space Manager needs to make room) and 4.1 "needs to be atomic" are complex operations. Sequencing is important.
Coordination between agents seems important. For example, if agents requested new copy-outs on objects striped on 10 different stores, ordering them on tape seems difficult.
What is the backup story for Lustre? How does that play with the HSM?

--
---------------------------------------------------------------------
Rick Matthews                       email: Rick.Matthews at sun.com
Sun Microsystems, Inc.              phone: +1 (651) 554-1518
1270 Eagan Industrial Road          phone (internal): 54418
Suite 160                           fax:   +1 (651) 554-1540
Eagan, MN 55121-1231 USA            main:  +1 (651) 554-1500
---------------------------------------------------------------------
Hello,

Thank you for your review. I add some comments in the following.

> Page 1, 1, Define coordinator (space coordinator?), define agent,
> (condense Part II intro, page 14) (for me, MDT, MGS and OST)

These are defined in the arch wiki pages.

> Page 10, 4.2, 2) Implies only one copy per "version"...bad idea

Different versions correspond to different files in the external storage. We take the most recent. Not sure I understand your remark.

> Page 13, Lustre object mtime may not be good enough. There are several
> mechanisms (like touch) to manipulate mtime, which makes it
> unusable as a last-written time.

If a user makes a touch in the past, this changes the mtime and can hide previous writes. If we want to keep the real write time we need to add a new time field in the Lustre backend (maybe ZFS has it).

> Page 19, Special Path, does this boil down to invisible I/O?

The path is /mnt_mount/.lustre/fid/FID_NUMBER. When a file is opened through this path, a flag is carried to the OSS to avoid triggering a copy-in (this is used to fill the file).

> Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
> in any order.

Yes.

> Issues:
> The Space Manager is likely the most important piece. There is no
> detail on it. This is where archive and other policy is enforced.

The space manager is based on the changelogs/feed Lustre feature, which is very new (a draft HLD has just been published). This is why it is not described at this time.

> The described HSM seems to follow the "copy out when space needed,
> then purge" model. This function (a Space Manager function) is contrary
> to SAM, and a shortfall of many HSMs.

No, the space manager is doing pre-migration, and when free space is needed, it only has to make punches.

> Coordination between agents seems important. For example, if agents
> requested new copy-outs on objects striped on 10 different stores,
> ordering them on tape seems difficult.

Tape access optimization has to be made by the archival system. We try to put as little external storage knowledge as possible in Lustre, to be external-storage independent.

> What is the backup story for Lustre? How does that play with the HSM?

The HSM does not back up the namespace. That has to be done with a separate tool like an MDT scanner. The copy tool can use the FID2PATH() function to save the object pathname with the file.
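To illustrate the special path described above, here is a minimal sketch (in C) of a copy tool filling a purged file through /mnt_mount/.lustre/fid/FID_NUMBER. The O_HSM_RESTORE flag name and value are assumptions: the message only says that "a flag is carried to the OSS", without naming it.

    /* Hypothetical sketch of a copy tool restoring a file through the
     * special fid path. O_HSM_RESTORE is an assumption; the HLD only
     * states that a flag must be carried to the OSS so that the write
     * does not itself trigger a copy-in. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef O_HSM_RESTORE
    #define O_HSM_RESTORE 0 /* placeholder: real flag not defined in the HLD */
    #endif

    int restore_by_fid(const char *mnt, const char *fid, int src_fd)
    {
        char path[4096];
        char buf[64 * 1024];
        ssize_t n;

        snprintf(path, sizeof(path), "%s/.lustre/fid/%s", mnt, fid);
        int dst_fd = open(path, O_WRONLY | O_HSM_RESTORE);
        if (dst_fd < 0)
            return -1;

        /* Copy the archived data back into the Lustre file. */
        while ((n = read(src_fd, buf, sizeof(buf))) > 0) {
            if (write(dst_fd, buf, n) != n) {
                close(dst_fd);
                return -1;
            }
        }
        close(dst_fd);
        return n < 0 ? -1 : 0;
    }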
JC.LAFOUCRIERE at CEA.FR wrote:

Thanks for allowing me to participate.

> Hello
>
> Thank you for your review. I add some comments in the following.
>
> Page 1, 1, Define coordinator (space coordinator?),
>          define agent, (condense Part II intro, page 14)
>          (for me, MDT, MGS and OST)
> These are defined in the arch wiki pages.

Thank you, I still haven't got to them yet...but plan to.

> Page 10,
> 4.2, 2) Implies only one copy per "version"...bad idea
> Different versions correspond to different files in the external storage.
> We take the most recent. Not sure I understand your remark.

A basic mantra of SAM-QFS and other data retention systems is that one image of the data is vulnerable (a tape breaks, or is otherwise overwritten). While the archival system can be responsible for making multiple identical images, it can still represent a single point of failure. Note: I am using "version" to represent a point-in-time image of the file's data, and "copy" to represent an image of that version. (See LOCKSS for additional references on copies.)

> Page 13, Lustre object mtime may not be good enough. There are several
>          mechanisms (like touch) to manipulate mtime, which makes it
>          unusable as a last-written time.
> If a user makes a touch in the past, this changes the mtime and can hide
> previous writes. If we want to keep the real write time we need to add a
> new time field in the Lustre backend (maybe ZFS has it).

What the archival system needs to know is that the previously made copy is stale (or that a first copy needs to be made), which seems to be triggered by a user (not archive or other - like restore) write operation.

> Page 19, Special Path, does this boil down to invisible I/O?
> The path is /mnt_mount/.lustre/fid/FID_NUMBER. When a file is opened
> through this path, a flag is carried to the OSS to avoid triggering a
> copy-in (this is used to fill the file).
>
> Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
>          in any order.
> Yes.
>
> Issues:
> The Space Manager is likely the most important piece. There is no
> detail on it. This is where archive and other policy is enforced.
> The space manager is based on the changelogs/feed Lustre feature, which
> is very new (a draft HLD has just been published). This is why it is not
> described at this time.

OK...also consider using change logs as a trigger for the need of a new archive version (not copy). Alleviates the mtime issue above.

> The described HSM seems to follow the "copy out when space needed,
> then purge" model. This function (a Space Manager function) is contrary
> to SAM, and a shortfall of many HSMs.
> No, the space manager is doing pre-migration, and when free space is
> needed, it only has to make punches.

OK, so who schedules the pre-migration to the archive system?

> Coordination between agents seems important. For example,
>          if agents requested new copy-outs on objects striped on
>          10 different stores, ordering them on tape seems difficult.
> Tape access optimization has to be made by the archival system. We try
> to put as little external storage knowledge as possible in Lustre, to be
> external-storage independent.

The isolation between archive system and file system is (to me) a good idea. I'd just like you to consider that the recall (stage-in) events can be optimized. At least, make sure the archive system is allowed to reorder as needed (hence the async - list of tuples in any order - question above). Think of other associations between files and live storage as 1) a pre-stage operation, or 2) a disk cache pre-fetch operation.
I hope I'm using understandable words ;>)

> What is the backup story for Lustre? How does that play with the HSM?
> The HSM does not back up the namespace. That has to be done with a
> separate tool like an MDT scanner. The copy tool can use the FID2PATH()
> function to save the object pathname with the file.

One point here is that an HSM + namespace/metadata backup + unarchived data capture can be used as a nearly continuous backup operation with a relatively tiny backup window.
Hello,

First of all, thanks for your remarks.

Information explained in the architecture documents from the Arch Wiki has not been re-explained in the HLD, so some points could be unclear, but please read or check the arch docs first. If the HLD must be self-sufficient or more details are really needed, let me know. I will clarify some points anyway in the new document version.

Rick Matthews wrote:
> Page 10, 4.1, 1) When archived? (probably in Space Manager portion)
>          SAM-QFS archives well ahead of space need.

Concerning the archived copies' vulnerability, I'm not sure it is Lustre's responsibility to manage several copies of each of its file versions in the HSM...

> 6.1, 100,000 migrations make current migration list operations
>      problematic (let's say we want to move the last migration to
>      be the next migration).

You speak about pending migrations? This is just pointer manipulation; I do not see a real problem at this level. This value is only an algorithmic indication, not about resources (memory, ...). But we could decrease this value to 10,000.

> Page 13, Lustre object mtime may not be good enough. There are several
>          mechanisms (like touch) to manipulate mtime, which makes it
>          unusable as a last-written time.

In fact, this value is only needed for user information, not for Lustre internals. Lustre will base its comparison on the FID version. The mtime field is used for listing the file copies in the HSM; as the Lustre FID version is not relevant for the user, it will indicate the associated file date at that time. (Just a quick example, not the final output:)

user$ list_hsm_copies ./foo
Storage   Date         Size     Version
========================================
HSM1      Feb  2 2006  1566162  1
HSM1      Jun 18 2007  1423540  2
HSM1      Jun 18 2007  1900051  54

But the touch could be problematic. Lustre gurus, is there another time field we could use instead? Should we add a "last-modification-field-which-ignores-touch"? Is it really a problem if we display a "touched" time? In that case, we display what the user set on the file; we suppose he did it on purpose.

> Page 15, a variant on 1.5, ask for/return last valid byte offset
>          (perhaps within a range).

Why not... But do you have use cases where the current "Data available" feature as explained in 1.5 is not sufficient?

> Page 28, protection of Lustre extended attributes?

I do not see what you mean.

> Issues:
> The purge (3.2, Space Manager needs to make room) and 4.1
> "needs to be atomic" are complex operations. Sequencing is
> important.

Does "transactional" fit?

I will add a Bugzilla entry and a new updated version of the HLD next Monday.

Regards,

--
Aurelien Degremont
CEA/DAM - DIF/DSSI/SISR
DEGREMONT Aurelien wrote:
> Hello
>
> Here is a first draft for comments of the Lustre HSM HLD.
> It is intended to be a support for further analysis and comments from
> CFS/Sun.
>
> The document covers the main parts of the HSM features but some elements
> are still lacking.
> The policy management and the space manager will be described later.
>
> Let us know your comments and ideas about it.
>
> Regards,

5.1 external storage list - is this to be stored on the MGS device or a separate device? If the coordinator lives on the MGS, why not its storage as well? In any case, it should be possible to co-locate the coordinator on the MGS and use the MGS's storage device, in the same way that the MGS can currently co-locate with the MDT.

6.3 object ref should include version number. Also include checksum?

How does the coordinator request activity from an agent? If the coordinator is the RPC server, then it's up to the agents to make requests; agents aren't listening for RPC requests themselves.

2.1 Archiving one Lustre file
There should not be a cache miss when archiving a Lustre file; perhaps open-by-fid is intended to bypass atime updates so that the file isn't marked as "recently accessed"?

2.2 Restoring a file
"External ID" presumably contains all information required to retrieve the file - tape #, path name, etc.? Once the file is copied back, we should probably restore the original ctime, mtime, atime - the coordinator is storing this, correct?

IV2 - why not multiple purged windows? Seems like if you're going to purge 1 object out of a file, you might want to purge more. Specifically, it will probably be a common case to purge every object of a file from a particular OST. This is not contiguous in a striped file. I don't see any reason to purge anything smaller than an entire object on an OST - is there good reason for this? If that's the case, then the OST must keep track of purged objects, not ranges within an existing object.

If the MDT is tracking purged areas also, then there's a good potential synergy here with a missing OST -- if the missing OST's objects are marked as purged, then we can potentially recover them automatically from HSM...

4.2 How is a purge request recovered? For example, MDT says purge obj1 from ost1, ost1 replies "ok", but then dies before it actually does the purge. It reboots, doesn't know anything about the purge request now, but the MDT has marked it as purged.

Transparent access - should this avoid modification of atime/mtime?

V2.1 How long does the OST wait for completion? Is there a timeout? We probably need a "no timeout if progress is being made" kind of function - clients currently do this kind of thing with OSTs.

V2.2 No need to copy-in purged data on full-object-size writes.

> Page 13, Lustre object mtime may not be good enough. There are several
> mechanisms (like touch) to manipulate mtime, which makes it
> unusable as a last-written time.
> If a user makes a touch in the past, this changes the mtime and can hide
> previous writes. If we want to keep the real write time we need to add a
> new time field in the Lustre backend (maybe ZFS has it).

If a user touches or otherwise modifies the mtime on purpose, they presumably know what they are doing. Besides, we're using the object version number, not the mtime, to determine whether a file is up to date. I think this can be ignored.
Nathaniel Rutman wrote:
> 5.1 external storage list - is this to be stored on the MGS device or a
> separate device? If the coordinator lives on the MGS, why not its
> storage as well? In any case, it should be possible to co-locate the
> coordinator on the MGS and use the MGS's storage device, in the same
> way that the MGS can currently co-locate with the MDT.
> How does the coordinator request activity from an agent? If the
> coordinator is the RPC server, then it's up to the agents to make
> requests; agents aren't listening for RPC requests themselves.

Presently, it is never said that the coordinator will live on the MGS. The Coordinator constraints are:
1 - Must receive various migration requests from OST/MDT.
2 - Should be able to communicate with Agents and ask them for migrations.
3 - Should store configuration and migration logs.

I think #1 and #2 are two different APIs. The coordinator is clearly an RPC server for the first one. How #2 should be implemented is not so clear. What would be the "Lustre way" here?

For #3, the few logs that will be backed up here are not huge, and they surely could be co-located with another Target, but I'm not sure this should be mandatory. This device should be available to several servers, for failover like the other Targets. We could imagine having more than one coordinator in the long term. I'm not sure it is a good idea to stick it to another target.

> 6.3 object ref should include version number. Also include checksum?

For data coherency? Should we add an explicit checksum for those values (stored in an EA) or use a possible backend feature (can ZFS and ldiskfs detect EA value corruption by themselves?)?

> 2.1 Archiving one Lustre file
> There should not be a cache miss when archiving a Lustre file; perhaps
> open-by-fid is intended to bypass atime updates
> so that the file isn't marked as "recently accessed"?
>
> Transparent access - should this avoid modification of atime/mtime?

I would say yes.

> 2.2 Restoring a file
> "External ID" presumably contains all information required to retrieve
> the file - tape #, path name, etc.?
> Once the file is copied back, we should probably restore the original
> ctime, mtime, atime - the coordinator is storing this, correct?

External ID is an opaque value managed by the archiving tool. If the HSM can store a lot of metadata, only a ref is needed; if not, the tool is responsible for storing all the data it needs. Anyway, this is totally opaque to Lustre. I hope the HSMs will not need much data in this field. HPSS does not need much data; it uses its internal DB to store it. I suppose SAM does also.

> IV2 - why not multiple purged windows? Seems like if you're going to
> purge 1 object out of a file, you might want to purge more.
> Specifically, it will probably be a common case to purge every object of
> a file from a particular OST. This is not contiguous in a
> striped file.
> I don't see any reason to purge anything smaller than an entire object
> on an OST - is there good reason for this?

Multiple purged windows are subtle. If you permit this feature, you could technically have, in the worst case, one purged window per byte, and this could be very huge to store. Do you think you will make several holes in the same file? In which cases?
In fact, the more common case is to totally purge a file which has been migrated to the HSM, and it is only an optimisation to keep the start and the end of the file on disk, to avoid triggering tons of cache misses with commands like "file foo/*" or a tool like Nautilus or Windows Explorer browsing the directory.

The purged window is stored per object, OST object and MDT object. So, if several objects are purged, each object will store its own purged window. But the MDT object describing this file will store a special purged window which starts at the smallest unavailable byte and ends at the first available one. The MDT purged window indicates "if you do I/O in this range, you're not sure the data are there" or "outside of this area, I guarantee data are present". Maintaining multiple purged windows would be a headache, with no real need I think. Moreover, people have asked for an OST-object-based migration, even if I think whole-file migration will be the most common case.

> If that's the case, then
> the OST must keep track of purged objects, not ranges within an existing
> object.

Objects are not removed, only their data. All metadata are kept.

> If the MDT is tracking purged areas also, then there's a good potential
> synergy here with a missing OST --
> if the missing OST's objects are marked as purged, then we can
> potentially recover them automatically from HSM...

What do you call a "missing OST"? A corrupt one? An offline one? Unavailable? Where will you copy back the object data? To another OST object? With the purged window on each OST object and the MDT, and the file striping info, we could easily restore the missing parts.

> 4.2 How is a purge request recovered? For example, MDT says purge obj1
> from ost1, ost1 replies "ok", but then dies before it actually
> does the purge. It reboots, doesn't know anything about the purge request
> now, but the MDT has marked it as purged.

The OST asynchronously acknowledges the purge when it is done. The MDT marks it purged only when it is really done. I will clarify this.

> V2.1 How long does the OST wait for completion?
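To make the purged-window discussion concrete, here is a minimal sketch of the single [start, end) record kept per object (OST stripe object or MDT file object) as described above. The struct and function names are assumptions, not the HLD's actual on-disk format.

    /* Hypothetical sketch of the single purged window kept per object.
     * Names are assumptions; the HLD only specifies one unavailable
     * range per OST object, plus a covering range on the MDT object. */
    #include <stdbool.h>
    #include <stdint.h>

    struct hsm_purged_window {
        uint64_t pw_start; /* first unavailable byte (object offset) */
        uint64_t pw_end;   /* first available byte after the window  */
    };

    /* "If you do I/O in this range, you're not sure the data are there." */
    static bool io_may_hit_purged(const struct hsm_purged_window *pw,
                                  uint64_t off, uint64_t len)
    {
        return off < pw->pw_end && off + len > pw->pw_start;
    }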
On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> But the touch could be problematic. Lustre gurus, is there another time
> field we could use instead? Should we add a
> "last-modification-field-which-ignores-touch"? Is it really a problem
> if we display a "touched" time? In that case, we display what the
> user set on the file; we suppose he did it on purpose.

There was work done in ext4/ldiskfs to add a 64-bit "version" field to the on-disk inode, for use by Lustre and NFSv4. In the ldiskfs case Lustre was free to store any information in this field it wanted. The planned use for this field is for "version based recovery" and it has the semantic that it is an increasing (though not necessarily sequential) version number that tracks any change to the file. This is stored in each inode on the MDT and each object on the OSTs.

In ZFS I believe there is also a "last modified transaction group" (txg) number stored with each dnode that could be used in a similar manner.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
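As an illustration of the "increasing, though not necessarily sequential" semantic Andreas describes, here is a hedged sketch of how a backend might bump such a version on every modification. The update rule and all names are assumptions for illustration, not the actual ldiskfs or ZFS behaviour.

    /* Hypothetical sketch of a per-inode 64-bit change version: every
     * modification moves it forward, typically to the current
     * transaction ID, so comparing versions tells you whether an
     * object changed. Names and update rule are assumptions. */
    #include <stdint.h>

    struct disk_inode {
        uint64_t i_version; /* 64-bit change counter stored on disk */
    };

    static void inode_modified(struct disk_inode *inode, uint64_t current_txid)
    {
        /* Stays monotonic even if several updates land in one transaction. */
        if (current_txid > inode->i_version)
            inode->i_version = current_txid;
        else
            inode->i_version++;
    }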
Versions are critical - we need them for multiple things; let's make sure we get exactly the right thing in ZFS also.

- Peter -

Andreas Dilger wrote:
> On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
>> But the touch could be problematic. Lustre gurus, is there another time
>> field we could use instead? Should we add a
>> "last-modification-field-which-ignores-touch"? Is it really a problem
>> if we display a "touched" time? In that case, we display what the
>> user set on the file; we suppose he did it on purpose.
>
> There was work done in ext4/ldiskfs to add a 64-bit "version" field to
> the on-disk inode, for use by Lustre and NFSv4. In the ldiskfs case
> Lustre was free to store any information in this field it wanted. The
> planned use for this field is for "version based recovery" and it has
> the semantic that it is an increasing (though not necessarily sequential)
> version number that tracks any change to the file. This is stored in
> each inode on the MDT and each object on the OSTs.
>
> In ZFS I believe there is also a "last modified transaction group" (txg)
> number stored with each dnode that could be used in a similar manner.
>
> Cheers, Andreas
Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>> 5.1 external storage list - is this to be stored on the MGS device or a
>> separate device? If the coordinator lives on the MGS, why not its
>> storage as well? In any case, it should be possible to co-locate the
>> coordinator on the MGS and use the MGS's storage device, in the same
>> way that the MGS can currently co-locate with the MDT.
>> How does the coordinator request activity from an agent? If the
>> coordinator is the RPC server, then it's up to the agents to make
>> requests; agents aren't listening for RPC requests themselves.
>
> Presently, it is never said that the coordinator will live on the MGS.
> The Coordinator constraints are:
> 1 - Must receive various migration requests from OST/MDT.
> 2 - Should be able to communicate with Agents and ask them for migrations.
> 3 - Should store configuration and migration logs.
>
> I think #1 and #2 are two different APIs. The coordinator is clearly an
> RPC server for the first one. How #2 should be implemented is not so
> clear. What would be the "Lustre way" here?

With userspace servers, presumably we have some way of passing LNET messages from kernel to userspace. We should probably still go through LNET for #2 in order to use the broadest range of network fabrics. So it could be the same or a similar RPC. There is no "Lustre way" for this area - we've never done this kind of thing before.

> For #3, the few logs that will be backed up here are not huge, and they
> surely could be co-located with another Target, but I'm not sure this
> should be mandatory. This device should be available to several servers,
> for failover like the other Targets. We could imagine having more than
> one coordinator in the long term. I'm not sure it is a good idea to
> stick it to another target.

Not mandatory, but possible is nice. Minimize the number of required partitions.

>> 6.3 object ref should include version number. Also include checksum?
>
> For data coherency? Should we add an explicit checksum for those values
> (stored in an EA) or use a possible backend feature (can ZFS and
> ldiskfs detect EA value corruption by themselves?)?

ZFS can, ldiskfs cannot. Anyhow, it was just a thought. Doesn't hurt to allow space for it.

>> 2.1 Archiving one Lustre file
>> There should not be a cache miss when archiving a Lustre file; perhaps
>> open-by-fid is intended to bypass atime updates
>> so that the file isn't marked as "recently accessed"?
>>
>> Transparent access - should this avoid modification of atime/mtime?
>
> I would say yes.
>
>> 2.2 Restoring a file
>> "External ID" presumably contains all information required to retrieve
>> the file - tape #, path name, etc.?
>> Once the file is copied back, we should probably restore the original
>> ctime, mtime, atime - the coordinator is storing this, correct?
>
> External ID is an opaque value managed by the archiving tool. If the HSM
> can store a lot of metadata, only a ref is needed; if not, the tool is
> responsible for storing all the data it needs. Anyway, this is totally
> opaque to Lustre.
> I hope the HSMs will not need much data in this field. HPSS does not
> need much data; it uses its internal DB to store it. I suppose SAM
> does also.

What about restoring the original ctime, mtime, atime? I think we must store it in the coordinator because we must work with all HSMs, and I think it is important to restore it.

>> IV2 - why not multiple purged windows?
>> Seems like if you're going to
>> purge 1 object out of a file, you might want to purge more.
>> Specifically, it will probably be a common case to purge every object of
>> a file from a particular OST. This is not contiguous in a
>> striped file.
>> I don't see any reason to purge anything smaller than an entire object
>> on an OST - is there good reason for this?
>
> Multiple purged windows are subtle. If you permit this feature, you could
> technically have, in the worst case, one purged window per byte, and
> this could be very huge to store. Do you think you will make several
> holes in the same file? In which cases?

Like I said, I don't see any reason to purge anything smaller than a full object; I would in fact disallow purging of an arbitrary byte range, and only allow purging on full-object boundaries.

> In fact, the more common case is to totally purge a file which has been
> migrated to the HSM, and it is only an optimisation to keep the start and
> the end of the file on disk, to avoid triggering tons of cache misses
> with commands like "file foo/*" or a tool like Nautilus or Windows
> Explorer browsing the directory.

Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't think it helps much to keep less than that in the beginning/end objects, so I would say just keep the first and last blocks instead.

> The purged window is stored per object, OST object and MDT object.
> So, if several objects are purged, each object will store its own purged
> window. But the MDT object describing this file will store a special
> purged window which starts at the smallest unavailable byte and ends at
> the first available one. The MDT purged window indicates "if you do I/O
> in this range, you're not sure the data are there" or "outside of this
> area, I guarantee data are present".
> Maintaining multiple purged windows would be a headache, with no real
> need I think.
> Moreover, people have asked for an OST-object-based migration, even if I
> think whole-file migration will be the most common case.
>
>> If that's the case, then
>> the OST must keep track of purged objects, not ranges within an existing
>> object.
>
> Objects are not removed, only their data. All metadata are kept.
>
>> If the MDT is tracking purged areas also, then there's a good potential
>> synergy here with a missing OST --
>> if the missing OST's objects are marked as purged, then we can
>> potentially recover them automatically from HSM...
>
> What do you call a "missing OST"? A corrupt one? An offline one?
> Unavailable?

Yes. All of the above. Obviously we need to distinguish between "permanently gone" and "temporarily gone".

> Where will you copy back the object data? To another OST object?

Yes. Some kind of recovery will take place to generate a new object on a different OST and we can restore the data there.

> With the purged window on each OST object and the MDT, and the file
> striping info, we could easily restore the missing parts.

Exactly. This is why I say we should think about this now, to allow for this possibility.

>> 4.2 How is a purge request recovered? For example, MDT says purge obj1
>> from ost1, ost1 replies "ok", but then dies before it actually
>> does the purge. It reboots, doesn't know anything about the purge
>> request now, but the MDT has marked it as purged.
>
> The OST asynchronously acknowledges the purge when it is done. The MDT
> marks it purged only when it is really done. I will clarify this.
>
>> V2.1 How long does the OST wait for completion?
>> Is there a timeout? We
>> probably need a "no timeout if progress is being
>> made" kind of function - clients currently do this kind of thing with
>> OSTs.
>
> I'm sure Lustre already has similar mechanisms for optimized timeouts in
> this kind of situation that we could reuse here.
> What you describe is a good approach I think.
>
>> V2.2 No need to copy-in purged data on full-object-size writes.
>
> True. We could add such an optimization. But this is only useful for
> small files or very widely striped ones, isn't it?

No, we very frequently write entire stripes (objects). Lustre clients can optimize for this.

> Thanks for your comments.
Hi,

On Seg, 2008-02-11 at 11:18 -0700, Andreas Dilger wrote:
> On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> > But the touch could be problematic. Lustre gurus, is there another time
> > field we could use instead? Should we add a
> > "last-modification-field-which-ignores-touch"? Is it really a problem
> > if we display a "touched" time? In that case, we display what the
> > user set on the file; we suppose he did it on purpose.
>
> (snip)
>
> In ZFS I believe there is also a "last modified transaction group" (txg)
> number stored with each dnode that could be used in a similar manner.

Hmm... I think ZFS only has zp_gen in the dnode/znode, which is the txg of the file creation. We also cannot use the txg birth time of the block where the dnode is stored, because a metadnode block holds several dnodes.

I may be missing something here, but isn't "ctime" the appropriate value to use here?

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
On Feb 11, 2008 21:11 +0000, Ricardo Correia wrote:
> On Seg, 2008-02-11 at 11:18 -0700, Andreas Dilger wrote:
> > On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> > > But the touch could be problematic. Lustre gurus, is there another
> > > time field we could use instead? Should we add a
> > > "last-modification-field-which-ignores-touch"? Is it really a problem
> > > if we display a "touched" time? In that case, we display what the
> > > user set on the file; we suppose he did it on purpose.
> >
> > (snip)
> >
> > In ZFS I believe there is also a "last modified transaction group" (txg)
> > number stored with each dnode that could be used in a similar manner.
>
> Hmm... I think ZFS only has zp_gen in the dnode/znode, which is the txg
> of the file creation. We also cannot use the txg birth time of the block
> where the dnode is stored, because a metadnode block holds several
> dnodes.
>
> I may be missing something here, but isn't "ctime" the appropriate
> value to use here?

The problem with ctime (on Linux as well) is that it is possible for the system clock to go backward, whether due to ntp, or because the hardware clock is incorrect/reset, so it cannot be depended upon to be monotonically increasing for the life of the Lustre filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
> The problem with ctime (on Linux as well) is that it is possible for the
> system clock to go backward, whether due to ntp, or because the hardware
> clock is incorrect/reset, so it cannot be depended upon to be
> monotonically increasing for the life of the Lustre filesystem.

Ok. In that case, we could either add a new 64-bit version field to the dnode (or znode) similar to the one in ldiskfs, or we could look at the birth time (txg nr) of all the block pointers in the dnode.

Using txg numbers might not be very useful if an object is migrated from one storage device to another, but I have not read the HSM HLD so I'm not sure if this is a problem or not.

Cheers,
Ricardo
Ricardo M. Correia wrote:
> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>> The problem with ctime (on Linux as well) is that it is possible for the
>> system clock to go backward, whether due to ntp, or because the hardware
>> clock is incorrect/reset, so it cannot be depended upon to be
>> monotonically increasing for the life of the Lustre filesystem.
>
> Ok. In that case, we could either add a new 64-bit version field to
> the dnode (or znode) similar to the one in ldiskfs, or we could look
> at the birth time (txg nr) of all the block pointers in the dnode.
> Using txg numbers might not be very useful if an object is migrated
> from one storage device to another, but I have not read the HSM HLD so
> I'm not sure if this is a problem or not.

I'm missing the point of this discussion. Clearly we shouldn't/can't use ctime/mtime for anything internal to Lustre; that is what object versions are all about. Why are we talking about adding new fields or anything else?
I'm probably responsible for opening this can of worms. I inferred from the HSM HLD that mtime was proposed to be used for state change, or version, of the file/object. As the discussion bears out, mtime for this purpose would be a bad idea. A reliable way of detecting change is needed, and if it already exists within Lustre, great!

What I think is far more significant is the involvement of the community on issues such as this. The more folks examining (and critiquing) the details, the better. Nice to see such an active community.

Nathaniel Rutman wrote:
> Ricardo M. Correia wrote:
>> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>>> The problem with ctime (on Linux as well) is that it is possible for
>>> the system clock to go backward, whether due to ntp, or because the
>>> hardware clock is incorrect/reset, so it cannot be depended upon to be
>>> monotonically increasing for the life of the Lustre filesystem.
>>
>> Ok. In that case, we could either add a new 64-bit version field to
>> the dnode (or znode) similar to the one in ldiskfs, or we could look
>> at the birth time (txg nr) of all the block pointers in the dnode.
>> Using txg numbers might not be very useful if an object is migrated
>> from one storage device to another, but I have not read the HSM HLD
>> so I'm not sure if this is a problem or not.
>
> I'm missing the point of this discussion. Clearly we shouldn't/can't
> use ctime/mtime for anything internal to Lustre; that is what object
> versions are all about. Why are we talking about adding new fields or
> anything else?
On Seg, 2008-02-11 at 14:32 -0800, Nathaniel Rutman wrote:
> I'm missing the point of this discussion. Clearly we shouldn't/can't
> use ctime/mtime for anything internal to Lustre; that is what object
> versions are all about. Why are we talking about adding new fields or
> anything else?

If by object versions you are referring to the version field in the ldiskfs inodes that Andreas mentioned, then we need to add a similar field/attribute in ZFS. It seems that Andreas has already filed bug 14865 for this.

Cheers,
Ricardo
On Feb 11, 2008 12:33 -0800, Nathaniel Rutman wrote:
> Aurelien Degremont wrote:
> > Nathaniel Rutman wrote:
> >> IV2 - why not multiple purged windows? Seems like if you're going to
> >> purge 1 object out of a file, you might want to purge more.
> >> Specifically, it will probably be a common case to purge every object
> >> of a file from a particular OST. This is not contiguous in a
> >> striped file.
> >> I don't see any reason to purge anything smaller than an entire object
> >> on an OST - is there good reason for this?
> >
> > Multiple purged windows are subtle. If you permit this feature, you
> > could technically have, in the worst case, one purged window per byte,
> > and this could be very huge to store. Do you think you will make
> > several holes in the same file? In which cases?

One issue is that if you are purging individual objects from a file, your windows will be quite disjoint at the file level. That may not be a serious problem for applications that only look at the first and last chunks of a file. I can imagine use cases for extremely large files and limited-size caches where there is a need to access only subsets of the file (i.e. the entire file cannot be resident at one time). That said, it may be that this is too complex for the initial implementation.

> Like I said, I don't see any reason to purge anything smaller than a
> full object; I would in fact disallow purging of an arbitrary byte range,
> and only allow purging on full-object boundaries.

That is impractical, for the reasons that Aurelien mentioned - we want to avoid file re-staging for tools like "file" and GUIs that read the start/end of files to determine file type and icons.

> > In fact, the more common case is to totally purge a file which has been
> > migrated to the HSM, and it is only an optimisation to keep the start
> > and the end of the file on disk, to avoid triggering tons of cache
> > misses with commands like "file foo/*" or a tool like Nautilus or
> > Windows Explorer browsing the directory.
>
> Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
> think it helps much to keep less than that in the beginning/end objects,
> so I would say just keep the first and last blocks instead.

What if the file is N*1MB + 1 byte? We need to be able to keep something like 64kB for a Windows icon, so having some arbitrary byte range seems reasonable.

> > The purged window is stored per object, OST object and MDT object.
> > So, if several objects are purged, each object will store its own
> > purged window. But the MDT object describing this file will store a
> > special purged window which starts at the smallest unavailable byte
> > and ends at the first available one.

I think this should read "ends at the highest range contiguous to the end of the file" or similar, or it will be misleading in the multi-object case.

> >> the OST must keep track of purged objects, not ranges within an
> >> existing object.
> >
> > Objects are not removed, only their data. All metadata are kept.

The one drawback of this approach is that it is not possible to HSM copy-in objects to a different OST than where they were originally stored. BUT... in conjunction with the migration tool it should be able to migrate an (empty) object from one OST to another before the copy-in from HSM, so long as there is no OST-specific data in the HSM identifier (i.e.
the HSM label is truly opaque).

> >> If the MDT is tracking purged areas also, then there's a good
> >> potential synergy here with a missing OST --
> >> if the missing OST's objects are marked as purged, then we can
> >> potentially recover them automatically from HSM...
> >
> > What do you call a "missing OST"? A corrupt one? An offline one?
> > Unavailable?
>
> Yes. All of the above. Obviously we need to distinguish between
> "permanently gone" and "temporarily gone".

I suppose this leads to a requirement to store the object in HSM so that it can be accessed just by the object FID+version. That would allow the OST to be restored from HSM even if the entire OST filesystem is lost, potentially modifying the FLDB to relocate the FID to a different OST.

> > Where will you copy back the object data? To another OST object?
>
> Yes. Some kind of recovery will take place to generate a new object on
> a different OST and we can restore the data there.
>
> > With the purged window on each OST object and the MDT, and the file
> > striping info, we could easily restore the missing parts.
>
> Exactly. This is why I say we should think about this now, to allow for
> this possibility.

Right.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
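To illustrate Andreas's point about keeping an arbitrary head and tail rather than rounding to 1MB chunks, here is a small sketch that chooses the single file-level purged window for a file of arbitrary size. The 64kB keep sizes come from the example in the discussion; the names are assumptions.

    /* Hypothetical sketch of choosing the single file-level purged
     * window: keep a small head and tail (enough for "file"/icon
     * readers) and purge everything in between. Handles the
     * "N*1MB + 1 byte" case naturally since nothing is rounded to
     * stripe boundaries. Names and keep sizes are assumptions. */
    #include <stdint.h>

    #define HSM_KEEP_HEAD (64 * 1024)
    #define HSM_KEEP_TAIL (64 * 1024)

    struct purge_range {
        uint64_t start; /* first purged byte          */
        uint64_t end;   /* first byte after the purge */
    };

    /* Returns 0 and fills *pr if the file is worth purging, -1 otherwise. */
    static int choose_purge_range(uint64_t file_size, struct purge_range *pr)
    {
        if (file_size <= HSM_KEEP_HEAD + HSM_KEEP_TAIL)
            return -1; /* too small: purging would save nothing */
        pr->start = HSM_KEEP_HEAD;
        pr->end = file_size - HSM_KEEP_TAIL;
        return 0;
    }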
Hi,

Sorry if these questions duplicate previous debate.

Have I understood correctly that the design allows individual objects within a Lustre file (i.e. stripes?) to be purged independently?

If so, why is this needed?

I would have thought that when you purge a file, you need only record the purged extent as an attribute of the whole Lustre file and punch its stripes to free the space. Am I missing a use case?

--
Cheers,
Eric
Eric Barton wrote:
> Hi,
>
> Sorry if these questions duplicate previous debate.
>
> Have I understood correctly that the design allows individual objects
> within a Lustre file (i.e. stripes?) to be purged independently?
>
> If so, why is this needed?
>
> I would have thought that when you purge a file, you need only record
> the purged extent as an attribute of the whole Lustre file and punch
> its stripes to free the space. Am I missing a use case?

Since the beginning, CFS required this feature. It seems a lab asked for it; I do not know who. Unfortunately we have no use case for what they want to do with this. I'm wondering if their need could not be met with other features, like the internal Lustre migration...

--
Aurelien Degremont
CEA
It is important to note that all comparisons and modifications are done at the Lustre-object level: OST stripe object or MDT file object; each of those objects already has a version field, in the FID. It is this version inside the FID that we will use for all treatments. All purges are always requested for a specific FID.

The mtime is stored only for information, for the users. It is simpler to display to the user:

user$ list_hsm_copies ./foo
Date
===========
Feb  2 2006
Jun 18 2007
Jun 19 2007

than:

user$ list_hsm_copies ./foo
Version
===========
0x0012356
0x001a250
0x001a011

If the user "touched" the file at some time, he knows what he has done. Just the output will be different; internally, we manipulate Lustre FIDs and so we do not care about mtime. So the "version" in the backend is not a problem. We do not rely on the ldiskfs/zfs inode versioning.

Aurelien Degremont

Rick Matthews wrote:
> I'm probably responsible for opening this can of worms. I inferred from
> the HSM HLD that mtime was proposed to be used for state change, or
> version, of the file/object. As the discussion bears out, mtime for this
> purpose would be a bad idea. A reliable way of detecting change is
> needed, and if it already exists within Lustre, great!
>
> What I think is far more significant is the involvement of the community
> on issues such as this. The more folks examining (and critiquing) the
> details, the better. Nice to see such an active community.

--
Aurelien Degremont
CEA
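For readers unfamiliar with Lustre FIDs, here is a sketch of the identifier being discussed and of the version comparison Aurelien describes. The field names follow Lustre's struct lu_fid; treat the comparison helper as illustrative, not the HLD's normative logic.

    /* Sketch of the Lustre FID with its version field, as used for the
     * comparisons described above. Field names follow Lustre's struct
     * lu_fid; the staleness check is an illustrative assumption. */
    #include <stdbool.h>
    #include <stdint.h>

    struct lu_fid {
        uint64_t f_seq; /* sequence: groups of objects           */
        uint32_t f_oid; /* object id within the sequence         */
        uint32_t f_ver; /* version: bumped when the data change  */
    };

    /* An HSM copy is stale if its recorded FID version is older than
     * the live object's version; mtime never enters the comparison. */
    static bool hsm_copy_is_stale(const struct lu_fid *live,
                                  const struct lu_fid *archived)
    {
        return archived->f_ver < live->f_ver;
    }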
On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
> Eric Barton wrote:
> > Sorry if these questions duplicate previous debate.
> >
> > Have I understood correctly that the design allows individual objects
> > within a Lustre file (i.e. stripes?) to be purged independently?
> >
> > If so, why is this needed?
> >
> > I would have thought that when you purge a file, you need only record
> > the purged extent as an attribute of the whole Lustre file and punch
> > its stripes to free the space. Am I missing a use case?
>
> Since the beginning, CFS required this feature. It seems a lab asked for
> it; I do not know who. Unfortunately we have no use case for what they
> want to do with this.
> I'm wondering if their need could not be met with other features, like
> the internal Lustre migration...

That is my understanding also - I believe one of the Labs wanted this (to be able to do HSM on a per-stripe basis instead of a per-file basis).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas,

Is this requirement documented? I'd appreciate any pointers...

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM]
> On Behalf Of Andreas Dilger
> Sent: 12 February 2008 5:23 PM
> To: Aurelien Degremont
> Cc: Eric Barton; lustre-devel at lists.lustre.org
> Subject: Re: [Lustre-devel] Lustre HSM HLD draft
>
> On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
> > Eric Barton wrote:
> > > Sorry if these questions duplicate previous debate.
> > >
> > > Have I understood correctly that the design allows individual objects
> > > within a Lustre file (i.e. stripes?) to be purged independently?
> > >
> > > If so, why is this needed?
> > >
> > > I would have thought that when you purge a file, you need only record
> > > the purged extent as an attribute of the whole Lustre file and punch
> > > its stripes to free the space. Am I missing a use case?
> >
> > Since the beginning, CFS required this feature. It seems a lab asked
> > for it; I do not know who. Unfortunately we have no use case for what
> > they want to do with this.
> > I'm wondering if their need could not be met with other features, like
> > the internal Lustre migration...
>
> That is my understanding also - I believe one of the Labs wanted this
> (to be able to do HSM on a per-stripe basis instead of a per-file basis).
>
> Cheers, Andreas
Andreas Dilger wrote:
> On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
>> Eric Barton wrote:
>>> Sorry if these questions duplicate previous debate.
>>>
>>> Have I understood correctly that the design allows individual objects
>>> within a Lustre file (i.e. stripes?) to be purged independently?
>>>
>>> If so, why is this needed?
>>>
>>> I would have thought that when you purge a file, you need only record
>>> the purged extent as an attribute of the whole Lustre file and punch
>>> its stripes to free the space. Am I missing a use case?
>>
>> Since the beginning, CFS required this feature. It seems a lab asked for
>> it; I do not know who. Unfortunately we have no use case for what they
>> want to do with this.
>> I'm wondering if their need could not be met with other features, like
>> the internal Lustre migration...
>
> That is my understanding also - I believe one of the Labs wanted this
> (to be able to do HSM on a per-stripe basis instead of a per-file basis).

This doesn't make any sense to me. Layouts may change; a stripe on one filesystem may not correspond to a stripe on a replica of the filesystem; exposing stripes to user apps is a bad idea.

I'm going to propose what I think we need (see the sketch after this list):
1. Punch a single, arbitrary byte range from the middle of a file (thus leaving beginning and end for file type, icons, filesize).
2. No other arbitrary punch patterns.
3. The punched range is stored on the MDT alone.
4. Once punched, the OST may forget about any fully-punched stripes it used to hold.
5. Clients must take a layout lock (CR) when they retrieve the layout from the MDT. If the MDT punches from the middle, it revokes the layout lock, and clients must re-enqueue it for further read/write on the file. The MDT is the sole keeper of the layout, and it must be protected by a lock.
6. Client access within a punched range results in an RPC to the MDT. The MDT decides where to put the restored data, organizes the restoration (via the coordinator), and rewrites the layout (under lock, of course). The client gets the new layout, and can contact the appropriate OST.
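A hedged sketch of the client-side flow in points 5 and 6 above. Every function name here is hypothetical, chosen only to show the ordering of layout lock, punched-range check, and restore RPC; none of these are real Lustre APIs.

    /* Hypothetical sketch of the client read path under the proposal
     * above. All names are made up; only the lock/RPC ordering is the
     * point of this sketch. */

    struct layout; /* file layout, including the single punched range */

    extern struct layout *layout_lock_enqueue_cr(int fd);          /* point 5 */
    extern int layout_range_is_punched(const struct layout *lo,
                                       long off, long len);
    extern int mdt_restore_range_rpc(int fd, long off, long len);  /* point 6 */
    extern long ost_read(const struct layout *lo, int fd,
                         void *buf, long off, long len);
    extern void layout_put(struct layout *lo);

    long client_read(int fd, void *buf, long off, long len)
    {
        for (;;) {
            /* CR layout lock: revoked by the MDT if it punches/restores. */
            struct layout *lo = layout_lock_enqueue_cr(fd);

            if (!layout_range_is_punched(lo, off, len)) {
                long rc = ost_read(lo, fd, buf, off, len);
                layout_put(lo);
                return rc;
            }

            /* Ask the MDT to restore; it rewrites the layout under lock,
             * which revokes ours, so loop and re-enqueue for the new one. */
            layout_put(lo);
            if (mdt_restore_range_rpc(fd, off, len) < 0)
                return -1;
        }
    }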
Aurelien and JC,

Sorry that my feedback is late. Here are my questions/remarks.

General
* Any thought on how quotas will be handled?

Coordinator
* 3.4 - I was curious what the precise use case was that was driving this. I don't disagree with it, but I was curious for more background.
* 3.7.1 - The coordinator could become a scaling bottleneck. We should think about how this will be scaled in the future.
* 4.1 - Does the coordinator store the external object id, or does the agent?
* 4.3 item 2 - This looks like the coordinator could become a bottleneck for unlinks and slow down performance. Could this be put in some type of async queue to be processed later (or some type of attic space)?

Use Cases
* 2.3 (Use cases) - I'm really keen on this feature. I think it is very important in order to make small-file performance work well. Unfortunately, it isn't clear how the file list gets communicated to the archive tool. The coordinator and agent seem to only take one file at a time. So how would this work exactly?
* 2.4 - The copy tool should be allowed to preemptively restage files. I think this will work with the design, but we should make sure of this. This would be useful for restaging a whole tar file versus doing things piecemeal.

Part IV
2 EAs - I'm worried that the EA list could get huge for holes.
3.2 - item 3 - Who ensures a file is archived before punches are made?
3.3 - Another use case... The user checks to see if a file has been archived.

Also, someone earlier made the point about the archive tool being able to reorder requests. This is really important, since an archival system wants to know all the files being restaged in order to order tape mounts and reads.

Thanks for taking the lead on this. It looks like there is a lot of interest in it.

--Shane

-----Original Message-----
From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of DEGREMONT Aurelien
Sent: Thursday, February 07, 2008 5:53 AM
To: lustre-devel at lists.lustre.org
Subject: [Lustre-devel] Lustre HSM HLD draft

Hello

Here is a first draft for comments of the Lustre HSM HLD. It is intended to be a support for further analysis and comments from CFS/Sun. The document covers the main parts of the HSM features, but some elements are still lacking. The policy management and the space manager will be described later. Let us know your comments and ideas about it.

Regards,
Aurelien Degremont
CEA
Canon, Richard Shane wrote:
> General
> * Any thought on how quotas will be handled?

That's a very good question. I think this point should be discussed. The purge possibility introduces two values which could be under quota:
1 - File size (current case)
2 - The disk space actually used (migrated files free quota)

The first option is the simplest to implement and will need fewer modifications, but users could not free quota even if all their files are migrated. The second option could help users, but it will be problematic when they copy back some of their files, because this will trigger space issues and purge requests on their other files, and so on.

IMO, the best way is to take choice #1 and possibly add a "real disk use" quota value that could be tuned by admins. I'm not a Lustre quota specialist, and AFAIK this code is a bit touchy.

> Coordinator
> * 3.4 - I was curious what the precise use case was that was driving
> this. I don't disagree with it, but I was curious for more background.

The coordinator is designed to also manage internal Lustre migrations.

> * 3.7.1 - The coordinator could become a scaling bottleneck. We should
> think about how this will be scaled in the future.

I think we should be able to have several coordinators in the future, each of them dealing with different external storages.

> * 4.1 - Does the coordinator store the external object id, or does the
> agent?

The agent does not have a storage device; it stores nothing. The external IDs are on the MDT device.

> * 4.3 item 2 - This looks like the coordinator could become a bottleneck
> for unlinks and slow down performance. Could this be put in some type of
> async queue to be processed later (or some type of attic space)?

Yes, I think unlinks should be handled asynchronously.

> Use Cases
> * 2.3 (Use cases) - I'm really keen on this feature. I think it is very
> important in order to make small-file performance work well.
> Unfortunately, it isn't clear how the file list gets communicated to the
> archive tool. The coordinator and agent seem to only take one file at a
> time. So how would this work exactly?

In fact, we have presently designed the archiving tool, and only it, to support this feature, because the archiving tool could be developed by anyone and we want this API to be as stable as possible. The current Lustre component design does not handle it, but it will be added later, in a second step, and the copy tools developed in the meantime will already be compatible with it.

> * 2.4 - The copy tool should be allowed to preemptively restage files.
> I think this will work with the design, but we should make sure of this.
> This would be useful for restaging a whole tar file versus doing things
> piecemeal.

That's an interesting point. I think we could avoid it, but it is an interesting feature. I must think about how we should modify the design to permit it. (The tool should be able to warn the coordinator: oh, I'm staging this file also! Please note it.)

> 2 EAs - I'm worried that the EA list could get huge for holes.

This part has been redesigned. The data that were stored in EAs have been moved. It will be explained in the new document version.

> 3.2 - item 3 - Who ensures a file is archived before punches are made?

The space manager does. It is the only one which will make punch requests. Maybe the MDT could ensure it before dealing with them.

> 3.3 - Another use case... The user checks to see if a file has been
> archived.

Ok.

> Also, someone earlier made the point about the archive tool being able
> to reorder requests.
This is really important since an archival system > wants to know all the files being restaged in order to order tape mounts > and reads.I do not see any problem with this. I will add this point in the doc.> Thanks for taking the lead on this. It looks like there is a lot of > interest in it.Thanks you for your very interesting comments. -- Aurelien Degremont CEA
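To make the asynchronous unlink handling agreed to above more concrete, here is a minimal sketch of a coordinator-side unlink queue: the unlink path only enqueues a record, and a background worker removes the external objects later. All names here (hsm_unlink_rec, coord_queue_unlink, coord_unlink_worker) are hypothetical; the HLD does not define this interface.

    /* Minimal sketch of an asynchronous unlink queue for the coordinator.
     * All names are hypothetical; the HLD does not specify this interface. */
    #include <pthread.h>
    #include <stdlib.h>

    struct hsm_unlink_rec {
            unsigned long long      ur_fid;      /* FID of the unlinked file */
            struct hsm_unlink_rec  *ur_next;
    };

    static struct hsm_unlink_rec *unlink_head;
    static pthread_mutex_t        unlink_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t         unlink_cond = PTHREAD_COND_INITIALIZER;

    /* Called from the unlink path: enqueue and return immediately, so the
     * MDT never blocks on the external storage. */
    int coord_queue_unlink(unsigned long long fid)
    {
            struct hsm_unlink_rec *rec = malloc(sizeof(*rec));

            if (rec == NULL)
                    return -1;
            rec->ur_fid = fid;
            pthread_mutex_lock(&unlink_lock);
            rec->ur_next = unlink_head;
            unlink_head = rec;
            pthread_cond_signal(&unlink_cond);
            pthread_mutex_unlock(&unlink_lock);
            return 0;
    }

    /* Background thread: drain the queue and ask the archive to remove
     * the corresponding external objects at its own pace. */
    void *coord_unlink_worker(void *arg)
    {
            (void)arg;
            for (;;) {
                    struct hsm_unlink_rec *rec;

                    pthread_mutex_lock(&unlink_lock);
                    while (unlink_head == NULL)
                            pthread_cond_wait(&unlink_cond, &unlink_lock);
                    rec = unlink_head;
                    unlink_head = rec->ur_next;
                    pthread_mutex_unlock(&unlink_lock);
                    /* a call like archive_remove(rec->ur_fid) would go here */
                    free(rec);
            }
            return NULL;
    }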
Hello

I've got several questions about some specific points in the HSM implementation and I would like your opinion about them.

Coordinator:

This element will manage migration externally (HSM) and internally of Lustre (space balancing?). Is the current API acceptable (specific calls for external migration, and other ones for internal migration)? The best way could have been to have a generic call for migration, but we must also have generic objects to describe the migration sources and destinations, and those are not simple. We finally concluded with the API presented in the HLD document. Tell me if this is *really* a bad idea or if only adjustments are needed.

We presented two modes of migration, explicit and implicit. The first results from an administrative request, the second is triggered automatically (by a cache miss, for example). Is that ok? (See the doc for all details.)

Agent:

It seems, to support Lustre internal migration, you have planned to implement specific Agents which will reside on OSTs. HSM will need specific agents on clients. Are those two kinds of agent acceptable? The current API only describes HSM-based agents. Maybe we should think of a generic agent framework and add specialized implementations for ost, hsm, etc.?

--
Aurelien Degremont
CEA
Aurelien Degremont wrote:
> Hello
>
> I've got several questions about some specific points in the HSM
> implementation and I would like your opinion about them.
>
> Coordinator:
>
> This element will manage migration externally (HSM) and internally of
> Lustre (space balancing?). Is the current API acceptable (specific
> calls for external migration, and other ones for internal migration)?

I would like to see a parameter indicating what agent will be used and keep all other parameters the same.

> The best way could have been to have a generic call for migration, but
> we must also have generic objects to describe the migration sources
> and destinations, and those are not simple.

For migration to and from external sources, Lustre must already manage this data in an extended attribute (e.g. to describe the file on tape to which a Lustre file was migrated). This data is opaque to Lustre and can be passed as a blob.

> We finally concluded with the API presented in the HLD document. Tell
> me if this is *really* a bad idea or if only adjustments are needed.

I have not yet looked at these.

> We presented two modes of migration, explicit and implicit. The first
> results from an administrative request, the second is triggered
> automatically (by a cache miss, for example). Is that ok? (See the doc
> for all details.)

Yes, that seems ok.

> Agent:
>
> It seems, to support Lustre internal migration, you have planned to
> implement specific Agents which will reside on OSTs.

To avoid many complications involving locks, we decided that even the agents used for internal migrations will layer on the file system. The Lustre file system will be mounted on the OSTs and it will use the "LOLND" to transport the data efficiently between the OST process and the client file system cache. In the internal case both source and destination lie in Lustre; in the HSM case only one of them does.

As a result I believe these two cases are closer together than you may think, and should be one "type".

The key aspect we/you need to design is what an agent has to make sure happens, for example in terms of locking file extents and in terms of avoiding triggering a recursive cache miss (open by fid with a flag?).

- Peter -

> HSM will need specific agents on clients. Are those two kinds of agent
> acceptable? The current API only describes HSM-based agents. Maybe we
> should think of a generic agent framework and add specialized
> implementations for ost, hsm, etc.?
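Peter's suggestion above, one call signature with a parameter selecting the agent and the external descriptor passed as an opaque blob stored in the file's EA, could look roughly like the sketch below. Every identifier here is an assumption for illustration, not part of the HLD.

    /* Sketch of a single migration entry point with an agent selector
     * and an opaque descriptor, per the suggestion above. All names
     * are hypothetical. */
    #include <stddef.h>

    enum hsm_agent_type {
            AGENT_INTERNAL,         /* OST-to-OST restriping/migration */
            AGENT_EXTERNAL,         /* copy tool talking to an archive */
    };

    struct hsm_migrate_req {
            unsigned long long      mr_fid;      /* source file FID */
            enum hsm_agent_type     mr_agent;    /* which agent to use */
            /* Opaque to Lustre: for external migration this is the
             * archive's own description of the object (e.g. its tape
             * location), stored in and restored from the file's
             * extended attribute as a blob. */
            void                   *mr_desc;
            size_t                  mr_desc_len;
    };

    /* One call for both cases; only mr_agent and the blob differ. */
    int coord_migrate(const struct hsm_migrate_req *req);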
Just a few initial responses from me, I haven't read things systematically yet.

Canon, Richard Shane wrote:
> General
> * Any thoughts on how quotas will be handled?

This is very, very important and will require a lot of detail. Well spotted, Shane!

> Coordinator
> * 3.4 - I was curious what precise use case is driving this. I don't
> disagree with it, but I would like more background.

In internal migrations many objects will be restriped to another set of objects to move the data. The coordinator handles the completion and aborting of the agents accomplishing this.

> * 3.7.1 - The coordinator could become a scaling bottleneck. We should
> think about how it will be scaled in the future.

In my writings I was always anticipating a family of load-balancing coordinators.

> * 4.1 - Does the coordinator store the external object id, or does the
> agent?

The coordinator, I suggest, in view of the fact that many agents may be required to move one file.

> * 4.3 item 2 - This looks like the coordinator could become a
> bottleneck for unlinks and slow down performance. Could this be put in
> some type of async queue to be processed later (or some type of attic
> space)?

I agree with this.

> Part IV
> 2 EAs - I'm worried that the EA list could get huge for holes.

The EA merely points to an extent tree (similar to the allocation extent tree).

> 3.2 item 3 - Who ensures a file is archived before punches are made?

The coordinator.
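Peter's remark that the EA merely points to an extent tree implies the EA itself stays small no matter how many holes a file accumulates. A purely illustrative shape, under assumed names (the HLD defines neither struct):

    /* Illustrative only: the EA holds a fixed-size reference to an
     * extent tree root, so punching many holes never grows the EA. */
    #include <stdint.h>

    struct hsm_extent {                 /* one node of the on-disk extent tree */
            uint64_t        he_start;   /* first byte of the extent */
            uint64_t        he_len;     /* length in bytes */
            uint32_t        he_flags;   /* e.g. resident vs. punched/archived */
    };

    struct hsm_ea {                     /* what is actually stored in the EA */
            uint64_t        ea_tree_root;   /* block address of the tree root */
            uint32_t        ea_tree_depth;
            uint32_t        ea_flags;       /* file-level HSM state */
    };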
Peter J Braam wrote:
>> Coordinator:
>>
>> This element will manage migration externally (HSM) and internally of
>> Lustre (space balancing?). Is the current API acceptable (specific
>> calls for external migration, and other ones for internal migration)?
> I would like to see a parameter indicating what agent will be used and
> keep all other parameters the same.

Agreed.

>> The best way could have been to have a generic call for migration, but
>> we must also have generic objects to describe the migration sources
>> and destinations, and those are not simple.
> For migration to and from external sources, Lustre must already manage
> this data in an extended attribute (e.g. to describe the file on tape
> to which a Lustre file was migrated). This data is opaque to Lustre and
> can be passed as a blob.

>> It seems, to support Lustre internal migration, you have planned to
>> implement specific Agents which will reside on OSTs.
> To avoid many complications involving locks, we decided that even the
> agents used for internal migrations will layer on the file system. The
> Lustre file system will be mounted on the OSTs and it will use the
> "LOLND" to transport the data efficiently between the OST process and
> the client file system cache. In the internal case both source and
> destination lie in Lustre; in the HSM case only one of them does.
>
> As a result I believe these two cases are closer together than you may
> think, and should be one "type".

If we unify the API, we must have a way to request data movements like:

    copy elemA to placeP
    copy elemA, stored in placeP, back into Lustre
    copy elemA to placeC
    move elemA into elemB

The elem could be unified using a Lustre FID, but the place could be an external storage or a precise OST. If we want a unified API, the API calls should manipulate a generic object which could describe either a Lustre storage element (OST) or an external storage (HSM, ...), i.e.:

    struct storage_place {
            ...
    };

    copy(fid, struct storage_place *);
    move(fid, struct storage_place *);

with some specific cases to handle. The other possibility:

    ext_copyout(fid, external storage);
    ext_copyin(fid, external object);
    int_copy(fid, fid, ost);
    int_move(fid, fid, ost);

I think this one, even if the design is not the most beautiful, is the easiest. Unless we want to create new generic objects to manipulate Lustre object data and generic storage areas, the second option is the best one IMO.

--
Aurelien Degremont
CEA
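For comparison, the generic storage_place that Aurelien sketches above might amount to a tagged union covering both a precise OST and an opaque external descriptor. This is a hypothetical layout under assumed names; the HLD defines neither variant in detail.

    /* A possible shape for the generic storage_place discussed above:
     * a tagged union covering both an OST index and an external
     * archive descriptor. All names are hypothetical. */
    #include <stddef.h>

    enum place_type {
            PLACE_OST,              /* a precise OST inside Lustre */
            PLACE_EXTERNAL,         /* an external storage (HSM, ...) */
    };

    struct storage_place {
            enum place_type  sp_type;
            union {
                    unsigned int     sp_ost_idx;   /* PLACE_OST */
                    struct {                       /* PLACE_EXTERNAL */
                            const void *sp_desc;     /* opaque archive blob */
                            size_t      sp_desc_len;
                    } sp_ext;
            } u;
    };

    /* The unified calls then take a FID plus one generic place. */
    int hsm_copy(unsigned long long fid, const struct storage_place *dst);
    int hsm_move(unsigned long long fid, const struct storage_place *dst);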
The discussion below about the APIs is a standard element of data abstraction taught in advanced programming courses (see e.g. Abelson et al., Structure and Interpretation of Computer Programs (SICP)). From this one concludes that the coordinator and agents will use abstract data types and call abstract methods that accommodate multiple:

- source and destination descriptors for the data
- data movers implementing the methods to move the data

If you proceed along the lines you outline, you will get a big matrix of movers and data types to keep track of. If you follow my approach, you will encapsulate things much more cleanly. Think in terms of virtual classes of data movers acting on source and destination objects.

- peter -
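Peter's virtual classes of data movers can be expressed in C as a shared operations table, so the coordinator dispatches through one interface instead of a matrix of mover/type combinations. Again a sketch under assumed names, not the HLD's API.

    /* Sketch of the abstraction described above: every data mover
     * (internal OST-to-OST, external HSM, ...) implements the same
     * operations table over abstract source/destination descriptors.
     * Hypothetical names throughout. */
    struct data_source;             /* abstract source descriptor */
    struct data_sink;               /* abstract destination descriptor */

    struct data_mover_ops {
            int (*dm_copy)(struct data_source *src, struct data_sink *dst);
            int (*dm_move)(struct data_source *src, struct data_sink *dst);
            int (*dm_abort)(struct data_source *src, struct data_sink *dst);
    };

    struct data_mover {
            const char                   *dm_name;   /* "ost", "hsm", ... */
            const struct data_mover_ops  *dm_ops;
    };

    /* The coordinator dispatches through the table, never on the type: */
    static inline int mover_copy(const struct data_mover *m,
                                 struct data_source *src,
                                 struct data_sink *dst)
    {
            return m->dm_ops->dm_copy(src, dst);
    }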