HSM Metadata

For the Lustre HSM project we will need to store, for each file, a list of information describing how many copies the file has in the HSM, their HSM IDs, the copy dates, and so on. This data could easily reach 500 bytes (I think we will need between 40 and 50 bytes per HSM copy, and we should be able to save at least 10 copies, probably more). The question is: where could we store this data on the MDT, in which place (an EA?), and how do we manage it?

We had a discussion about this with Andreas and Nathan and it is not clear what the best solution is, considering:
- There are two available backends for the MDT, ldiskfs and ZFS, and both must be supported here.
- EA space is not very large on ldiskfs and is already used by several other features (striping, ACLs, ...).
- Clients will need to read this data, so the RPC mechanism must be available and large enough to handle it.

Moreover, we will store a purged data range on the OST and MDT. This could easily fit in an EA.

What are the possible solutions here?

-- Aurelien Degremont CEA
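For scale, a minimal sketch of one per-copy record such an EA could hold; every name and size below is an assumption for illustration, not an existing Lustre structure. It lands in the 40-50 bytes per copy range mentioned above, so ten copies plus a small header approach the estimated 500 bytes.

#include <stdint.h>

#define HSM_ID_LEN 24                        /* opaque identifier inside the HSM */

struct hsm_copy_record {                     /* one archived copy of the file */
        uint64_t hcr_archive_id;             /* which archive/HSM instance */
        uint64_t hcr_copy_time;              /* copy date, seconds since epoch */
        uint32_t hcr_flags;                  /* e.g. complete, dirty, released */
        uint8_t  hcr_hsm_id[HSM_ID_LEN];     /* HSM-side object identifier */
};                                           /* 44 bytes packed, ~48 aligned */

struct hsm_file_md {                         /* whole EA value on the MDT inode */
        uint16_t hfm_count;                  /* number of copies that follow */
        uint16_t hfm_flags;
        struct hsm_copy_record hfm_copies[]; /* 10 copies is roughly 480 bytes */
};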
I do understand that we need HSM-related metadata, but I learned more from Rick Matthews (cc'd), who is the architect of Sun's ADM project. Now I am not sure I am in agreement with what has been discussed so far.

If there is more than one copy in the archive, it would be preferable if the archive could maintain a mapping from the Lustre FID of the file to the archived copies. Associated with the FID of the data would then be a list of archived copies, timestamps, etc. Can that be done in HPSS?

If not, policy-related operations like purging older files will become very complex and not scalable. For example, a search to find older files in the archive would require an e2scan operation to find the inodes and then the objects in the archive. If the file system were not available anymore (for whatever reason), it is not even clear that such a purge could still happen. With an archive-based database this can be an indexed search in the archive, which is faster and more appropriate.

Clearly this has a major impact on how much attribute space we need.

Thoughts?

Peter
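To make the proposal concrete, one way the archive-side index could be keyed is sketched below. The record layout and names are assumptions (the real store could be any indexed database); only the idea of keying on the Lustre FID comes from the proposal above.

#include <stdint.h>
#include <time.h>

struct fid_key {                          /* Lustre FID used as the primary key */
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

struct archive_copy_entry {               /* one row per archived copy */
        struct fid_key fid;               /* which Lustre file this copy belongs to */
        uint32_t       version;           /* archive-side version number */
        time_t         archived_at;       /* timestamp of the copy */
        char           hsm_object_id[64]; /* e.g. an HPSS path or handle */
};

/* "All copies of FID X" or "all copies older than T" then become indexed
 * queries on this table instead of an e2scan of the MDT inodes. */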
Peter Braam wrote:
> If there is more than one copy in the archive, it would be preferable if the archive could maintain a mapping from the Lustre fid of the file to the archived copies. Associated with the FID of the data would then be a list of archived copies, timestamps etc.

Do you mean that the HSM will be aware of the various versions of one same file, identified in Lustre by a FID? Or will this be masked by the archiving tool, doing some tricks to simulate it?

> Can that be done in HPSS?

HPSS alone cannot do versioning on its files presently.

> If not, policy related operations like purging older files etc will become very complex and not scalable. For example, a search to find older files in the archive would require an e2scan operation to find the inodes and then the objects in the archive. If the file system was not available anymore (for whatever reason), it is not even clear that such a purge could still happen.
>
> With an archive based database this can be an indexed search in the archive, which is faster and more appropriate.

By purging, do you mean purging in Lustre or in the HSM? There is no issue with purging in Lustre because this does not involve the HSM. And removal of the oldest copies in the HSM could be done asynchronously, slowly.

I'm not sure I see what you mean here.

-- Aurelien Degremont CEA
On Jul 04, 2008 16:37 +0200, Aurelien Degremont wrote:
> Do you mean that the HSM will be aware of the various versions of one same file, identified in Lustre by a FID? Or will this be masked by the archiving tool, doing some tricks to simulate it?
>
> > Can that be done in HPSS?
>
> HPSS alone cannot do versioning on its files presently.

When HPSS acts as both backup and HSM, is it still dependent on an external space/backup manager to track all of the files for the filesystem, or does it have a space manager built into it?

> > If not, policy related operations like purging older files etc will become very complex and not scalable. [...]
>
> By purging, do you mean purging in Lustre or in the HSM?

Purging old backups of the file in the offline storage (it isn't quite right to call this the HSM at this point, because there are multiple backup copies of the file, not strictly a hierarchy).

> There is no issue with purging in Lustre because this does not involve the HSM. And removal of the oldest copies in the HSM could be done asynchronously, slowly.

What manages removal of the older copies in HPSS? If HPSS can purge older files based on policy (always leaving at least the most recent copy), then it would be possible to defer the backup policy to HPSS and Lustre would only ever need to reference a single offline file. Any queries for listing older versions of the file would be passed on from Lustre to HPSS in that case.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On 7/5/08 10:50 AM, "Andreas Dilger" <adilger at sun.com> wrote:
> What manages removal of the older copies in HPSS? If HPSS can purge older files based on policy (always leaving at least the most recent copy), then it would be possible to defer the backup policy to HPSS and Lustre would only ever need to reference a single offline file. Any queries for listing older versions of the file would be passed on from Lustre to HPSS in that case.

The point is that there is a proposal here to have multiple pointers PER INODE - that is not a good idea. Regardless of what policies are in place and where they are managed, they should not affect every inode with a new pointer for every backup object of that inode.

Peter
On 7/4/08 8:37 AM, "Aurelien Degremont" <aurelien.degremont at cea.fr> wrote:
> Do you mean that the HSM will be aware of the various versions of one same file, identified in Lustre by a FID? Or will this be masked by the archiving tool, doing some tricks to simulate it?
>
> HPSS alone cannot do versioning on its files presently.

But your archiving utility that copies from Lustre to HPSS can maintain a database of these objects - no need to store anything in Lustre.

> By purging, do you mean purging in Lustre or in the HSM?

The HSM.

> There is no issue with purging in Lustre because this does not involve the HSM. And removal of the oldest copies in the HSM could be done asynchronously, slowly.

There is a rule in Lustre - no scanning, ever. This rule will not be broken by HSM.

So, you have to move your management of IDs of the archived copies outside of Lustre, into some database. This will actually save you time - doing this in the MDS will be no fun.

The MDS should only get attributes to indicate if and what version of a file is in the archive, and a cursor (maybe other information) in relation with ongoing restores.

Peter
Are you all talking about HSM, really, or simply backup?

If backup, read no further.

If HSM, then do you intend that the user be allowed to specify *which* version of the file content is desired?

If yes, and you also want the standard API and utilities to function seamlessly, then the version must be exposed in the name space, no? I.e. for any file named "foo" with 3 versions, for instance, there would be foo;1, foo;2, foo;3, and "foo", which is an alias for "foo;1".

If no, then you'll have to craft a special API that will motivate special tools. However, HPSS already has this API and set of tools, so what's the point? Wouldn't it be better to just modify HPSS to understand versions?

If HSM, then do you intend that two users might be allowed to work with two, or more, versions of the file content simultaneously?

If yes, then you have the same problem as above, since those two versions might need to be in the same directory at the same time, right?

No matter what you do, you have problems that can't be resolved when mixing a POSIX name space with file versions, I believe. Since POSIX reserves no characters, you can't pick a scheme that includes version information in the name without at least being confusing, and the API provides no other way to specify the version, no?

My personal choice would be to shy off direct version support by the native file system. It doesn't seem to have a reasonable solution without involving the user somehow to specify names or naming schemes. That kind of involvement just begs for a special utility and, once there, relieves the file system of the need to support any but the most recent version itself, anyway.

--Lee
Lee - Thank you for this clear explanation.

If solely the HSM can store multiple versions, we already have some difficulties. One might imagine setting a particular version in the HSM as the primary one, meaning that this primary one will be transparently restored or that a pre-staging utility will select it by default.

If the file is fully absent from the file system, staging or restoring it will work correctly. However, if a part of the file remains in the file system, this HSM versioning becomes complicated because the file will again have to remember which HSM versions the fragments belong to, and we are almost back where we were.

I think the emails so far make it clear that we don't want to have one Lustre inode be associated with multiple objects in the HSM.

If the HSM system is used as a backup, then the restore operations will have user or operator involvement and this objection to storing multiple versions in the HSM does not apply. However, we still don't want to store a pointer to each version in the file system; that belongs in the HSM/backup metadata store.

However, I don't want to end the discussion right here.

With the DMU (or otherwise) we will get file systems where snapshots become possible and common, and these snapshots will contain different versions of the same file. The way the namespace distinguishes these is that in the pair (fsid, fid) the fsid is different for each snapshot. So probably the id in the HSM should allow for an fsid component.

Now DMU snapshot versions of one inode share blocks, and this leads to the question if/how we can efficiently share blocks in the HSM also. This discussion would probably equally apply to the upcoming "dedup" efforts for the DMU, which the virtualization and "email attachment" communities think are very important.

Rick, Jeff - how will we handle this?

Peter
Lee Ward wrote:
> If HSM, then do you intend that the user be allowed to specify *which* version of the file content is desired?

A user could say: "overwrite the current version of this file with this older copy, which was made some time ago."

- The current file content is lost.
- That is the only way to access the older copy's content.

There are no namespace tricks, no huge API changes, always one version of a file in Lustre, just a few functions added to the 'lfs' command.

The purpose is just, using the HSM infrastructure, to add a few features to help the people asking us for backup features, but this will not be a true backup system. That kind of utility requires much more development.

-- Aurelien Degremont CEA
Peter and Lee,

Lee, you are correct when pointing out that versioning of a file with a backup copy is a backup-style function. It is one desirable to users of backup and some HSM products, but still primarily driven by the "coincidence" that older copies remain, and reference to them may be desired. (In most instances, these references are used to either "restore this copy to that directory" or "restore this directory tree to its prior state".) So, while primarily a backup function, it is one that may be important in the future if an HSM is the basis for backup copies.

HSM as the basis of backup copies is a desirable trait IMHO. The HSM is already retaining an instance of the file, one which could easily be captured as a "backup" copy.

That said, HSM and snapshot seem to bring a better mix, particularly to the user. A snapshot of the file system presents a consistent view, and the backup would only need to include data (metadata) from files previously resident in the HSM.

As for HSM and deduplication, I see deduplication as an optimization traded off against consumed space. On relatively expensive random-access media (like disk), deduplication provides a reduced total data footprint while not affecting the retrieval rate significantly. When the media is sequentially oriented and relatively less expensive (like tape), deduplication seems to make less sense. So, I see deduplication as important for disk-based archive copies, and not all that useful in tape archiving. Of course, tape striping is important, but it is still a sequential store/retrieve. Also, if it is convenient to deduplicate full sequential images of a file (while not violating a number-of-copies policy), that should be done on the sequentially oriented media. There may also be some policy (sequential affinity) reasons why even full-image deduplication is not desirable.

Thank you for letting me participate in this discussion.

-- Rick
I think we have come to the following conclusions:

1. The HSM, or a database associated with it, implements a table to map FIDs to stored HSM versions of a file, along with other metadata it may need to maintain its archives.
2. An HSM utility can query and learn about the versions stored for a FID (or file name). A "restore" function can copy any version out of the HSM and place it in the file system. This is similar to restoring a file from a backup archive.
3. The file system only has attributes to indicate the state of the primary archived copy (probably the last fully archived copy of the file), and can retrieve that file on demand (without user intervention).
4. The HSM database will allow files in snapshots to be encoded with (fsid, fid) or something similar.
5. For now we ignore block-level dedup in the HSM.

Can the owner of the HLD make updates? Please also read on - I have some more questions below.

On 7/8/08 2:52 AM, "Aurelien Degremont" <aurelien.degremont at cea.fr> wrote:
> A user could say: "overwrite the current version of this file with this older copy, which was made some time ago."
> - The current file content is lost.
> - That is the only way to access the older copy's content.

Yes, that is reasonable.

> There are no namespace tricks, no huge API changes, always one version of a file in Lustre, just a few functions added to the 'lfs' command.

NO - this will not be an lfs command. This is an HSM command.

> The purpose is just, using the HSM infrastructure, to add a few features to help the people asking us for backup features, but this will not be a true backup system. That kind of utility requires much more development.

I think it would be good to review one more time the following aspects of the design:

1. How is a bare-metal restore arranged (i.e. how is metadata moved into the HSM)? Can this restore put files in a file system other than Lustre?
2. How are small files grouped and then "tar'd up", and how are we setting the attributes of the inodes of the files that have been placed in the HSM after this? How does the index entry for the FIDs in the HSM database function?
3. How are multiple coordinators and agents utilized to distribute load so that the HSM can keep up with massive small-file creation?

For all of these we have seen sketchy answers in the past; let's dig in and make sure that we have this right.

Regards, Peter
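Against conclusion 3, a sketch of the small fixed-size attribute Lustre itself would keep per file; the flag names, field names and the purge-window encoding are all assumptions for illustration, not an agreed format.

#include <stdint.h>

#define HS_EXISTS   0x01        /* a primary copy exists in the archive */
#define HS_DIRTY    0x02        /* file modified since that copy was made */
#define HS_RELEASED 0x04        /* data purged from Lustre, archive copy only */

struct hsm_state_ea {
        uint32_t hs_flags;      /* bit field using the values above */
        uint32_t hs_archive_id; /* which archive holds the primary copy */
        uint64_t hs_purge_start;/* purged byte range kept on MDT/OST so */
        uint64_t hs_purge_end;  /*   a read inside it raises a cache miss */
};                              /* 24 bytes, comfortably inside one EA */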
Peter Braam wrote:
> 1. The HSM, or a database associated with it, implements a table to map FIDs to stored HSM versions of a file, along with other metadata it may need to maintain its archives.

Ok.

> 2. An HSM utility can query and learn about the versions stored for a FID (or file name). A "restore" function can copy any version out of the HSM and place it in the file system. This is similar to restoring a file from a backup archive.

Ok, that's copy-in.

> 3. The file system only has attributes to indicate the state of the primary archived copy (probably the last fully archived copy of the file), and can retrieve that file on demand (without user intervention).

Ok. We still need to store the purge window on the MDT and OST to raise cache misses.

How will Lustre update this information if the user can use an HSM command directly, bypassing Lustre? He can change the file copies present in the HSM without Lustre knowing it.

> 4. The HSM database will allow files in snapshots to be encoded with (fsid, fid) or something similar.

Can we consider there is always a default snapshot? Will the ID always be FSID+FID? Or should we consider a special case when snapshotting is not enabled?

> > There are no namespace tricks, no huge API changes, always one version of a file in Lustre, just a few functions added to the 'lfs' command.
>
> NO - this will not be an lfs command. This is an HSM command.

Could you present a use case of how a user will explicitly make backups and restore an older copy using the HSM command and no Lustre component? Doing this, the client nodes would need to be able to communicate with the HSM infrastructure, using specific network protocols, and so on. You would need to set up your Lustre network and then your HSM network, even if the HSM just needs to talk to the Lustre agent.

> 1. How is a bare-metal restore arranged (i.e. how is metadata moved into the HSM)? Can this restore put files in a file system other than Lustre?

Until now, the metadata were stored inside Lustre, so this was not needed. Now, we must add a way for the archiving tool to "setattr" this data when restoring a file.

About a different filesystem, this will depend on the features used by the archiving tool to copy back the data and metadata. If those are standard, the file could be put in a different filesystem.

> 2. How are small files grouped and then "tar'd up", and how are we setting the attributes of the inodes of the files that have been placed in the HSM after this? How does the index entry for the FIDs in the HSM database function?

Presently, only the archiving tool was supposed to support such a feature, to avoid having to recode it later (various tools will be needed for the various existing HSMs and their development won't be centralized) when we add this kind of feature. There is no defined mechanism for grouping files into the HSM presently.

> 3. How are multiple coordinators and agents utilized to distribute load so that the HSM can keep up with massive small-file creation?

One coordinator per MDT. The coordinator deals only with its MDT's files. The coordinator dispatches the requests to the agents round-robin. Agents can refuse requests if they cannot handle them (too busy); the coordinator then tries another one. If no agents are available, it postpones the request.

-- Aurelien Degremont CEA
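A toy rendering of the dispatch loop described in the last answer (agents may refuse when busy, and the request is postponed when all refuse); every name here is invented for the example, not an existing Lustre interface.

#include <stdbool.h>
#include <stddef.h>

struct hsm_agent;                       /* opaque for this sketch */
struct hsm_request;

/* assumed helper: ask one agent to take the request; it may refuse if busy */
bool agent_try_send(struct hsm_agent *a, struct hsm_request *req);

/* Returns true if some agent accepted the request, false to postpone it. */
bool coordinator_dispatch(struct hsm_agent **agents, size_t n_agents,
                          size_t *next, struct hsm_request *req)
{
        for (size_t tried = 0; tried < n_agents; tried++) {
                struct hsm_agent *a = agents[*next];

                *next = (*next + 1) % n_agents;   /* round-robin cursor */
                if (agent_try_send(a, req))
                        return true;
        }
        return false;                             /* no agent free: requeue later */
}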
On 7/9/08 7:25 AM, "Aurelien Degremont" <aurelien.degremont at cea.fr> wrote:
> Ok. We still need to store the purge window on the MDT and OST to raise cache misses.

Yes.

> How will Lustre update this information if the user can use an HSM command directly, bypassing Lustre? He can change the file copies present in the HSM without Lustre knowing it.

NO - we said that the only operation we do is placing an entire file into Lustre.

> Can we consider there is always a default snapshot? Will the ID always be FSID+FID? Or should we consider a special case when snapshotting is not enabled?

Why would you? You need to make sure that the index field is large enough. Almost all our customers have more than one file system anyway, regardless of snapshots.

> Could you present a use case of how a user will explicitly make backups and restore an older copy using the HSM command and no Lustre component?

Hsm_copy_to_fs <FID> /mnt/lustre/braams_lost_file

> Doing this, the client nodes would need to be able to communicate with the HSM infrastructure, using specific network protocols, and so on. You would need to set up your Lustre network and then your HSM network, even if the HSM just needs to talk to the Lustre agent.

The utility for restore is not essentially different from what the agent invokes as a mover.

> Until now, the metadata were stored inside Lustre, so this was not needed. Now, we must add a way for the archiving tool to "setattr" this data when restoring a file.
>
> About a different filesystem, this will depend on the features used by the archiving tool to copy back the data and metadata. If those are standard, the file could be put in a different filesystem.

Hmm. This description has no content. If you don't want to do this, say so, or describe the entire process in detail.

> Presently, only the archiving tool was supposed to support such a feature,

HOW?

> to avoid having to recode it later (various tools will be needed for the various existing HSMs and their development won't be centralized) when we add this kind of feature. There is no defined mechanism for grouping files into the HSM presently.

You need to describe this in detail - so far you are just repeating my questions and presenting them as answers. What events are generated for small files? How are they grouped into something that is "tarred up"? What happens to all the individual inodes when the tarball hits the HSM?

> One coordinator per MDT.

No - these must be independent considerations. A coordinator may be much slower than an MDS node in handling a single file. I say this because this has been the experience in the industry so far - with small files the HSM cannot keep up at all.

> The coordinator deals only with its MDT's files. The coordinator dispatches the requests to the agents round-robin.

No, I think a more sophisticated policy is needed. E.g. small files to these agents, big files to others.

> Agents can refuse requests if they cannot handle them (too busy).

NO NO NO.

> The coordinator then tries another one. If no agents are available, it postpones the request.

Please take time to respond with details.

Thanks.

Peter
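One way to read the objection above is that the coordinator, not the agent, should decide, and that it should class requests, for example by size. The split and threshold below are purely illustrative assumptions, not a proposed interface.

#include <stddef.h>
#include <stdint.h>

struct hsm_agent;

struct agent_pool {
        struct hsm_agent **small_agents;   /* tuned for many tiny files */
        size_t             n_small;
        struct hsm_agent **large_agents;   /* tuned for streaming I/O */
        size_t             n_large;
};

#define SMALL_FILE_LIMIT (1024 * 1024)     /* hypothetical; would be configurable */

struct hsm_agent *pick_agent(struct agent_pool *p, uint64_t file_size,
                             size_t *small_next, size_t *large_next)
{
        if (file_size < SMALL_FILE_LIMIT && p->n_small > 0)
                return p->small_agents[(*small_next)++ % p->n_small];
        if (p->n_large > 0)
                return p->large_agents[(*large_next)++ % p->n_large];
        return NULL;                       /* nothing available: postpone */
}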
Jacques-Charles Lafoucriere, 2008-Jul-11 14:31 UTC:
Peter Braam wrote:
> There is a rule in Lustre - no scanning, ever. This rule will not be broken by HSM.

If a site wants to connect an HSM to an existing file system (close to full), how do we do that without fast scanning? We cannot restrict HSM binding to new file systems only.

JC
Jacques-Charles Lafoucriere, 2008-Jul-11 14:32 UTC:
Peter Braam wrote:
> > One coordinator per MDT.
>
> No - these must be independent considerations. A coordinator may be much slower than an MDS node in handling a single file. I say this because this has been the experience in the industry so far - with small files the HSM cannot keep up at all.

Why should a coordinator be on a slower node than an MDS? The coordinator is a Lustre service like the other Lustre services, so it will run on appropriate hardware.

Do you mean a coordinator is not part of the Lustre cluster?

JC
Jacques-Charles Lafoucriere, 2008-Jul-11 14:37 UTC:
Hello,

Following the latest discussions, I understand a large change is coming in the Lustre/HSM interactions. In the HLD, the HSM follows Lustre requests:
- Lustre triggers copy-out and copy-in.
- All copy requests are made under the coordinator's control (so the Hsm_copy_to_fs command is a command-line interface to the coordinator). Note that Hsm_copy_to_fs is different from the copy tool.

The central role of the coordinator allows us to control all the requests and avoid duplicate requests to the copy tool (and gives a global view).

Now it seems Lustre will have to be able to follow HSM requests to send files back into Lustre, independently of the coordinator (in a previous email it was requested that Hsm_copy_to_fs trigger a copy independently of Lustre). I do not agree with this change because the HSM has to be seen as backend storage for Lustre, and the decisions to copy have to be in Lustre. Lustre must not be driven by the HSM; it must use it.

To manage this new requirement, the copy tool would have to implement a central entity that will:
- avoid duplicate requests
- choose which agent has to make the copy

This would have to be duplicated for each HSM (or backend) supported and would also duplicate the coordinator's role, so I think it is better to have it in Lustre instead of in the copy tool.

About file grouping, the planned features are:
- For copy-out: in one request a list of files can be provided to the copy tool, so it can choose to group them into one HSM "group archive".
- For copy-in: if a file is in an HSM "group archive", the copy tool will copy back only this file into Lustre (and not the whole archive file).
- A grouped request can come from a user request or from the space manager (for a generic policy).

The space manager design is on stand-by today because of the lack of information on changelogs, feeds, and the Lustre policy engine. I think there is a very strong need in Lustre for a generic policy database that can be used to allocate files, copy out files, purge files, choose which agent will copy files, and so on. One use case for this database is to provide users an interface to say: I want all my *.avi files to be striped on 6 OSTs and all other files to be not striped.

JC
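For the copy-out grouping described above, a sketch of what a batched request handed to the copy tool could carry. The structure and names are assumptions; whether a later cache miss restores one member or the whole group archive is exactly the point still being debated in this thread.

#include <stdint.h>

struct fid_key { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

struct hsm_copyout_request {
        uint64_t       cr_request_id;   /* issued by the coordinator */
        uint32_t       cr_flags;        /* e.g. "grouping allowed" */
        uint32_t       cr_fid_count;    /* number of files in this batch */
        struct fid_key cr_fids[];       /* copy tool may bundle these into one
                                         * HSM "group archive" (tar-like) */
};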
A one-time event to build a database or log is acceptable. However, the proposal so far would require frequent scans because the metadata would be in the wrong place; that is what I want to avoid.

Maintaining that metadata is very simple, and can solve some problems that you cannot easily solve without maintaining a database in conjunction with the HSM (such as cleaning up partial, aborted copies into the HSM). You'd simply ask the moving script to contact a database and make a record. The database may not be able to keep up with all file creations in the FS, but it can probably keep up with the activities that one moving node does to/from the HSM.

Peter
On 7/11/08 8:32 AM, "Jacques-Charles Lafoucriere" <jc.lafoucriere at cea.fr> wrote:
> Why should a coordinator be on a slower node than an MDS? The coordinator is a Lustre service like the other Lustre services, so it will run on appropriate hardware.
>
> Do you mean a coordinator is not part of the Lustre cluster?

I see no reason whatsoever to couple them to MDTs. Keeping them de-coupled is more flexible and can scale better. Am I missing issues here?

As for performance: a coordinator may have a lot of work to do to track which files still need to be handled by agents and which are in progress already, and may have a fair amount of interaction with agents (not with the HSM-to-tape path, but when a coordinator is handling a re-striping migration it will). But its interaction with the MDTs is very limited. So we could degrade the performance of an MDS by placing this on the same node.

Let's keep this flexible please.

Peter
On 7/11/08 8:37 AM, "Jacques-Charles Lafoucriere" <jc.lafoucriere at cea.fr> wrote:
> In the HLD, the HSM follows Lustre requests:
> - Lustre triggers copy-out and copy-in.

We can also feed a list to the coordinator to pre-stage ("primary versions").

> - All copy requests are made under the coordinator's control (so the Hsm_copy_to_fs command is a command-line interface to the coordinator).

No. I think we need to design this, but it will simply create a new file in the file system; that doesn't require a coordinator.

> Note that Hsm_copy_to_fs is different from the copy tool. The central role of the coordinator allows us to control all the requests and avoid duplicate requests to the copy tool (and gives a global view).

The coordinator will do copy-in from the kernel, automatically triggered by a cache miss or by feeding it a list to restore HSM primary copies. The hsm_copy_to_fs tool is ONLY needed when secondary copies held in the HSM are being restored.

> Now it seems Lustre will have to be able to follow HSM requests to send files back into Lustre, independently of the coordinator (in a previous email it was requested that Hsm_copy_to_fs trigger a copy independently of Lustre). I do not agree with this change because the HSM has to be seen as backend storage for Lustre, and the decisions to copy have to be in Lustre. Lustre must not be driven by the HSM; it must use it.

No - I think you perhaps misunderstood the proposal.

> To manage this new requirement, the copy tool would have to implement a central entity that will:
> - avoid duplicate requests
> - choose which agent has to make the copy
> This would have to be duplicated for each HSM (or backend) supported and would also duplicate the coordinator's role, so I think it is better to have it in Lustre instead of in the copy tool.

No.

> About file grouping, the planned features are:
> - For copy-out: in one request a list of files can be provided to the copy tool, so it can choose to group them into one HSM "group archive".

This again is just rephrasing my question. HOW is a list of small files formed?

> - For copy-in: if a file is in an HSM "group archive", the copy tool will copy back only this file into Lustre (and not the whole archive file).

No, because with your proposal restoring 1000 small files will cause 1000 tape actions to get the archive. I think the entire archive should come back in one blow.

> - A grouped request can come from a user request or from the space manager (for a generic policy).

Yes, but how is the metadata handled? This is the case where the HSM DB does see significant load to make a mapping for each Lustre fid to the archived file.

> The space manager design is on stand-by today because of the lack of information on changelogs, feeds, and the Lustre policy engine.

We will get there shortly.

> I think there is a very strong need in Lustre for a generic policy database that can be used to allocate files, copy out files, purge files, choose which agent will copy files, and so on.

A policy database yes, but NOT a database with HSM-related data. There is a secondary side of policy, which is how to treat the data held in the HSM; that side doesn't belong in Lustre, but in user space. Yet these should be two parts of one policy management interface. If Lustre runs with Sun's HSM it would be one tool; if Lustre runs with HPSS the tool would have two sides - one to Lustre from Sun and one to HPSS from the HPSS community.

> One use case for this database is to provide users an interface to say: I want all my *.avi files to be striped on 6 OSTs and all other files to be not striped.

That is not an HSM policy; that should be a pool data placement policy, and I agree we need it. The HSM should have sufficient metadata to restore files in this manner if a bare-metal restore takes place.

The good news is that I see no serious disagreements, just some minor misunderstandings. Agree?

Happy quatorze juillet!!

Regards, Peter
Jacques-Charles Lafoucriere, 2008-Jul-16 10:26 UTC:
The space manager needs to:
1) generate a candidate list for copy-out (pre-migration)
2) generate a candidate list for purge

For 1) the criterion is: not up to date in the HSM and not recently modified.
For 2) the criterion is: up to date in the HSM and not recently accessed.

The needed changelog events are "modifications" like:
- file creation
- mtime change
- atime change

The things I do not like in event mode are:
- If a file is created, filled and removed before copy-out (like a temporary file), we will have useless interaction with the space manager (and useless load).
- If, for some reason, events are missed, we will have files in Lustre that the HSM does not know about. To resolve this we can use a scan, or find a way to guarantee we will never miss an event. This last point is a strong constraint because Lustre should be able to operate with a dead space manager.

I agree, I am not fond of scanning, but a low-priority, background scan would solve these two issues.

For me the space manager and its DB are common to all HSMs and will have no HSM-specific information. HSM-specific rules (like which HSM internal storage class a file goes into) will be managed by the HSM copy tool. Do you agree?

JC
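The two candidate lists reduce to simple predicates over per-file state that a changelog consumer keeps up to date. The field names and the quiescence window below are assumptions (the 20-minute figure only echoes the typical archiving delay mentioned in the following reply).

#include <stdbool.h>
#include <time.h>

#define QUIESCENCE_SECS (20 * 60)       /* "not recently", illustrative value */

struct sm_file_state {
        bool   hsm_copy_up_to_date;     /* archived copy matches current data */
        time_t last_modified;           /* maintained from mtime-change events */
        time_t last_accessed;           /* maintained from atime-change events */
};

/* 1) copy-out (pre-migration) candidate */
bool copyout_candidate(const struct sm_file_state *s, time_t now)
{
        return !s->hsm_copy_up_to_date &&
               now - s->last_modified > QUIESCENCE_SECS;
}

/* 2) purge candidate */
bool purge_candidate(const struct sm_file_state *s, time_t now)
{
        return s->hsm_copy_up_to_date &&
               now - s->last_accessed > QUIESCENCE_SECS;
}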
This continues the Lustre design discussion for HSM.

On 7/16/08 4:26 AM, "Jacques-Charles Lafoucriere" <jc.lafoucriere at cea.fr> wrote:
> The space manager needs to:
> 1) generate a candidate list for copy-out (pre-migration)
> 2) generate a candidate list for purge
>
> For 1) the criterion is: not up to date in the HSM and not recently modified.
> For 2) the criterion is: up to date in the HSM and not recently accessed.
>
> The needed changelog events are "modifications" like:
> - file creation
> - mtime change
> - atime change

For 1) the files are in the log (and in ZFS the log can be reconstructed through a fast search).

The issue here that worries me is the following. Is the coordinator managing "archiving", or is the space manager? Whatever entity does it, it needs to WAIT until a file has been quiescent for some time. ADM's event manager can do that, but how do we do it with HPSS?

Now, interestingly, the Size on MDS (SOM) project does almost precisely this: it monitors a file going idle and transfers the size from the OSSs / clients to the MDS inode. So Lustre is pretty close, but this completes too quickly; commonly archiving is postponed 20 minutes or so.

> The things I do not like in event mode are:
> - If a file is created, filled and removed before copy-out (like a temporary file), we will have useless interaction with the space manager (and useless load).

A stat call to the file is quick, and required anyway to eliminate race conditions.

> - If, for some reason, events are missed, we will have files in Lustre that the HSM does not know about. To resolve this we can use a scan, or find a way to guarantee we will never miss an event.

Lustre logs and ZFS searches are guaranteed NOT to miss anything. No finds are necessary.

> This last point is a strong constraint because Lustre should be able to operate with a dead space manager.
>
> I agree, I am not fond of scanning, but a low-priority, background scan would solve these two issues.

We only need the scan for 2), and as indicated earlier it can be a rare scan. I will not accept scanning for 1).

> For me the space manager and its DB are common to all HSMs and will have no HSM-specific information.

And they will be a major bottleneck. I definitely want to avoid a DB. It is fair to state that all events belong with Lustre. Lustre should define adequate high-performance features for distributed storage of events. The logs or ZFS searches are a good example; similarly, file sets and collecting small files might be good examples.

A key consideration for the design is that by integrating it into Lustre we control the performance of these event management systems much better than through upcalls and databases. Keeping two systems in sync has proven to be the problem when archiving small files (a ridiculously small number of small files can be archived by current HSMs - hundreds - while we need MILLIONS). So our architecture has to plan a major improvement here. The database should go away.

> HSM-specific rules (like which HSM internal storage class a file goes into) will be managed by the HSM copy tool. Do you agree?

Yes.

Peter