We spoke about the HSM plans some 10 days ago. I think that the conclusions
are roughly as follows:

1. It is desirable to reach a first implementation as soon as possible.
2. Some design puzzles remain to ensure that HSM can keep up with Lustre
   metadata clusters.

The steps to reach a first implementation can be summarized as:

1. Include file closes in the changelog, if the file was opened for write.
   Include timestamps in the changelog entries. This allows the changelog
   processor to see files that have become inactive and pass them on for
   archiving.
2. Build an open call that blocks for file retrieval and adapts timeouts to
   avoid error returns.
3. Until a least-recently-used log is built, use the e2scan utility to
   generate lists of candidates for purging.
4. Translate events and scan results into a form that can be understood by
   ADM.
5. Work with a single coordinator, whose role it is to avoid getting
   multiple 'close' records for the same file (a basic filter for events).
6. Do not use initiators - these can come later and assist with load
   balancing and freeing space on demand (both of which we can ignore for
   the first release).
7. Do not use multiple agents - the agents can move stripes of files, etc.,
   and this is not needed with a basic user-level solution based on
   consuming the log. The only thing the agent must do in release one is
   get the attention of a data mover to restore files on demand.

Peter
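To make steps 1, 4 and 5 a little more concrete, here is a minimal sketch of
the kind of user-level changelog consumer the coordinator could run. The
record layout, the CL_CLOSE event code, the inactivity window and the
archive_candidate() hand-off are illustrative assumptions only, not the
actual Lustre changelog API.

/*
 * Minimal sketch of a user-level changelog consumer for steps 1, 4 and 5.
 * The record layout, CL_CLOSE event code, inactivity window and the
 * archive_candidate() hand-off are illustrative assumptions, not the
 * actual Lustre changelog API.
 */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define CL_CLOSE        1              /* hypothetical close-after-write event */
#define INACTIVE_SECS   (30 * 60)      /* files idle this long become candidates */

struct hsm_changelog_rec {             /* hypothetical flattened record */
    uint64_t fid;                      /* file identifier */
    uint32_t event;                    /* event type, e.g. CL_CLOSE */
    time_t   timestamp;                /* when the close happened */
};

/* Hand a FID to the archiving pipeline (stub: translate for ADM, step 4). */
static void archive_candidate(uint64_t fid)
{
    printf("archive candidate: fid %llu\n", (unsigned long long)fid);
}

/* Basic filter (step 5): keep only closes-after-write that have gone quiet. */
void process_records(const struct hsm_changelog_rec *recs, size_t n)
{
    time_t now = time(NULL);

    for (size_t i = 0; i < n; i++) {
        if (recs[i].event != CL_CLOSE)
            continue;                              /* not a close-after-write */
        if (now - recs[i].timestamp < INACTIVE_SECS)
            continue;                              /* still active, revisit later */
        archive_candidate(recs[i].fid);            /* pass on for archiving */
    }
}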
Peter,

Apologies, but I am having some difficulty determining where the boundary
between Lustre and the HSM lies within the first release. Plus I have a few
newbie Lustre questions.

I think your #4 is saying that Lustre will still provide a Space Manager in
the first release, responsible for monitoring filesystem fullness, using
e2scan to pick archive/purge candidates and issuing archive/purge requests
to the HSM, and that these tasks are not performed by the HSM itself. True?
Is the Space Manager logic part of the Coordinator, or is it a separate
entity? Is there one Coordinator/Space Manager pair per filesystem, or one
total per site?

Users will need commands that allow them to archive and recall their files
and/or directory subtrees, and they will want commands like ls and find
that show them the current HSM archive/purge state of their files, so that
they can pre-stage purged files before they are needed, and so that they
can purge unneeded files to effectively manage their own quotas. Will these
commands be provided by Lustre, or by the HSM?

Given that files are only recalled on open, this implies that a file which
is open for either read or write by any client can never be purged,
correct? And a file open for write by any client should never be archived,
since it could be silently changing while the archive is in progress. And
if a file is opened for write after an archive has begun, will the HSM be
sent a cancel request? Is the necessary information available to the Space
Manager and/or Coordinator so that these rules can be enforced?

The HSM data mover needs to be able to open a file by FID without
encountering the adaptive timeout that other users are seeing. The data
mover's I/Os must not change the file's read and write timestamps. The
data mover needs a get_extents(int fd) function to read the file's extent
map so that it can find the location of holes in sparse files and preserve
those holes within its HSM copies. Is there an interface available that
provides this functionality? (A sketch of one way a data mover can discover
holes follows this message.)

In the FID HLD I find mention of an object version field within the FID,
which apparently gets incremented with each modification of the file. Is
that currently implemented in Lustre? I'm thinking of the case where a
file is archived, recalled, modified, archived, recalled, modified... The
HSM will need a way to map the correct HSM copies to the correct version
of the file, so hopefully the version field is already supported.

Does Lustre already support snapshot capabilities, or will it in the
future? When a snapshot is made, each archived/purged file within the
snapshot effectively creates another reference to its copies within the
HSM database. An HSM file copy cannot be removed until it is known that no
references remain to that particular version of the file, either within
the live filesystem or within any snapshot. Will the Coordinator be able
to see the snapshot references, and avoid sending delete requests to the
HSM until all snapshot references for a particular file version have been
removed? Are snapshots read-write or read-only? If read-only, how do you
intend to have users access purged files in snapshots?

I haven't been able to figure out how backup/restore works, or will work,
in Lustre. Standard utilities like tar will wreak havoc by triggering file
recall storms within the HSM. Better is an intelligent backup package
which understands that the HSM already has multiple copies of the file
data, and so the backup program only needs to back up the metadata.
The problem here again is that new references to the HSM copies are being
created, yet those references are not visible to the HSM; it doesn't know
they exist, so methods are needed to ensure that seemingly obsolete HSM
copies are not deleted before the backups that reference them have also
been deleted. If you could provide a short description of how you intend
backup/restore to work in combination with an HSM, or if you could provide
pointers, that would be great.

Regards, Kevan
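On the get_extents() question above: on a local Linux filesystem a data
mover can already enumerate data extents and holes with
lseek(SEEK_DATA/SEEK_HOLE). The sketch below is only a generic illustration
of the interface a data mover needs; it says nothing about what Lustre
itself will expose.

/*
 * Sketch of hole/extent discovery with lseek(SEEK_DATA/SEEK_HOLE).
 * Generic Linux illustration of what a get_extents()-style interface
 * has to provide; not the Lustre data-mover API.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

/* Print the data extents of an open file so an archiver can skip holes. */
int print_data_extents(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);   /* next byte of real data */
        if (data < 0) {
            if (errno == ENXIO)
                break;                            /* only a trailing hole left */
            return -1;
        }
        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of this data extent */
        if (hole < 0)
            return -1;
        printf("data extent: [%lld, %lld)\n",
               (long long)data, (long long)hole);
        pos = hole;
    }
    return 0;
}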
There are rather a lot of questions here - let's give this a go.

On 8/12/08 12:40 PM, "Kevan" <kfr at sgi.com> wrote:

> Peter,
>
> Apologies, but I am having some difficulty determining where the boundary
> between Lustre and the HSM lies within the first release. Plus I have a
> few newbie Lustre questions.
>
> I think your #4 is saying that Lustre will still provide a Space Manager
> in the first release, responsible for monitoring filesystem fullness,
> using e2scan to pick archive/purge candidates and issuing archive/purge
> requests to the HSM, and that these tasks are not performed by the HSM
> itself.

The list will be generated by e2scan initially, and in due course by a more
efficient and scalable LRU log (see HLD). The list will be digested and
acted upon by the HSM policy manager.

> True? Is the Space Manager logic part of the Coordinator, or is it a
> separate entity?

Separate.

> Is there one Coordinator/Space Manager pair per filesystem or one total
> per site?

List generation will probably be per server target (per MDT or OST, TBD).
Rick Matthews can tell us if the policy manager manages sites or file
systems.

> Users will need commands that allow them to archive and recall their
> files and/or directory subtrees, and they will want commands like ls and
> find that show them the current HSM archive/purge state of their files so
> that they can pre-stage purged files before they are needed, and so that
> they can purge unneeded files to effectively manage their own quotas.
> Will these commands be provided by Lustre, or by the HSM?

These will be commands issued to Lustre, extensions of the "lfs" commands.

> Given that files are only recalled on open, this implies that a file
> which is open for either read or write by any client can never be purged,
> correct?

Yes.

> And a file open for write by any client should never be archived since it
> could be silently changing while the archive is in progress.

If the HSM is used for backup, one probably wants to back up the file
anyway; this is a decision of the policy manager.

> And if a file is opened for write after an archive has begun, the HSM
> will be sent a cancel request?

The file system will generate events; the policy manager can decide how it
acts on them.

> Is the necessary information available to the Space Manager and/or
> Coordinator so that these rules can be enforced?
>
> The HSM data mover needs to be able to open a file by FID without
> encountering the adaptive timeout that other users are seeing. The data
> mover's I/Os must not change the file's read and write timestamps. The
> data mover needs a get_extents(int fd) function to read the file's extent
> map so that it can find the location of holes in sparse files and
> preserve those holes within its HSM copies. Is there an interface
> available that provides this functionality?

Planned in detail. See HLD.

> In the FID HLD I find mention of an object version field within the FID,
> which apparently gets incremented with each modification of the file.
> Is that currently implemented in Lustre?

Yes.

> I'm thinking of the case where a file is archived, recalled, modified,
> archived, recalled, modified... The HSM will need a way to map the
> correct HSM copies to the correct version of the file, so hopefully the
> version field is already supported.

Only one version of a file is present in the file system.
The version is merely a unique indicator that a file has changed. (A small
sketch of how an HSM database might key its copies on that indicator
follows this message.)

> Does Lustre already support snapshot capabilities, or will it in the
> future? When a snapshot is made, each archived/purged file within the
> snapshot effectively creates another reference to its copies within the
> HSM database. An HSM file copy cannot be removed until it is known that
> no references remain to that particular version of the file, either
> within the live filesystem or within any snapshot. Will the Coordinator
> be able to see the snapshot references, and avoid sending delete requests
> to the HSM until all snapshot references for a particular file version
> have been removed? Are snapshots read-write or read-only? If read-only,
> how do you intend to have users access purged files in snapshots?

TBD. The key issue with snapshots is where multiple files in snapshots have
shared blocks. Dedup in ZFS brings similar issues.

> I haven't been able to figure out how backup/restore works, or will work,
> in Lustre. Standard utilities like tar will wreak havoc by triggering
> file recall storms within the HSM. Better is an intelligent backup
> package which understands that the HSM already has multiple copies of the
> file data, and so the backup program only needs to back up the metadata.
> The problem here again is that new references to the HSM copies are being
> created, yet those references are not visible to the HSM; it doesn't know
> they exist, so methods are needed to ensure that seemingly obsolete HSM
> copies are not deleted before the backups that reference them have also
> been deleted. If you could provide a short description of how you intend
> backup/restore to work in combination with an HSM, or if you could
> provide pointers, that would be great.

The HSM should have a metadata database to implement "tape side" (as
opposed to file system side) policy. That database might hold all metadata
and manage references. Examples of such policies are compliance policies
(e.g. delete files from this year) and backup policies (e.g. retain this or
that set of files). I expect that, like future file systems, a new concept
of fileset is required to be very flexible in what policies are applied to.

Rick ...

Peter
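Since the version field is just a change indicator, the natural place to
keep the version-to-copy mapping is on the HSM side. A minimal sketch, with
an assumed record layout that is not part of any existing interface:

/*
 * Sketch of keeping the copy-to-version mapping on the HSM side, given
 * that Lustre stores only the live version and the version field is just
 * a change indicator. The record layout is an assumption for
 * illustration, not part of any existing interface.
 */
#include <stdbool.h>
#include <stdint.h>

struct hsm_copy {
    uint64_t fid;          /* Lustre file identifier */
    uint64_t version;      /* version recorded when the copy was made */
    char     tape_id[32];  /* where the HSM put this copy */
};

/* A copy matches the live file only if the version has not moved on. */
bool hsm_copy_is_current(const struct hsm_copy *copy, uint64_t live_version)
{
    return copy->version == live_version;
}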
Rick Matthews wrote:
> On 08/29/08 15:38, Nathaniel Rutman wrote:
>> Rick - I'm finally getting a chance to look at the ADM docs.
>> Most notably, as far as I'm concerned, it looks like ADM depends on
>> DMAPI filesystem interfaces.
>> What we have in Lustre at the moment is a changelog, which includes all
>> namespace ops (file create, destroy, rename, etc.) and will include the
>> closed-after-write (#1 below); and e2scan, which can be used to
>> semi-efficiently walk the filesystem gathering mtime/atime info (#3).
> DMAPI is an implementation choice. You are correct in assuming that what
> it needs is event information from which an informed decision is made.
> If the necessary information is not with the event (because of later
> change, or for efficiency), the event/policy piece will gather the needed
> info. I don't think there is anything outside of standard POSIX needed.
>> We'll have to add a flag into the lov_ea indicating "in HSM", and then
>> block for file retrieval (#2).
> Correct... with a small twist: the HSM holds copies of data even when
> they continue to exist on native disk. The "release" of this space then
> doesn't need to wait for a slower data mover. So, change "in HSM" to
> "only in HSM" and you are correct.

Right, that's what I had in mind.

>> So we need to take these three items and provide some kind of interface
>> that ADM is comfortable with, while not strictly following the DMAPI
>> "check with us for every system event" paradigm.
>> The only synchronous event here is #2, where we are requesting a file
>> out of HSM.
> Yep.
>>
>> From the ADM spec:
>> Changes to ZFS will be fasttracked separately and putback to the ONNV
>> gate. Much of DMAPI's interaction with ZFS for dm_xxx APIs is done
>> through VFS interfaces. Imported VFS interfaces are in the table below.
>> A few additional changes are necessary, such as calling DMAPI to send
>> events, and not updating timestamps for invisible IO. The plan and
>> current prototype adds a flag value (FINVIS) to be passed into the
>> VOP_READ, VOP_WRITE, and VOP_SPACE interfaces for invisible IO.
>>
>> If I'm understanding things correctly, if Lustre just honors the
>> open(..., O_WRONLY | FINVIS) call, and sends the cache miss request
>> (#2), that is sufficient interaction to pull an HSM file back into
>> Lustre.
>> We would need a second element that would read the changelogs and e2scan
>> results to determine when/which files to archive, and the
>> open(..., O_RDONLY | FINVIS) call to get the data. This element could be
>> userspace and is asynchronous. Would this talk directly to ADM? Use
>> DMAPI calls?
> Correct... we would create an interface for consuming your events. (By
> "we" I mean some subset of the two teams.) Our DMAPI implementation
> relies heavily on filtering to prevent event floods. As we've discussed,
> since filters just remove unwanted things, they can occur in the
> "kernel" / log generation, and in user space, without impact on the
> resulting event chain. The "invisible I/O" just prevents additional
> events and size/modtime/access-time changes. Need not be DMAPI.

OK, so this is the "event/policy piece". JC, I think this is a subcomponent
of the "coordinator" piece from the old HSM HLD. I see no reason why this
can't be a userspace program. I imagine this piece feeds the events/LRU
list into the ADM policy engine, and then somebody (this piece? ADM
itself?) starts doing the FINVIS copyouts into ADM.
Does it make sense to send the cache miss request to this same event/policy
piece, or to ADM directly? Somebody needs to do the FINVIS copyin.

Thinking a little more about the Lustre internals for step #2: instead of
blocking the open call on the MDT, maybe it makes sense to grant the open
lock to the client, who receives the "only in HSM" flagged LOV md and
locally blocks read/write requests until the LOV md has been updated (maybe
signalled through a lock callback on the file). (We've talked about adding
a file layout lock in the past; maybe that is appropriate here.)

>> Does this sound right?
> Yep.
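For concreteness, the copyin itself is just a plain copy loop from the data
mover's side. Obtaining the destination descriptor - open-by-FID with an
FINVIS-style invisible-I/O flag so that no timestamps are touched and no
new events are generated - is the Lustre-specific part and is deliberately
left out of the sketch below.

#include <unistd.h>

/*
 * Sketch of the data mover's copyin: dst_fd is assumed to have been
 * opened by FID with an FINVIS-style invisible-I/O flag (hypothetical),
 * so the writes below update neither mtime/atime nor generate new events.
 */
int hsm_copyin(int src_fd, int dst_fd)
{
    char buf[64 * 1024];
    ssize_t n;

    while ((n = read(src_fd, buf, sizeof(buf))) > 0) {
        char *p = buf;
        while (n > 0) {                          /* cope with short writes */
            ssize_t w = write(dst_fd, p, (size_t)n);
            if (w < 0)
                return -1;
            p += w;
            n -= w;
        }
    }
    return n < 0 ? -1 : 0;                       /* 0 on clean EOF */
}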
Hello, sorry for breaking into the discussion. Please find my comments
inlined.

Nathaniel Rutman wrote:
> Rick Matthews wrote:
>> On 08/29/08 15:38, Nathaniel Rutman wrote:
(snip)
>>> We'll have to add a flag into the lov_ea indicating "in HSM", and then
>>> block for file retrieval (#2).
>> Correct... with a small twist: the HSM holds copies of data even when
>> they continue to exist on native disk. The "release" of this space then
>> doesn't need to wait for a slower data mover. So, change "in HSM" to
>> "only in HSM" and you are correct.
> Right, that's what I had in mind.

- What is the definition of "only in HSM"?
- Are these flags exposed to the end user?

Consider a use case: a user has someFile striped across two OSTs, OST1 and
OST2, and the file is in HSM as well. OST2 is down. The user reads the file
and reaches the stripe residing on OST2 (or open() checks OST status). In
this case it would be nice to stage the file from tape, either as a whole
or only the stripes residing on OST2. Also, when OST2 restarts, it shall
remove the stale stripes, and the MDT shall point to the right OST set
after retrieval. I realize this makes things more complicated and adds more
triggers for file retrieval to #2.

Back to the flags definition: staging from tape may thus be triggered by
several conditions, including

  (File_is_Resident) and (File_is_in_HSM) and (OST_is_Not_Available)

in addition to

  (! File_is_Resident) and (File_is_in_HSM)

(a small sketch of this logic follows this message). It may be worth
keeping the flags (File_is_Resident) and (File_is_in_HSM) separate, as
"File_is_in_HSM" is a fundamental file property indicating "permanent"
storage of the file, while the other flags reflect file state (file is
resident on disk) or a transient condition (OST is down).

The other use case is when the end user writes a file to the Lustre/HSM
system and waits until the file reaches tape before deleting the original,
checking the file status from time to time. This can be done if the
"File_is_in_HSM" flag is exposed to the end user by some command, or if the
HSM fileID is set in an EA. In this case the user wants to know the "is in
HSM" part of the flag regardless of "file is resident on disk". Keeping the
flags separate will help with logic and synchronization.

Best regards, Alex.

(snip)
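Alex's conditions translate directly into a small predicate. The flag names
and bit values below are only illustrations of keeping "resident" and "in
HSM" separate, not an existing Lustre interface.

/*
 * Sketch of the flag logic above: keep "resident" and "in HSM" as
 * separate bits and derive the staging decision from them plus the
 * transient OST condition. Names and values are illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>

#define HSM_FILE_RESIDENT  0x1   /* data currently present on the OSTs */
#define HSM_FILE_IN_HSM    0x2   /* a copy exists in the HSM */

/* Should this access trigger staging from tape? */
bool hsm_needs_staging(uint32_t flags, bool needed_ost_available)
{
    if (!(flags & HSM_FILE_IN_HSM))
        return false;                 /* nothing on tape to stage from */
    if (!(flags & HSM_FILE_RESIDENT))
        return true;                  /* purged: the HSM copy is the only one */
    return !needed_ost_available;     /* resident, but a needed OST is down */
}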
On 9/3/08 10:58 PM, "Alex Kulyavtsev" <aik at fnal.gov> wrote:

> Hello, sorry for breaking into the discussion. Please find my comments
> inlined.
(snip)
> - What is the definition of "only in HSM"?
> - Are these flags exposed to the end user?

They will be extended attributes accessible with the xattr utilities. If
there is a standard for such attributes, we should use it to avoid
introducing yet another set of product-specific attributes. (A small sketch
of reading such an attribute follows this message.)

> Consider a use case: a user has someFile striped across two OSTs, OST1
> and OST2, and the file is in HSM as well. OST2 is down. The user reads
> the file and reaches the stripe residing on OST2 (or open() checks OST
> status). In this case it would be nice to stage the file from tape,
> either as a whole or only the stripes residing on OST2. Also, when OST2
> restarts, it shall remove the stale stripes, and the MDT shall point to
> the right OST set after retrieval. I realize this makes things more
> complicated and adds more triggers for file retrieval to #2.

Nice idea, but building this into the FS is really a refinement that we
should not be going after too soon. When OST2 returns to the cluster we
have cleanup work, and building all the administration infrastructure for
this is a lot of work. With a copy_from_hsm command users should be able
to do this:

  lsxattr <pathname>          -- see that the file is on tape
  lfs getfid <pathname>       -- get its FID
  copy_from_hsm <fid> <path>  -- copy it in

> Back to the flags definition: staging from tape may thus be triggered by
> several conditions, including
>
>   (File_is_Resident) and (File_is_in_HSM) and (OST_is_Not_Available)
>
> in addition to
>
>   (! File_is_Resident) and (File_is_in_HSM)
>
> It may be worth keeping the flags (File_is_Resident) and (File_is_in_HSM)
> separate, as "File_is_in_HSM" is a fundamental file property indicating
> "permanent" storage of the file, while the other flags reflect file state
> (file is resident on disk) or a transient condition (OST is down).

Reminder: File_is_in_HSM needs to be cleared if the file changes again.
Files change on the OSS, not on the MDS; where are the flags? We can trust
version propagation from OSS to MDS only when SOM is present.

> The other use case is when the end user writes a file to the Lustre/HSM
> system and waits until the file reaches tape before deleting the
> original, checking the file status from time to time.

Yes.

> This can be done if the "File_is_in_HSM" flag is exposed to the end user
> by some command, or if the HSM fileID is set in an EA.

The HSM fileID will NOT be in the EA at all. The flag can be exposed.

> In this case the user wants to know the "is in HSM" part of the flag
> regardless of "file is resident on disk". Keeping the flags separate will
> help with logic and synchronization.

We want a flag saying the file is NOT resident on disk, since otherwise we
need to tag all files, but that is a detail.

Peter
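Assuming the state does end up in an extended attribute as Peter describes,
a user-side check could look roughly like the sketch below. The attribute
name "trusted.hsm_state" and the flag encoding are placeholders; the real
name and format are still to be defined.

/*
 * Sketch of a user-side check once the HSM state is exposed as an
 * extended attribute. Attribute name and flag encoding are placeholders.
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/xattr.h>

#define HSM_FILE_IN_HSM       0x2   /* placeholder: a copy exists in the HSM */
#define HSM_FILE_NOT_RESIDENT 0x4   /* placeholder: data purged from the OSTs */

int print_hsm_state(const char *path)
{
    uint32_t flags = 0;

    /* Placeholder attribute name; an error here means "no HSM state recorded". */
    if (getxattr(path, "trusted.hsm_state", &flags, sizeof(flags)) < 0) {
        printf("%s: no HSM state\n", path);
        return 0;
    }
    printf("%s:%s%s\n", path,
           (flags & HSM_FILE_IN_HSM)       ? " archived" : "",
           (flags & HSM_FILE_NOT_RESIDENT) ? " released" : " resident");
    return 0;
}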