Aurelien Degremont wrote:
> Hello all
>
> I'm sending this e-mail directly because, for some unknown reason, it
> did not reach the mailing lists (either lustre-hsm-core-ext or
> lustre-devel) last week.
>
> Find attached 2 schemas presenting the various messages exchanged by
> Lustre HSM components for copyin and copyout. Tell me if this is not
> what you were expecting, I can fix them for tomorrow's conf call. I
> think I will add some other schemas also.
>
> Some points we did not discuss at last conf call:
>
> * File unlinks:
>
> - HSM object removal should be async.

Agreed, trigger should just be changelog unlink entry.

> - We should not link hsm object, even in v1. Could we manage hsm object
> deletion like ost object deletion and manage orphan in the same way?

Since the unlink event trigger is the changelog record, the policy
engine should simply not cancel the changelog record until the HSM
confirms the unlink.

> - Presently, we could also leak hsm objects if the file is dirtied when
> being copied out. In this case, the file will be tagged dirtied, with no
> copyout_begin/complete flag. So the MDT will not request HSM removal
> but there is something to delete there. Maybe the copyout mechanisms
> should be adapted.

How about we never clear the copyout_begin bit? This is really for the
coordinator's benefit so it knows a copyout is in progress on that
file, but since we're having regular status updates to the coordinator
from the agent, there's no real need for that bit. So instead we have
the bit "a_file_exists_in_hsm" aka hsm_exists.

But we don't even need that - the MDT does not "request for HSM
removal", but instead the policy engine just watches the changelog for
unlink events. Ah, now I see the problem with using the changelog -
this forces the policy engine to remember which files are on HSM, or
accept an error return code, but in any case may result in much undue
load on the HSM when deleting non-HSM'ed files. So what do we do?
Ignore the changelog and have the MDT directly signal the coordinator
to do HSM unlinks? That may be fine. In that case, I think if we leak
files after we tell the coordinator to delete them it is not much of a
problem.

> * HSM dirty bit.
>
> - should be updated with laziness.
> - Is it possible to implement it like the lazy file size? That means,
> manage the dirty bit, per OST object, and lazily update it on the MDT?

Since file mtime/size is already updated this way, we can just use any
attr change as the dirty indicator; we don't need an actual bit per
object. Any setattr should update the MDT dirty bit, and most setxattr
should too (not the hardlink/parent xattr however, and maybe not the
XATTR_TRUSTED_PREFIX ones).

> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
> modified, can we do counter += 1 ? That way 'counter' could be used as
> a 'light' file revision. You compare two versions of this variable, if
> they differ, the file has been modified (this is not
> intended to check 'counter_c1 < counter_c2' but just 'counter_c1 !=
> counter_c2', that way, you can have circular counters.)

I have no objection, although I don't see the benefit right now. E.g.
how is that different than checking the mtime?

> - Could a policy test be based on file path (not just filename and
> properties) ? This is a rule we presently use in our hsm tools. I do
> not see how to get the file path from the changelog data ?

The changelog data has file and parent FID; if you want more path than
this you can do a "lfs fid2path" to reconstruct the entire path name.
Note however this returns only the "first" path of a hardlinked file.
(Is this a limitation? Do I need to fix fid2path?)

> - Could this flag be exposed to userspace via liblustreapi? Maybe this
> flag should be set on file creation also? Doing this, Policy Engines
> could use this flag to know easily if the file is up to date in hsm
> or not.

Sounds good.

> * Policy Engine
>
> - It needs to:
> . read changelogs (mdt)
> . df (mdt/client)
> . lfs df (per ost) (mdt/client)
> . scan namespace (client)
> . lfs getstripe by fid (client)
> . stat file by fid (client)
>
> The only thing the engine will lack on a client is the changelogs. Maybe
> it could be a good idea to export the changelogs on some 'trusted
> clients' ?

I think it's sticky to impose certain privileged clients, but maybe
exporting to all clients isn't so bad. Superuser privs on any client
gives them access. If anyone really hates this, we can add a tunable on
the MDT to allow/disallow all client access.

> If not, we will be forced to have MDT, client mount and policy engine on
> the same node or split the policy engine into two components (very bad
> idea to impose that on the engine). Potential overhead?

ok, client access to changelogs sounds like a reasonable requirement.
[Note: this actually happens to solve a problem I haven't figured out
yet, which is to limit access to only disk-committed changelog records.]

> ------------------------------------------------------------------------

#10 is "open reply", not "i/o reply", but a very nice diagram! Can you
add these to the wiki?
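
To illustrate the counter proposal quoted above, here is a minimal sketch, not Lustre code. It assumes a hypothetical 16-bit per-file change counter; the width and the names `ChangeCounter`, `bump` and `changed_since` are invented for the example. It shows why only an inequality test is meaningful once the counter is allowed to wrap.

```python
# Illustrative only: a wrapping per-file change counter.
# COUNTER_BITS is an assumed width, not something Lustre defines.
COUNTER_BITS = 16
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class ChangeCounter:
    """Hypothetical stand-in for a per-file 'light revision' counter."""

    def __init__(self, value=0):
        self.value = value & COUNTER_MASK

    def bump(self):
        """Record one modification; the counter wraps to 0 at the top."""
        self.value = (self.value + 1) & COUNTER_MASK
        return self.value


def changed_since(saved, current):
    """Compare by inequality only; ordering is meaningless after a wrap."""
    return saved != current
```

With a saved value at the top of the range, one more modification wraps the counter to 0: an ordering test would wrongly conclude nothing newer exists, while the inequality test still reports a change. The usual caveat applies: exactly 2^16 modifications between two checks would go unnoticed with this width.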
On Oct 27, 2008 17:49 +0100, Aurelien Degremont wrote:
> Some points we did not discuss at last conf call:
>
> * File unlinks:
>
> - HSM object removal should be async.
> - We should not link hsm object, even in v1. Could we manage hsm object
> deletion like ost object deletion and manage orphan in the same way?

I wouldn't object to this - there could be an "HSM unlink" llog, similar
to the OST unlink llog, that the HSM coordinator (either in the kernel,
or in userspace) processes at startup. The difficulty is to know when
the llog record can be cancelled.

> - Presently, we could also leak hsm objects if the file is dirtied when
> being copied out. In this case, the file will be tagged dirtied, with no
> copyout_begin/complete flag. So the MDT will not request HSM removal
> but there is something to delete there. Maybe the copyout mechanisms
> should be adapted.

I would recommend that we keep a reference to a "dirty" HSM object even
if the copyout did not complete successfully, and the HSM policy engine
should decide if the dirty object is kept or deleted. In some cases it
may never be possible to do a complete copyout of the file, and having
some copy of the file would be better than having none at all.

> * HSM dirty bit.
>
> - should be updated with laziness.
> - Is it possible to implement it like the lazy file size? That means,
> manage the dirty bit, per OST object, and lazily update it on the MDT?

> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
> modified, can we do counter += 1 ? That way 'counter' could be used as
> a 'light' file revision. You compare two versions of this variable, if
> they differ, the file has been modified (this is not intended to check
> 'counter_c1 < counter_c2' but just 'counter_c1 != counter_c2', that way,
> you can have circular counters.)

The MDS in 1.8 (and soon 2.0) will already keep a version counter for
all changes to the MDS inode. The OSTs will also keep version numbers
for all of the objects there.

> - Could this flag be exposed to userspace via liblustreapi? Maybe this
> flag should be set on file creation also? Doing this, Policy Engines
> could use this flag to know easily if the file is up to date in hsm
> or not.
>
> * Policy Engine
>
> - It needs to:
> . read changelogs (mdt)
> . df (mdt/client)
> . lfs df (per ost) (mdt/client)
> . scan namespace (client)
> . lfs getstripe by fid (client)
> . stat file by fid (client)
>
> The only thing the engine will lack on a client is the changelogs. Maybe
> it could be a good idea to export the changelogs on some 'trusted
> clients' ?

Yes, that is already considered.

> If not, we will be forced to have MDT, client mount and policy engine on
> the same node or split the policy engine into two components (very bad
> idea to impose that on the engine). Potential overhead?

If the policy engine is running via an external database (e.g. MySQL),
it wouldn't be impossible to just have the Changelog reader do the
database insertions remotely, after looking up the pathname.

> - Could a policy test be based on file path (not just filename and
> properties) ? This is a rule we presently use in our hsm tools. I do
> not see how to get the file path from the changelog data (lfs fid2path?) ?

I believe the Changelog will report a full pathname (relative to the
root of the filesystem). This will be exported via llapi to userspace.
Nathan?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Nathaniel Rutman wrote:
>> - HSM object removal should be async.
> Agreed, trigger should just be changelog unlink entry.

I'm not sure Lustre needs the policy engine for managing the HSM
removals. It could trigger them automatically (like the copy-in
mechanisms) when the file is deleted in Lustre. Lustre could still live
for a long time without the PolicyEngine/Space Manager; we could
imagine this lasting several hours.

>> - We should not link hsm object, even in v1. Could we manage hsm object
>> deletion like ost object deletion and manage orphan in the same way?
> Since the unlink event trigger is the changelog record, the policy
> engine should simply not cancel the changelog record until the HSM
> confirms the unlink.

For the moment, the PolicyEngine has no way to know the copytool has
successfully deleted the file.

> How about we never clear the copyout_begin bit? This is really for
> the coordinator's benefit so it knows a copyout is in progress on that
> file, but since we're having regular status updates to the coordinator
> from the agent, there's no real need for that bit. So instead we have
> the bit "a_file_exists_in_hsm" aka hsm_exists.
> But we don't even need that - the MDT does not "request for HSM
> removal", but instead the policy engine just watches the changelog for
> unlink events. Ah, now I see the problem with using the changelog -
> this forces the policy engine to remember which files are on HSM, or
> accept an error return code, but in any case may result in much undue
> load on the HSM when deleting non-HSM'ed files. So what do we do?
> Ignore the changelog and have the MDT directly signal the coordinator
> to do HSM unlinks? That may be fine. In that case, I think if we
> leak files after we tell the coordinator to delete them it is not much
> of a problem.

If we store in a llog the HSM objects that need to be removed, and only
delete the llog records when the copytool says it's fine, we will not
leak files, and if the coordinator crashes, the copyin and removal
requests will be resent automatically. The PolicyEngine will also
re-send copy-out requests.

>> * HSM dirty bit.
>>
>> - should be updated with laziness.
>> - Is it possible to implement it like the lazy file size? That means,
>> manage the dirty bit, per OST object, and lazily update it on the MDT?
>
> Since file mtime/size is already updated this way, we can just use any
> attr change as the dirty indicator; we don't need an actual bit per
> object.

Dirty means data were changed, not metadata.

>> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
>> modified, can we do counter += 1 ? That way 'counter' could be used as
>> a 'light' file revision. You compare two versions of this variable, if
>> they differ, the file has been modified (this is not
>> intended to check 'counter_c1 < counter_c2' but just 'counter_c1 !=
>> counter_c2', that way, you can have circular counters.)
> I have no objection, although I don't see the benefit right now. E.g.
> how is that different than checking the mtime?

mtime cannot be trusted. mtime is a user-exposed value that the user
can change as he likes:

$ touch -t 200101010000 foo
$ ls -l foo
-rw-r--r-- 1 degremont user 0 Jan 1 2001 foo

> The changelog data has file and parent FID, if you want more path than
> this you can do a "lfs fid2path" to reconstruct the entire path name.
> Note however this returns only the "first" path of a hardlinked file.
> (Is this a limitation? Do I need to fix fid2path?)

Ok, this is fine, enough for our needs.

> #10 is "open reply", not "i/o reply", but a very nice diagram! Can
> you add these to the wiki?

Thanks. Done.

Aurélien
On Oct 28, 2008 15:41 +0100, DEGREMONT Aurelien wrote:
> Andreas Dilger wrote:
>> I would recommend that we can keep a reference to a "dirty" HSM object
>> even if the copyout did not complete successfully, and HSM policy engine
>> should decide if the dirty object is kept or deleted. In some cases
>> it may never be possible to do a complete copyout of the file, and having
>> some copy of the file would be better than having none at all.
>
> So we will have three states for the hsm object:
> - exist
> - completed (exist & copy was completed)
> - uptodate (exist, copy was completed and the lustre file was not dirtied)

We already need to have the distinction between "completed" and
"uptodate" because an "uptodate" HSM copy stops being uptodate as soon
as the file is again modified.

I was a bit unclear when I wrote "...a complete copyout of the file".
What I meant was "impossible to ever complete an uptodate copyout of
the file if it is continually changing". I don't think we need to make
any distinction between "exist" and "complete" because both mean "not
uptodate".

>>> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
>>> modified, can we do counter += 1 ? That way 'counter' could be used as
>>> a 'light' file revision. You compare two versions of this variable, if
>>> they differ, the file has been modified (this is not intended to check
>>> 'counter_c1 < counter_c2' but just 'counter_c1 != counter_c2', that way,
>>> you can have circular counters.)
>>>
>> The MDS in 1.8 (and soon 2.0) will already keep a version counter for all
>> changes to the MDS inode. The OSTs will also keep version numbers for
>> all of the objects there.
>>
> In our first (and old) hlds, we based several mechanisms on such
> information. But we were told that the version will be available
> for MDT but not for OST object

This hasn't changed - the OST object versions are local to the OSTs.
We could get this OST version information at the agent (just like with
file size being fetched from all OSTs) and store it with the HSM
archive copy. This can be a later optimization, however. I think mtime
is itself sufficient for an initial indication of file data changes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
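
The later optimization Andreas describes (fetching the OST object versions at the agent and storing them with the archive copy) could look roughly like the sketch below. Everything here is an assumption made for illustration: versions are modelled as plain integers, and `VersionSnapshot`, `take_snapshot` and `archive_is_uptodate` are hypothetical names, not llapi calls.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class VersionSnapshot:
    """Versions recorded alongside an HSM archive copy (hypothetical)."""
    mds_version: int                # version of the MDS inode
    ost_versions: Tuple[int, ...]   # one per OST object, in stripe order


def take_snapshot(mds_version: int, ost_versions: Iterable[int]) -> VersionSnapshot:
    """Gather versions at the agent, much like file size is gathered from all OSTs."""
    return VersionSnapshot(mds_version, tuple(ost_versions))


def archive_is_uptodate(stored: VersionSnapshot, current: VersionSnapshot) -> bool:
    """The archive copy is uptodate only if no recorded version has moved."""
    return stored == current
```

Unlike mtime, such a snapshot cannot be reset by an ordinary user, which is the property Aurélien was asking for; the cost is one extra round trip to each OST holding a stripe.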
Our conclusions from the meeting about HSM unlink, from
http://arch.lustre.org/index.php?title=HSM_Migration

5.4 unlink

 1. A client issues an unlink for a file to the MDT.
 2. The MDT includes the "hsm_exists" bit in the changelog unlink entry.
 3. The policy engine determines if the file should be removed from HSM.
 4. Policy engine sends HSMunlink FID to coordinator via MDT ioctl.
    1. Yuck - we can't do direct ioctls on the MDT from a client
       node. We can only do ioctls on a file.
       Maybe we need to implement a .lustre/device/XXX dir, where all
       MDT/OSTs are listed, and act as stub files for handling ioctls.
       Or maybe policy engine talks to agent / tool directly for
       unlinks?
 5. The coordinator sends a request to one of its agents for the
    corresponding removal.
 6. The agent spawns the HSM tool to do this removal.
 7. When HSM removal is complete, policy engine cancels changelog
    unlink record.
    1. How does agent/HSM tool signal to policy engine that HSM
       removal is complete?
 8. In case of agent crash, unlink record will remain uncancelled in
    the changelog; policy engine should restart processing at the
    first uncancelled record.

There are two open issues:
- How can the policy engine tell the coordinator to unlink an HSM
  object, when no corresponding object exists on the MDT for us to
  ioctl() on?
  - which coordinator to talk to for CMD?
  - since unlink isn't data movement, maybe all unlinks can be
    originated from policy engine directly? (direct call to
    HSMunlinkHelper executable)
- How does HSMunlinkHelper return a signal to the policy engine that
  the removal is complete?
  - if policy engine directly calls HSMunlinkHelper this is easy...

DEGREMONT Aurelien wrote:
>>> * HSM dirty bit.
>>>
>>> - should be updated with laziness.
>>> - Is it possible to implement it like the lazy file size? That means,
>>> manage the dirty bit, per OST object, and lazily update it on the MDT?
>>>
>> Since file mtime/size is already updated this way, we can just use any
>> attr change as the dirty indicator; we don't need an actual bit per
>> object.
>>
> Dirty means data were changed, not metadata.

Actually a file is dirty if either changed, depending on what you are
storing in HSM: filename / path? EAs?

My point was that you can use the mtime attr change as an indicator
that some data possibly changed. It is not sufficient to show that it
absolutely has changed, so policy engine could do something else to try
to verify the change, or simply assume that the file is changed, mark
it dirty, and reschedule for copyout -- no real harm done.

Yes, I agree it would be ideal to have a true verifiable "this file has
changed" versioning, but since that doesn't exist yet, I don't think we
need to hold up HSM development for it.
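
Steps 2, 3, 7 and 8 of the unlink flow above can be sketched as follows. This is a toy model under stated assumptions: changelog records are plain objects with consecutive indices, "cancelling" is modelled as advancing the index of the first uncancelled record, and `UnlinkRecord`, `process_unlinks` and the `remove_from_hsm` callback are hypothetical names, not Lustre interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class UnlinkRecord:
    """Toy changelog unlink entry carrying the hsm_exists bit (step 2)."""
    index: int
    fid: str
    hsm_exists: bool


def process_unlinks(records: Iterable[UnlinkRecord],
                    first_uncancelled: int,
                    remove_from_hsm: Callable[[str], bool]) -> int:
    """Process unlink records in order, starting at the first uncancelled one.

    Returns the index of the new first uncancelled record. Records for
    files that never reached the HSM are cancelled without contacting it.
    """
    next_index = first_uncancelled
    for rec in sorted(records, key=lambda r: r.index):
        if rec.index < first_uncancelled:
            continue  # already cancelled in an earlier run
        if rec.hsm_exists and not remove_from_hsm(rec.fid):
            break     # removal not confirmed: keep this and later records
        next_index = rec.index + 1  # safe to cancel up to here
    return next_index
```

If the removal callback fails or the agent crashes, the returned index stays at the unconfirmed record, so a restart resumes there (step 8). The `hsm_exists` test is what avoids loading the HSM with removals for files that were never archived.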
Nathaniel Rutman wrote:
> 5.4 unlink
>
> 1. A client issues an unlink for a file to the MDT.
> 2. The MDT includes the "hsm_exists" bit in the changelog unlink entry
> 3. The policy engine determines if the file should be removed from HSM
> 4. Policy engine sends HSMunlink FID to coordinator via MDT ioctl
>    1. Yuck - we can't do direct ioctls on the MDT from a client
>       node. We can only do ioctls on a file.
>
>       maybe we need to implement a .lustre/device/XXX dir,
>       where all MDT/OSTs are listed, and act as stub files for
>       handling ioctls. or maybe policy engine
>       talks to agent / tool directly
>       for unlinks?

Can't we add an ioctl on /dev/obd or on the /mnt/lustre root dir? Or
even on .lustre/fid, passing the fid in the ioctl args? I'm not fond of
.lustre/device/XXX dirs...

> 5. The coordinator sends a request to one of its agent for the
>    corresponding removal.
> 6. The agent spawns the HSM tool to do this removal.
> 7. When HSM removal is complete, policy engine cancels changelog
>    unlink record
>    1. How does agent/HSM tool signal to policy engine that HSM
>       removal is complete?
> 8. In case of agent crash, unlink record will remain uncancelled in
>    the changelog; policy engine should restart processing at the
>    first uncancelled record.
>
> There's two open issues:
> - How for policy engine to tell coordinator to unlink an HSM object,
> when no corresponding object exists on the MDT for us to ioctl() on
> -which coordinator to talk to for CMD?

If we implement an ioctl like ioctl(.lustre/fid, HSMUnlink,
fid=0x0000121561), can the API find the right MDT from the FID? (Can
FLD do this for an already removed file?)

> - How does HSMunlinkHelper return a signal to the policy engine that
> the removal is complete
> -if policy engine directly calls HSMunlinkHelper this is easy...

I think there is a more general issue concerning feedback for the
PolicyEngine. Surely the PolicyEngine will need information about the
other requests it sends to the Coordinator. We should think of a more
general mechanism to inform it of the success or failure of its
requests. Should the (successful) HSM events become changelog events
(hsm_copyin/hsm_copyout/hsm_remove)? Could another program be
interested in such events?

Aurélien
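
One way to picture the general feedback mechanism suggested above, assuming the proposed hsm_copyin/hsm_copyout/hsm_remove events really were delivered through the changelog: the policy engine keeps a table of requests it has sent and clears an entry when the matching event shows up. This is only a sketch; `PendingRequests` and its methods are hypothetical, and events are modelled as plain (type, FID) pairs.

```python
class PendingRequests:
    """Toy bookkeeping of requests a policy engine has sent to the coordinator."""

    def __init__(self):
        self._pending = set()

    def sent(self, action, fid):
        """Remember a request, e.g. sent("hsm_remove", fid)."""
        self._pending.add((action, fid))

    def on_event(self, event_type, fid):
        """Handle a completion event read from the changelog.

        Returns True if the event completed a request this engine sent,
        False if it belongs to someone else or was already handled.
        """
        key = (event_type, fid)
        if key in self._pending:
            self._pending.remove(key)
            return True
        return False

    def outstanding(self):
        """Requests still waiting for confirmation, e.g. to resend after a crash."""
        return sorted(self._pending)
```

Because the events would sit in the changelog rather than in a private channel, any other consumer could read them too, which bears on the question of whether other programs might be interested.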