thr3ads.net - Lustre devel - [Lustre-devel] HSM comments [Oct 2008]


Nathaniel Rutman

2008-Oct-27 23:02 UTC

[Lustre-devel] HSM comments

Aurelien Degremont wrote:
> Hello all
>
> I'm sending this e-mail directly because for some unknown reasons, it
> did not reach the mailing lists (either lustre-hsm-core-ext or
> lustre-devel) last week.
>
> Find attached, 2 schemas presenting the various messages exchanged by 
> Lustre HSM components for copyin and copyout. Tell me if this is not 
> what you were expecting, I can fix them for tomorrow conf call. I 
> think I will add some other schemas also.
>
>
>
> Some points we did not discuss at last conf call:
>
> * File unlinks:
>
> - HSM object removal should be async.
Agreed, trigger should just be changelog unlink entry.
>
> - We should not link hsm object, even in v1. Could we manage hsm object
> deletion like ost object deletion and manage orphan in the same way?
Since the unlink event trigger is the changelog record, the policy
engine should simply not cancel the changelog record until the HSM
confirms the unlink.
>
> - Presently, we could also leak hsm objects if the file is dirtied when
> being copied out. In this case, the file will be tagged dirtied, with no
> copyout_begin/complete flag. So the MDT will not request for HSM removal
> but there is something to delete there. Maybe the copyout mechanisms
> should be adapted.
How about we never clear the copyout_begin bit?  This is really for the
coordinator's benefit so it knows a copyout is in progress on that file,
but since we're having regular status updates to the coordinator from
the agent, there's no real need for that bit.  So instead we have the
bit "a_file_exists_in_hsm" aka hsm_exists.
But we don't even need that - the MDT does not "request for HSM
removal", but instead the policy engine just watches the changelog for
unlink events.  Ah, now I see the problem with using the changelog -
this forces the policy engine to remember which files are on HSM, or
accept an error return code, but in any case may result in much undue
load on the HSM when deleting non-HSM'ed files.  So what do we do?
Ignore the changelog and have the MDT directly signal the coordinator to
do HSM unlinks?  That may be fine.  In that case, I think if we leak
files after we tell the coordinator to delete them it is not much of a
problem.
>
> * HSM dirty bit.
>
> - should be updated with laziness.
> - Is it possible to implement it like the lazy file size? That means,
> manage the dirty bit, per OST object, and lazily update it on the MDT?
Since file mtime/size is already updated this way, we can just use any
attr change as the dirty indicator; we don't need an actual bit per
object.  Any setattr should update the MDT dirty bit; most setxattr calls
should too (not the hardlink/parent xattr however, and maybe not the
XATTR_TRUSTED_PREFIX ones).
>
> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
> modified, can we do counter += 1 ? That way 'counter' could be used as
> a 'light' file revision. You compare two versions of this variable; if
> they differ, the file has been modified (this is not
> intended to check 'counter_c1 < counter_c2' but just
> 'counter_c1 != counter_c2', that way, you can have circular counters.)
I have no objection, although I don't see the benefit right now.  E.g.
how is that different than checking the mtime?
> - Could a policy test be based on file path (not just filename and
> properties) ? This is a rule we presently use in our hsm tools. I do
> not see how to get the filepath from the changelog data ?
The changelog data has file and parent FID; if you want more path than
this you can do a "lfs fid2path" to reconstruct the entire path name.
Note however this returns only the "first" path of a hardlinked file.
(Is this a limitation?  Do I need to fix fid2path?)
>
> - Could this flag be exposed to userspace via liblustreapi? Maybe this
> flag should be set on file creation also? Doing this, Policy Engines
> could use this flag to know easily if the file is up to date in hsm
> or not.
Sounds good.
>
> * Policy Engine
>
> - It needs to:
> . read changelogs (mdt)
> . df (mdt/client)
> . lfs df (per ost) (mdt/client)
> . scan namespace (client)
> . lfs getstripe by fid (client)
> . stat file by fid (client)
>
> The only thing the engine will lack on a client is the changelogs. Maybe
> it could be a good idea to export the changelogs on some 'trusted
> clients' ?
I think it's sticky to impose certain privileged clients, but maybe
exporting to all clients isn't so bad.  Superuser privs on any client
gives them access.  If anyone really hates this, we can add a tunable on
the MDT to allow/disallow all client access.
> If not, we will be forced to have MDT, client mount and policy engine on
> the same node or split the policy engine into two components (very bad
> idea to impose that on the engine). Potential overhead?
>
OK, client access to changelogs sounds like a reasonable requirement.
[Note: this actually happens to solve a problem I haven't figured out
yet, which is to limit access to only disk-committed changelog records.]
>
>
> ------------------------------------------------------------------------
>
#10 is "open reply", not "i/o reply", but a very nice diagram!  Can you
add these to the wiki?
> ------------------------------------------------------------------------
>

Andreas Dilger

2008-Oct-27 23:12 UTC

[Lustre-devel] HSM comments

On Oct 27, 2008  17:49 +0100, Aurelien Degremont wrote:
> Some points we did not discuss at last conf call:
>
> * File unlinks:
>
> - HSM object removal should be async.
> - We should not link hsm object, even in v1. Could we manage hsm object
> deletion like ost object deletion and manage orphan in the same way?
I wouldn't object to this - there could be an "HSM unlink" llog, similar
to the OST unlink llog that the HSM coordinator (either in the kernel,
or in userspace) processes at startup.  The difficulty is to know when
the llog record can be cancelled.
> - Presently, we could also leak hsm objects if the file is dirtied when
> being copied out. In this case, the file will be tagged dirtied, with no
> copyout_begin/complete flag. So the MDT will not request for HSM removal
> but there is something to delete there. Maybe the copyout mechanisms
> should be adapted.
I would recommend that we keep a reference to a "dirty" HSM object
even if the copyout did not complete successfully, and the HSM policy engine
should decide if the dirty object is kept or deleted.  In some cases
it may never be possible to do a complete copyout of the file, and having
some copy of the file would be better than having none at all.
> * HSM dirty bit.
>
> - should be updated with laziness.
> - Is it possible to implement it like the lazy file size? That means,
> manage the dirty bit, per OST object, and lazily update it on the MDT?
> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
> modified, can we do counter += 1 ? That way 'counter' could be used as
> a 'light' file revision. You compare two versions of this variable; if
> they differ, the file has been modified (this is not intended to check
> 'counter_c1 < counter_c2' but just 'counter_c1 != counter_c2', that way,
> you can have circular counters.)
The MDS in 1.8 (and soon 2.0) will already keep a version counter for all
changes to the MDS inode.  The OSTs will also keep version numbers for
all of the objects there.
> - Could this flag be exposed to userspace via liblustreapi? Maybe this
> flag should be set on file creation also? Doing this, Policy Engines
> could use this flag to know easily if the file is up to date in hsm
> or not.
>
> * Policy Engine
>
> - It needs to:
> . read changelogs (mdt)
> . df (mdt/client)
> . lfs df (per ost) (mdt/client)
> . scan namespace (client)
> . lfs getstripe by fid (client)
> . stat file by fid (client)
>
> The only thing the engine will lack on a client is the changelogs. Maybe
> it could be a good idea to export the changelogs on some 'trusted
> clients' ?
Yes, that is already considered.
> If not, we will be forced to have MDT, client mount and policy engine on
> the same node or split the policy engine into two components (very bad
> idea to impose that on the engine). Potential overhead?
If the policy engine is running via an external database (e.g. MySQL),
it wouldn't be impossible to just have the Changelog reader do the
database insertions remotely, after looking up the pathname.
> - Could a policy test be based on file path (not just filename and
> properties) ? This is a rule we presently use in our hsm tools. I do
> not see how to get the filepath from the changelog data (lfs fid2path?) ?
I believe the Changelog will report a full pathname (relative to the root
of the filesystem).  This will be exported via llapi to userspace.  Nathan?



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

DEGREMONT Aurelien

2008-Oct-28 15:42 UTC

[Lustre-devel] HSM comments

Nathaniel Rutman wrote:
>> - HSM object removal should be async.
> Agreed, trigger should just be changelog unlink entry.
I'm not sure Lustre needs the policy engine for managing the hsm removals.
It could trigger them automatically (like the copy-in mechanisms) when
the file is deleted in Lustre.
Lustre could still live for a long time without the PolicyEngine/Space
Manager; we could imagine this for several hours.
>> - We should not link hsm object, even in v1. Could we manage hsm object
>> deletion like ost object deletion and manage orphan in the same way?
> Since the unlink event trigger is the changelog record, the policy
> engine should simply not cancel the changelog record until the HSM
> confirms the unlink.
For the moment, the PolicyEngine has no way to know the copytool has
successfully deleted the file.
> How about we never clear the copyout_begin bit?  This is really for
> the coordinator's benefit so it knows a copyout is in progress on that
> file, but since we're having regular status updates to the coordinator
> from the agent, there's no real need for that bit.  So instead we have
> the bit "a_file_exists_in_hsm" aka hsm_exists.
> But we don't even need that - the MDT does not "request for HSM
> removal", but instead the policy engine just watches the changelog for
> unlink events.  Ah, now I see the problem with using the changelog -
> this forces the policy engine to remember which files are on HSM, or
> accept an error return code, but in any case may result in much undue
> load on the HSM when deleting non-HSM'ed files.  So what do we do?
> Ignore the changelog and have the MDT directly signal the coordinator
> to do HSM unlinks?  That may be fine.  In that case, I think if we
> leak files after we tell the coordinator to delete them it is not much
> of a problem.
If we store in a llog the hsm objects that need to be removed and only
delete them when copytool says it's fine, we will not leak files and if
coordinator crashes, the copyin and removal requests will be resent
automatically. The PolicyEngine will also re-send copy-out requests.
>> * HSM dirty bit.
>>
>> - should be updated with laziness.
>> - Is it possible to implement it like the lazy file size? That means,
>> manage the dirty bit, per OST object, and lazily update it on the MDT?
>
> Since file mtime/size is already updated this way, we can just use any
> attr change as the dirty indicator; we don't need an actual bit per
> object.
Dirty means data were changed, not metadata.
>> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
>> modified, can we do counter += 1 ? That way 'counter' could be used as
>> a 'light' file revision. You compare two versions of this variable; if
>> they differ, the file has been modified (this is not
>> intended to check 'counter_c1 < counter_c2' but just
>> 'counter_c1 != counter_c2', that way, you can have circular counters.)
> I have no objection, although I don't see the benefit right now.  E.g.
> how is that different than checking the mtime?
mtime cannot be trusted. mtime is a user-exposed value that can be
changed by the user as he likes.

$ touch -t 200101010000 foo
$ ls -l foo
-rw-r--r-- 1 degremont user 0 Jan  1  2001 foo
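
The same point can be sketched in Python. This is only an illustration under stated assumptions: RevisionCounter, its 16-bit width, and the helper names are hypothetical and not part of any Lustre API. It shows a wrap-around counter that is compared only with !=, next to a check that an ordinary user can rewrite a file and then put the old mtime back.

```python
import os
import tempfile

COUNTER_BITS = 16  # hypothetical width; the thread does not specify one
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class RevisionCounter:
    """Hypothetical per-file 'light revision' that wraps around."""

    def __init__(self, value=0):
        self.value = value & COUNTER_MASK

    def bump(self):
        # Called on each data modification; wraps to 0 after COUNTER_MASK.
        self.value = (self.value + 1) & COUNTER_MASK
        return self.value


def changed(saved, current):
    # Only inequality is meaningful: ordering breaks on wrap-around.
    return saved != current


def mtime_can_be_forged():
    """Rewrite a file's data, then restore its previous mtime."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
    try:
        os.utime(path, (978307200, 978307200))  # 2001-01-01 00:00:00 UTC
        before = os.stat(path).st_mtime
        with open(path, "w") as f:
            f.write("new data")               # the data really changed
        os.utime(path, (before, before))      # but the mtime is put back
        return os.stat(path).st_mtime == before
    finally:
        os.unlink(path)
```

One caveat of the counter approach: with N bits, the value repeats after exactly 2^N modifications, so a comparison across that many changes would miss them.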
> The changelog data has file and parent FID, if you want more path than
> this you can do a "lfs fid2path" to reconstruct the entire path name.
> Note however this returns only the "first" path of a hardlinked file.
> (Is this a limitation?  Do I need to fix fid2path?)
Ok this is fine, enough for our needs.
> #10 is "open reply", not "i/o reply", but a very nice diagram!  Can
> you add these to the wiki?
>
Thanks. Done.


Aurélien

Andreas Dilger

2008-Oct-28 18:26 UTC

[Lustre-devel] HSM comments

On Oct 28, 2008  15:41 +0100, DEGREMONT Aurelien wrote:
> Andreas Dilger wrote:
>> I would recommend that we keep a reference to a "dirty" HSM object
>> even if the copyout did not complete successfully, and the HSM policy engine
>> should decide if the dirty object is kept or deleted.  In some cases
>> it may never be possible to do a complete copyout of the file, and having
>> some copy of the file would be better than having none at all.
>   
> So we will have three states for the hsm object:
> - exist
> - completed (exist & copy was completed)
> - uptodate (exist, copy was completed and the lustre file was not dirtied)
We already need to have the distinction between "completed" and "uptodate"
because an "uptodate" HSM copy stops being uptodate as soon as the file is
modified again.

I was a bit unclear when I wrote "...a complete copyout of the file".  What
I meant was "impossible to ever complete an uptodate copyout of the file
if it is continually changing".  I don't think we need to make any distinction
between "exist" and "complete" because both mean "not uptodate".
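
A minimal Python sketch of these states, assuming hypothetical flag and method names (HsmState, copyout_begin, and so on are illustrative, not Lustre identifiers). It keeps the three flags from the quoted proposal and derives "uptodate" from them, which makes it easy to see that "exist" and "completed" both answer "not uptodate" once the file has been dirtied.

```python
from dataclasses import dataclass


@dataclass
class HsmState:
    """Hypothetical per-file HSM flags; names are illustrative only."""
    exists: bool = False     # some copy (possibly partial or stale) is in HSM
    completed: bool = False  # the latest copyout ran to completion
    dirty: bool = False      # Lustre file modified since that copyout began

    @property
    def uptodate(self):
        return self.exists and self.completed and not self.dirty

    def copyout_begin(self):
        self.exists = True
        self.completed = False
        self.dirty = False

    def copyout_complete(self):
        # A write during the copy leaves dirty set, so the copy is kept
        # but is never reported as up to date.
        self.completed = True

    def file_modified(self):
        self.dirty = True

    def needs_hsm_unlink(self):
        # Anything that exists in HSM must be removed on unlink,
        # whether or not the copyout ever completed.
        return self.exists
```

With this shape, a file dirtied during copyout still reports needs_hsm_unlink() as true, which is the leak case raised at the start of the thread.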
>>> - Also, if, instead of setting hsm_dirty bit to 1 when the file is
>>> modified, can we do counter += 1 ? That way 'counter' could be used as
>>> a 'light' file revision. You compare two versions of this variable; if
>>> they differ, the file has been modified (this is not intended to check
>>> 'counter_c1 < counter_c2' but just 'counter_c1 != counter_c2', that way,
>>> you can have circular counters.)
>>>
>> The MDS in 1.8 (and soon 2.0) will already keep a version counter for all
>> changes to the MDS inode.  The OSTs will also keep version numbers for
>> all of the objects there.
>>
> In our first (and old) HLDs, we based several mechanisms on such
> information.  But we were told that the version will be available
> for MDT but not for OST objects.
This hasn't changed - the OST object versions are local to the OSTs.
We could get this OST version information at the agent (just like with
file size being fetched from all OSTs) and store it with the HSM archive
copy.  This can be a later optimization, however.  I think mtime is
itself sufficient for an initial indication for file data changes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Nathaniel Rutman

2008-Oct-31 20:32 UTC

[Lustre-devel] HSM comments

Our conclusions from the meeting about HSM unlink, from 
http://arch.lustre.org/index.php?title=HSM_Migration


      5.4 unlink

   1. A client issues an unlink for a file to the MDT.
   2. The MDT includes the "hsm_exists" bit in the changelog unlink entry
   3. The policy engine determines if the file should be removed from HSM
   4. Policy engine sends HSMunlink FID to coordinator via MDT ioctl
         1. Yuck - we can't do direct ioctls on the MDT from a client
            node. We can only do ioctls on a file.

                maybe we need to implement a .lustre/device/XXX dir,
                where all MDT/OSTs are listed, and act as stub files for
                handling ioctls.
                or maybe policy engine talks to agent / tool directly
                for unlinks?

   5. The coordinator sends a request to one of its agents for the
      corresponding removal.
   6. The agent spawns the HSM tool to do this removal.
   7. When HSM removal is complete, policy engine cancels changelog
      unlink record
         1. How does agent/HSM tool signal to policy engine that HSM
            removal is complete?
   8. In case of agent crash, unlink record will remain uncancelled in
      the changelog; policy engine should restart processing at the
      first uncancelled record.
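
The cancel-after-confirmation logic in steps 2, 7 and 8 can be sketched as follows. This is a simulation under assumptions, not the real changelog interface: UnlinkRecord, process_unlinks and the remove_from_hsm callable are hypothetical stand-ins for the changelog entry and the coordinator/agent/tool chain.

```python
from dataclasses import dataclass


@dataclass
class UnlinkRecord:
    index: int          # position in the changelog
    fid: str
    hsm_exists: bool    # step 2: bit carried in the unlink entry
    cancelled: bool = False


def process_unlinks(changelog, remove_from_hsm):
    """Walk the records, skipping those already cancelled (step 8).

    remove_from_hsm(fid) is a hypothetical callable standing in for the
    coordinator/agent/tool chain; it returns True once removal is confirmed.
    Processing stops at the first failure so that records are only ever
    cancelled in order. Returns the index to resume from, or None.
    """
    for rec in changelog:
        if rec.cancelled:
            continue
        if rec.hsm_exists and not remove_from_hsm(rec.fid):
            return rec.index      # resume here after a restart
        rec.cancelled = True      # step 7: cancel only after confirmation
    return None
```

Because the hsm_exists bit travels with the record, unlinks of files that were never archived are cancelled without any request reaching the HSM, which addresses the load concern raised earlier in the thread.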


There are two open issues:
- How does the policy engine tell the coordinator to unlink an HSM object,
when no corresponding object exists on the MDT for us to ioctl() on?
    -which coordinator to talk to for CMD?
    -since unlink isn't data movement, maybe all unlinks can be
originated from policy engine directly?  (direct call to HSMunlinkHelper
executable)
- How does HSMunlinkHelper return a signal to the policy engine that the
removal is complete?
    -if policy engine directly calls HSMunlinkHelper this is easy...

DEGREMONT Aurelien wrote:
>
>>> * HSM dirty bit.
>>>
>>> - should be updated with laziness.
>>> - Is it possible to implement it like the lazy file size? That means,
>>> manage the dirty bit, per OST object, and lazily update it on the MDT?
>>>
>> Since file mtime/size is already updated this way, we can just use any
>> attr change as the dirty indicator; we don't need an actual bit per
>> object.
>>
> Dirty means data were changed, not metadata.
>
Actually a file is dirty if either changed, depending on what you are
storing in HSM:
filename / path?
EAs?
My point was that you can use the mtime attr change as an indicator that
some data possibly changed.  It is not sufficient to show that it absolutely
has changed, so the policy engine could do something else to try to verify the
change, or simply assume that the file is changed, mark it dirty, and
reschedule for copyout -- no real harm done.
Yes, I agree it would be ideal to have a true verifiable "this file has
changed" versioning, but since that doesn't exist yet, I don't think we need
to hold up HSM development for it.
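
As a rough illustration of this conservative check, here is a minimal Python sketch. The function name and the idea of passing in a previously recorded mtime are assumptions; where the policy engine would store that value is left open.

```python
import os


def needs_copyout(path, archived_mtime):
    """Conservative dirty check: any mtime difference reschedules a copyout.

    archived_mtime is the st_mtime recorded when the file was last archived.
    A differing mtime does not prove the data changed; treating the file as
    dirty only costs an extra copyout.
    """
    return os.stat(path).st_mtime != archived_mtime
```

The check errs toward extra copyouts. It does not catch the case raised earlier in the thread, where a user rewrites the data and then restores the old mtime.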

DEGREMONT Aurelien

2008-Nov-03 15:30 UTC

[Lustre-devel] HSM comments

Nathaniel Rutman wrote:
>      5.4 unlink
>
>   1. A client issues an unlink for a file to the MDT.
>   2. The MDT includes the "hsm_exists" bit in the changelog unlink entry
>   3. The policy engine determines if the file should be removed from HSM
>   4. Policy engine sends HSMunlink FID to coordinator via MDT ioctl
>         1. Yuck - we can't do direct ioctls on the MDT from a client
>            node. We can only do ioctls on a file.
>
>                maybe we need to implement a .lustre/device/XXX dir,
>                where all MDT/OSTs are listed, and act as stub files for
>                handling ioctls.
>                or maybe policy engine talks to agent / tool directly
>                for unlinks?
Can't we add an ioctl on /dev/obd or /mnt/lustre root dir ? or even on
.lustre/fid and passing the fid in ioctl args?
I'm not fond of .lustre/device/XXX dirs...
>   5. The coordinator sends a request to one of its agents for the
>      corresponding removal.
>   6. The agent spawns the HSM tool to do this removal.
>   7. When HSM removal is complete, policy engine cancels changelog
>      unlink record
>         1. How does agent/HSM tool signal to policy engine that HSM
>            removal is complete?
>   8. In case of agent crash, unlink record will remain uncancelled in
>      the changelog; policy engine should restart processing at the
>      first uncancelled record.
>
>
> There are two open issues:
> - How does the policy engine tell the coordinator to unlink an HSM object,
> when no corresponding object exists on the MDT for us to ioctl() on?
>    -which coordinator to talk to for CMD?
If we implement an ioctl like
ioctl(.lustre/fid, HSMUnlink, fid=0x0000121561), can the API find the
right MDT from the FID? (Can the FLD do this for an already removed file?)
> - How does HSMunlinkHelper return a signal to the policy engine that
> the removal is complete?
>    -if policy engine directly calls HSMunlinkHelper this is easy...
I think there is a more general issue concerning feedback for the
PolicyEngine. Surely the PolicyEngine will need information for other
requests it sent to the Coordinator. We should think of a more general
mechanism to inform it of the success or failure of its requests.

Should the (successful) HSM events become changelog
events (hsm_copyin/hsm_copyout/hsm_remove)? Could another program be
interested in such events?


Aurélien
