Hello,

Here is a first draft for comments of the Lustre HSM HLD. It is intended to be a support for further analysis and comments from CFS/Sun.

The document covers the main parts of the HSM features, but some elements are still lacking. The policy management and the space manager will be described later.

Let us know your comments and ideas about it.

Regards,
Aurelien Degremont
CEA

[Attachment: hld_hsm.pdf (application/pdf, 159329 bytes) - http://lists.lustre.org/pipermail/lustre-devel/attachments/20080207/dd334e91/attachment-0004.pdf]
All,

I'm new to this list, so I'll start with apologies. My Lustre background is also limited; a situation I hope to fix. As part of the Solaris Software Archiving group, I was asked to review the HSM HLD by my management. That review was sent to Peter Bojanic. He suggested I get involved in the community discussion. This is a posting of my original response, based on a copy of the HLD which seems to be the one posted. I've made a couple of minor corrections.

Page 1, 1, Define coordinator (space coordinator?), define agent, (condense Part II intro, page 14) (for me, MDT, MGS and OST)
Page 8, 3.8, "use" not "used" in second sentence
Page 9, 3.8.2 et al., "precised" (maybe, explicit or precise)
Page 9, 3.8.4, Lustre ID "if" no path
Page 10, 4.1, 1) When archived? (probably in Space Manager portion) SAM-QFS archives well ahead of space need.
         4) External object reference must be unusable, until 5.
         4.2, 2) Implies only one copy per "version"...bad idea
Page 12, 5.3, Last Sentence, This enables, not This ables
         6.1, 100,000 migrations make current migration list operations problematic (let's say we want to move the last migration to be the next migration).
Page 13, Lustre object mtime may not be good enough. There are several mechanisms (like touch) to manipulate mtime, which makes it unusable as a last-written time.
Page 15, a variant on 1.5, ask for/return last valid byte offset (perhaps within a range).
Page 19, Special Path, does this boil down to invisible I/O?
Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed in any order.
Page 25, 1, Punch - becomes "sparse" not "spare". I think this spec needs to be more consistent with its use of data range. It is confusing as laid out.
Page 26, 3.2, space will be exhausted, or space will be low, not space will be missing.
Page 28, protection of Lustre extended attributes?

Issues:
The Space Manager is likely the most important piece. There is no detail on it. This is where archive and other policy is enforced.
The described HSM seems to follow the "copy out when space needed, then purge" model. This function (a Space Manager function) is contrary to SAM, and a shortfall of many HSMs.
File/object association is an important component of SAM. For example, if I access a file in a source tree, I'm likely to access the others as well.
The purge (3.2, Space Manager needs to make room) and 4.1 "needs to be atomic" are complex operations. Sequencing is important.
Coordination between agents seems important. For example, if agents requested new copy-outs on objects striped on 10 different stores, ordering them on tape seems difficult.
What is the backup story for Lustre? How does that play with the HSM?

--
---------------------------------------------------------------------
Rick Matthews                       email: Rick.Matthews at sun.com
Sun Microsystems, Inc.              phone: +1 (651) 554-1518
1270 Eagan Industrial Road          phone (internal): 54418
Suite 160                           fax:   +1 (651) 554-1540
Eagan, MN 55121-1231 USA            main:  +1 (651) 554-1500
---------------------------------------------------------------------
Hello,

Thank you for your review. I add some comments in the following.

> Page 1, 1, Define coordinator (space coordinator?), define agent,
> (condense Part II intro, page 14) (for me, MDT, MGS and OST)

These are defined in the arch wiki pages.

> Page 10, 4.2, 2) Implies only one copy per "version"...bad idea

Different versions correspond to different files in the external storage. We take the most recent. Not sure I understand your remark.

> Page 13, Lustre object mtime may not be good enough. There are several
> mechanisms (like touch) to manipulate mtime, which makes it
> unusable as a last-written time.

If a user makes a touch in the past, this changes the mtime and can hide previous writes. If we want to keep the real write time we need to add a new time field in the Lustre backend (maybe ZFS has it).

> Page 19, Special Path, does this boil down to invisible I/O?

The path is /mnt_mount/.lustre/fid/FID_NUMBER. When a file is opened through this path, a flag is carried to the OSS to avoid triggering a copy-in (this is used to fill the file).

> Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
> in any order.

Yes.

> Issues:
> The Space Manager is likely the most important piece. There is no
> detail on it. This is where archive and other policy is enforced.

The space manager is based on the changelogs/feed Lustre feature, which is very new (a draft HLD has just been published). This is why it is not described at this time.

> The described HSM seems to follow the "copy out when space needed,
> then purge" model. This function (a Space Manager function) is contrary
> to SAM, and a shortfall of many HSMs.

No, the space manager is doing pre-migration, and when free space is needed, it only has to make punches.

> Coordination between agents seems important. For example, if agents
> requested new copy-outs on objects striped on 10 different stores,
> ordering them on tape seems difficult.

Tape access optimization has to be made by the archival system. We try to put as little external storage knowledge as possible in Lustre, to be external-storage independent.

> What is the backup story for Lustre? How does that play with the HSM?

The HSM does not back up the namespace. That has to be done with a separate tool like an MDT scanner. The copy tool can use the FID2PATH() function to save the object pathname with the file.
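To illustrate the special path described above, here is a minimal sketch (in C) of a copy tool filling a purged file through /mnt_mount/.lustre/fid/FID_NUMBER. The O_HSM_RESTORE flag name and value are assumptions: the message only says that "a flag is carried to the OSS", without naming it.

    /* Hypothetical sketch of a copy tool restoring a file through the
     * special fid path. O_HSM_RESTORE is an assumption; the HLD only
     * states that a flag must be carried to the OSS so that the write
     * does not itself trigger a copy-in. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef O_HSM_RESTORE
    #define O_HSM_RESTORE 0 /* placeholder: real flag not defined in the HLD */
    #endif

    int restore_by_fid(const char *mnt, const char *fid, int src_fd)
    {
        char path[4096];
        char buf[64 * 1024];
        ssize_t n;

        snprintf(path, sizeof(path), "%s/.lustre/fid/%s", mnt, fid);
        int dst_fd = open(path, O_WRONLY | O_HSM_RESTORE);
        if (dst_fd < 0)
            return -1;

        /* Copy the archived data back into the Lustre file. */
        while ((n = read(src_fd, buf, sizeof(buf))) > 0) {
            if (write(dst_fd, buf, n) != n) {
                close(dst_fd);
                return -1;
            }
        }
        close(dst_fd);
        return n < 0 ? -1 : 0;
    }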
JC.LAFOUCRIERE at CEA.FR wrote:

Thanks for allowing me to participate.

> Hello
>
> Thank you for your review. I add some comments in the following.
>
> Page 1, 1, Define coordinator (space coordinator?),
>          define agent, (condense Part II intro, page 14)
>          (for me, MDT, MGS and OST)
> These are defined in the arch wiki pages.

Thank you, I still haven't got to them yet...but plan to.

> Page 10,
> 4.2, 2) Implies only one copy per "version"...bad idea
> Different versions correspond to different files in the external storage.
> We take the most recent. Not sure I understand your remark.

A basic mantra of SAM-QFS and other data retention systems is that one image of the data is vulnerable (a tape breaks, or is otherwise overwritten). While the archival system can be responsible for making multiple identical images, it can still represent a single point of failure. Note: I am using "version" to represent a point-in-time image of the file's data, and "copy" to represent an image of that version. (See LOCKSS for additional references on copies.)

> Page 13, Lustre object mtime may not be good enough. There are several
>          mechanisms (like touch) to manipulate mtime, which makes it
>          unusable as a last-written time.
> If a user makes a touch in the past, this changes the mtime and can hide
> previous writes. If we want to keep the real write time we need to add a
> new time field in the Lustre backend (maybe ZFS has it).

What the archival system needs to know is that the previously made copy is stale (or that a first copy needs to be made), which seems to be triggered by a user (not archive or other - like restore) write operation.

> Page 19, Special Path, does this boil down to invisible I/O?
> The path is /mnt_mount/.lustre/fid/FID_NUMBER. When a file is opened
> through this path, a flag is carried to the OSS to avoid triggering a
> copy-in (this is used to fill the file).
>
> Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
>          in any order.
> Yes.
>
> Issues:
> The Space Manager is likely the most important piece. There is no
> detail on it. This is where archive and other policy is enforced.
> The space manager is based on the changelogs/feed Lustre feature, which
> is very new (a draft HLD has just been published). This is why it is not
> described at this time.

OK...also consider using change logs as a trigger for the need of a new archive version (not copy). Alleviates the mtime issue above.

> The described HSM seems to follow the "copy out when space needed,
> then purge" model. This function (a Space Manager function) is contrary
> to SAM, and a shortfall of many HSMs.
> No, the space manager is doing pre-migration, and when free space is
> needed, it only has to make punches.

OK, so who schedules the pre-migration to the archive system?

> Coordination between agents seems important. For example,
>          if agents requested new copy-outs on objects striped on
>          10 different stores, ordering them on tape seems difficult.
> Tape access optimization has to be made by the archival system. We try
> to put as little external storage knowledge as possible in Lustre, to be
> external-storage independent.

The isolation between archive system and file system is (to me) a good idea. I'd just like you to consider that the recall (stage-in) events can be optimized. At least, make sure the archive system is allowed to reorder as needed (hence the async - list of tuples in any order - question above). Think of other associations between files and live storage as 1) a pre-stage operation, or 2) a disk cache pre-fetch operation.
I hope I'm using understandable words ;>)

> What is the backup story for Lustre? How does that play with the HSM?
> The HSM does not back up the namespace. That has to be done with a
> separate tool like an MDT scanner. The copy tool can use the FID2PATH()
> function to save the object pathname with the file.

One point here is that an HSM + namespace/metadata backup + unarchived data capture can be used as a nearly continuous backup operation with a relatively tiny backup window.
Hello,

First of all, thanks for your remarks.

Information explained in the architecture documents from the Arch Wiki has not been re-explained in the HLD, so some points could be unclear, but please read or check the arch docs first. If the HLD must be self-sufficient or more details are really needed, let me know. I will clarify some points anyway in the new document version.

Rick Matthews wrote:
> Page 10, 4.1, 1) When archived? (probably in Space Manager portion)
>          SAM-QFS archives well ahead of space need.

Concerning the archived copies' vulnerability, I'm not sure it is Lustre's responsibility to manage several copies of each of its file versions in the HSM...

> 6.1, 100,000 migrations make current migration list operations
>      problematic (let's say we want to move the last migration to
>      be the next migration).

You speak about pending migrations? This is just pointer manipulation; I do not see a real problem at this level. This value is only an algorithmic indication, not about resources (memory, ...). But we could decrease this value to 10,000.

> Page 13, Lustre object mtime may not be good enough. There are several
>          mechanisms (like touch) to manipulate mtime, which makes it
>          unusable as a last-written time.

In fact, this value is only needed for user information, not for Lustre internals. Lustre will base its comparison on the FID version. The mtime field is used for listing the file copies in the HSM; as the Lustre FID version is not relevant for the user, it will indicate the associated file date at that time. (Just a quick example, not the final output:)

user$ list_hsm_copies ./foo
Storage   Date         Size     Version
========================================
HSM1      Feb  2 2006  1566162  1
HSM1      Jun 18 2007  1423540  2
HSM1      Jun 18 2007  1900051  54

But the touch could be problematic. Lustre gurus, is there another time field we could use instead? Should we add a "last-modification-field-which-ignores-touch"? Is it really a problem if we display a "touched" time? In that case, we display what the user set on the file; we suppose he did it on purpose.

> Page 15, a variant on 1.5, ask for/return last valid byte offset
>          (perhaps within a range).

Why not... But do you have use cases where the current "Data available" feature as explained in 1.5 is not sufficient?

> Page 28, protection of Lustre extended attributes?

I do not see what you mean.

> Issues:
> The purge (3.2, Space Manager needs to make room) and 4.1
> "needs to be atomic" are complex operations. Sequencing is
> important.

Does "transactional" fit?

I will add a Bugzilla entry and a new updated version of the HLD next Monday.

Regards,

--
Aurelien Degremont
CEA/DAM - DIF/DSSI/SISR
DEGREMONT Aurelien wrote:
> Hello
>
> Here is a first draft for comments of the Lustre HSM HLD.
> It is intended to be a support for further analysis and comments from
> CFS/Sun.
>
> The document covers the main parts of the HSM features but some elements
> are still lacking.
> The policy management and the space manager will be described later.
>
> Let us know your comments and ideas about it.
>
> Regards,

5.1 external storage list - is this to be stored on the MGS device or a separate device? If the coordinator lives on the MGS, why not its storage as well? In any case, it should be possible to co-locate the coordinator on the MGS and use the MGS's storage device, in the same way that the MGS can currently co-locate with the MDT.

6.3 object ref should include version number. Also include checksum?

How does the coordinator request activity from an agent? If the coordinator is the RPC server, then it's up to the agents to make requests; agents aren't listening for RPC requests themselves.

2.1 Archiving one Lustre file
There should not be a cache miss when archiving a Lustre file; perhaps open-by-fid is intended to bypass atime updates so that the file isn't marked as "recently accessed"?

2.2 Restoring a file
"External ID" presumably contains all information required to retrieve the file - tape #, path name, etc.? Once the file is copied back, we should probably restore the original ctime, mtime, atime - the coordinator is storing this, correct?

IV2 - why not multiple purged windows? Seems like if you're going to purge 1 object out of a file, you might want to purge more. Specifically, it will probably be a common case to purge every object of a file from a particular OST. This is not contiguous in a striped file. I don't see any reason to purge anything smaller than an entire object on an OST - is there good reason for this? If that's the case, then the OST must keep track of purged objects, not ranges within an existing object.

If the MDT is tracking purged areas also, then there's a good potential synergy here with a missing OST -- if the missing OST's objects are marked as purged, then we can potentially recover them automatically from HSM...

4.2 How is a purge request recovered? For example, MDT says purge obj1 from ost1, ost1 replies "ok", but then dies before it actually does the purge. It reboots, doesn't know anything about the purge request now, but the MDT has marked it as purged.

Transparent access - should this avoid modification of atime/mtime?

V2.1 How long does the OST wait for completion? Is there a timeout? We probably need a "no timeout if progress is being made" kind of function - clients currently do this kind of thing with OSTs.

V2.2 No need to copy-in purged data on full-object-size writes.

> Page 13, Lustre object mtime may not be good enough. There are several
> mechanisms (like touch) to manipulate mtime, which makes it
> unusable as a last-written time.
> If a user makes a touch in the past, this changes the mtime and can hide
> previous writes. If we want to keep the real write time we need to add a
> new time field in the Lustre backend (maybe ZFS has it).

If a user touches or otherwise modifies the mtime on purpose, they presumably know what they are doing. Besides, we're using the object version number, not the mtime, to determine whether a file is up to date. I think this can be ignored.
Nathaniel Rutman wrote:
> 5.1 external storage list - is this to be stored on the MGS device or a
> separate device? If the coordinator lives on the MGS, why not its
> storage as well? In any case, it should be possible to co-locate the
> coordinator on the MGS and use the MGS's storage device, in the same
> way that the MGS can currently co-locate with the MDT.
> How does the coordinator request activity from an agent? If the
> coordinator is the RPC server, then it's up to the agents to make
> requests; agents aren't listening for RPC requests themselves.

Presently, it is never said that the coordinator will live on the MGS. The Coordinator constraints are:
1 - Must receive various migration requests from OST/MDT.
2 - Should be able to communicate with Agents and ask them for migrations.
3 - Should store configuration and migration logs.

I think #1 and #2 are two different APIs. The coordinator is clearly an RPC server for the first one. How #2 should be implemented is not so clear. What would be the "Lustre way" here?

For #3, the few logs that will be backed up here are not huge, and they surely could be co-located with another Target, but I'm not sure this should be mandatory. This device should be available to several servers, for failover like the other Targets. We could imagine having more than one coordinator in the long term. I'm not sure it is a good idea to stick it to another target.

> 6.3 object ref should include version number. Also include checksum?

For data coherency? Should we add an explicit checksum for those values (stored in an EA) or use a possible backend feature (can ZFS and ldiskfs detect EA value corruption by themselves?)?

> 2.1 Archiving one Lustre file
> There should not be a cache miss when archiving a Lustre file; perhaps
> open-by-fid is intended to bypass atime updates
> so that the file isn't marked as "recently accessed"?
>
> Transparent access - should this avoid modification of atime/mtime?

I would say yes.

> 2.2 Restoring a file
> "External ID" presumably contains all information required to retrieve
> the file - tape #, path name, etc.?
> Once the file is copied back, we should probably restore the original
> ctime, mtime, atime - the coordinator is storing this, correct?

External ID is an opaque value managed by the archiving tool. If the HSM can store a lot of metadata, only a ref is needed; if not, the tool is responsible for storing all the data it needs. Anyway, this is totally opaque to Lustre. I hope the HSMs will not need much data in this field. HPSS does not need much data; it uses its internal DB to store it. I suppose SAM does also.

> IV2 - why not multiple purged windows? Seems like if you're going to
> purge 1 object out of a file, you might want to purge more.
> Specifically, it will probably be a common case to purge every object of
> a file from a particular OST. This is not contiguous in a
> striped file.
> I don't see any reason to purge anything smaller than an entire object
> on an OST - is there good reason for this?

Multiple purged windows are subtle. If you permit this feature, you could technically have, in the worst case, one purged window per byte, and this could be very huge to store. Do you think you will make several holes in the same file? In which cases?
In fact, the more common case is to totally purge a file which has been migrated to the HSM, and it is only an optimisation to keep the start and the end of the file on disk, to avoid triggering tons of cache misses with commands like "file foo/*" or a tool like Nautilus or Windows Explorer browsing the directory.

The purged window is stored per object, OST object and MDT object. So, if several objects are purged, each object will store its own purged window. But the MDT object describing this file will store a special purged window which starts at the smallest unavailable byte and ends at the first available one. The MDT purged window indicates "if you do I/O in this range, you're not sure the data are there" or "outside of this area, I guarantee data are present". Maintaining multiple purged windows would be a headache, with no real need I think. Moreover, people have asked for an OST-object-based migration, even if I think whole-file migration will be the most common case.

> If that's the case, then
> the OST must keep track of purged objects, not ranges within an existing
> object.

Objects are not removed, only their data. All metadata are kept.

> If the MDT is tracking purged areas also, then there's a good potential
> synergy here with a missing OST --
> if the missing OST's objects are marked as purged, then we can
> potentially recover them automatically from HSM...

What do you call a "missing OST"? A corrupt one? An offline one? Unavailable? Where will you copy back the object data? To another OST object? With the purged window on each OST object and the MDT, and the file striping info, we could easily restore the missing parts.

> 4.2 How is a purge request recovered? For example, MDT says purge obj1
> from ost1, ost1 replies "ok", but then dies before it actually
> does the purge. It reboots, doesn't know anything about the purge request
> now, but the MDT has marked it as purged.

The OST asynchronously acknowledges the purge when it is done. The MDT marks it purged only when it is really done. I will clarify this.

> V2.1 How long does the OST wait for completion?
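To make the purged-window discussion concrete, here is a minimal sketch of the single [start, end) record kept per object (OST stripe object or MDT file object) as described above. The struct and function names are assumptions, not the HLD's actual on-disk format.

    /* Hypothetical sketch of the single purged window kept per object.
     * Names are assumptions; the HLD only specifies one unavailable
     * range per OST object, plus a covering range on the MDT object. */
    #include <stdbool.h>
    #include <stdint.h>

    struct hsm_purged_window {
        uint64_t pw_start; /* first unavailable byte (object offset) */
        uint64_t pw_end;   /* first available byte after the window  */
    };

    /* "If you do I/O in this range, you're not sure the data are there." */
    static bool io_may_hit_purged(const struct hsm_purged_window *pw,
                                  uint64_t off, uint64_t len)
    {
        return off < pw->pw_end && off + len > pw->pw_start;
    }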
On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> But the touch could be problematic. Lustre gurus, is there another time
> field we could use instead? Should we add a
> "last-modification-field-which-ignores-touch"? Is it really a problem
> if we display a "touched" time? In that case, we display what the
> user set on the file; we suppose he did it on purpose.

There was work done in ext4/ldiskfs to add a 64-bit "version" field to the on-disk inode, for use by Lustre and NFSv4. In the ldiskfs case Lustre was free to store any information in this field it wanted. The planned use for this field is for "version based recovery" and it has the semantic that it is an increasing (though not necessarily sequential) version number that tracks any change to the file. This is stored in each inode on the MDT and each object on the OSTs.

In ZFS I believe there is also a "last modified transaction group" (txg) number stored with each dnode that could be used in a similar manner.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
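As an illustration of the "increasing, though not necessarily sequential" semantic Andreas describes, here is a hedged sketch of how a backend might bump such a version on every modification. The update rule and all names are assumptions for illustration, not the actual ldiskfs or ZFS behaviour.

    /* Hypothetical sketch of a per-inode 64-bit change version: every
     * modification moves it forward, typically to the current
     * transaction ID, so comparing versions tells you whether an
     * object changed. Names and update rule are assumptions. */
    #include <stdint.h>

    struct disk_inode {
        uint64_t i_version; /* 64-bit change counter stored on disk */
    };

    static void inode_modified(struct disk_inode *inode, uint64_t current_txid)
    {
        /* Stays monotonic even if several updates land in one transaction. */
        if (current_txid > inode->i_version)
            inode->i_version = current_txid;
        else
            inode->i_version++;
    }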
Versions are critical - we need them for multiple things; let's make sure we get exactly the right thing in ZFS also.

- Peter -

Andreas Dilger wrote:
> On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
>> But the touch could be problematic. Lustre gurus, is there another time
>> field we could use instead? Should we add a
>> "last-modification-field-which-ignores-touch"? Is it really a problem
>> if we display a "touched" time? In that case, we display what the
>> user set on the file; we suppose he did it on purpose.
>
> There was work done in ext4/ldiskfs to add a 64-bit "version" field to
> the on-disk inode, for use by Lustre and NFSv4. In the ldiskfs case
> Lustre was free to store any information in this field it wanted. The
> planned use for this field is for "version based recovery" and it has
> the semantic that it is an increasing (though not necessarily sequential)
> version number that tracks any change to the file. This is stored in
> each inode on the MDT and each object on the OSTs.
>
> In ZFS I believe there is also a "last modified transaction group" (txg)
> number stored with each dnode that could be used in a similar manner.
>
> Cheers, Andreas
Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>> 5.1 external storage list - is this to be stored on the MGS device or a
>> separate device? If the coordinator lives on the MGS, why not its
>> storage as well? In any case, it should be possible to co-locate the
>> coordinator on the MGS and use the MGS's storage device, in the same
>> way that the MGS can currently co-locate with the MDT.
>> How does the coordinator request activity from an agent? If the
>> coordinator is the RPC server, then it's up to the agents to make
>> requests; agents aren't listening for RPC requests themselves.
>
> Presently, it is never said that the coordinator will live on the MGS.
> The Coordinator constraints are:
> 1 - Must receive various migration requests from OST/MDT.
> 2 - Should be able to communicate with Agents and ask them for migrations.
> 3 - Should store configuration and migration logs.
>
> I think #1 and #2 are two different APIs. The coordinator is clearly an
> RPC server for the first one. How #2 should be implemented is not so
> clear. What would be the "Lustre way" here?

With userspace servers, presumably we have some way of passing LNET messages from kernel to userspace. We should probably still go through LNET for #2 in order to use the broadest range of network fabrics. So it could be the same or a similar RPC. There is no "Lustre way" for this area - we've never done this kind of thing before.

> For #3, the few logs that will be backed up here are not huge, and they
> surely could be co-located with another Target, but I'm not sure this
> should be mandatory. This device should be available to several servers,
> for failover like the other Targets. We could imagine having more than
> one coordinator in the long term. I'm not sure it is a good idea to
> stick it to another target.

Not mandatory, but possible is nice. Minimize the number of required partitions.

>> 6.3 object ref should include version number. Also include checksum?
>
> For data coherency? Should we add an explicit checksum for those values
> (stored in an EA) or use a possible backend feature (can ZFS and
> ldiskfs detect EA value corruption by themselves?)?

ZFS can, ldiskfs cannot. Anyhow, it was just a thought. Doesn't hurt to allow space for it.

>> 2.1 Archiving one Lustre file
>> There should not be a cache miss when archiving a Lustre file; perhaps
>> open-by-fid is intended to bypass atime updates
>> so that the file isn't marked as "recently accessed"?
>>
>> Transparent access - should this avoid modification of atime/mtime?
>
> I would say yes.
>
>> 2.2 Restoring a file
>> "External ID" presumably contains all information required to retrieve
>> the file - tape #, path name, etc.?
>> Once the file is copied back, we should probably restore the original
>> ctime, mtime, atime - the coordinator is storing this, correct?
>
> External ID is an opaque value managed by the archiving tool. If the HSM
> can store a lot of metadata, only a ref is needed; if not, the tool is
> responsible for storing all the data it needs. Anyway, this is totally
> opaque to Lustre.
> I hope the HSMs will not need much data in this field. HPSS does not
> need much data; it uses its internal DB to store it. I suppose SAM
> does also.

What about restoring the original ctime, mtime, atime? I think we must store it in the coordinator because we must work with all HSMs, and I think it is important to restore it.

>> IV2 - why not multiple purged windows?
>> Seems like if you're going to
>> purge 1 object out of a file, you might want to purge more.
>> Specifically, it will probably be a common case to purge every object of
>> a file from a particular OST. This is not contiguous in a
>> striped file.
>> I don't see any reason to purge anything smaller than an entire object
>> on an OST - is there good reason for this?
>
> Multiple purged windows are subtle. If you permit this feature, you could
> technically have, in the worst case, one purged window per byte, and
> this could be very huge to store. Do you think you will make several
> holes in the same file? In which cases?

Like I said, I don't see any reason to purge anything smaller than a full object; I would in fact disallow purging of an arbitrary byte range, and only allow purging on full-object boundaries.

> In fact, the more common case is to totally purge a file which has been
> migrated to the HSM, and it is only an optimisation to keep the start and
> the end of the file on disk, to avoid triggering tons of cache misses
> with commands like "file foo/*" or a tool like Nautilus or Windows
> Explorer browsing the directory.

Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't think it helps much to keep less than that in the beginning/end objects, so I would say just keep the first and last blocks instead.

> The purged window is stored per object, OST object and MDT object.
> So, if several objects are purged, each object will store its own purged
> window. But the MDT object describing this file will store a special
> purged window which starts at the smallest unavailable byte and ends at
> the first available one. The MDT purged window indicates "if you do I/O
> in this range, you're not sure the data are there" or "outside of this
> area, I guarantee data are present".
> Maintaining multiple purged windows would be a headache, with no real
> need I think.
> Moreover, people have asked for an OST-object-based migration, even if I
> think whole-file migration will be the most common case.
>
>> If that's the case, then
>> the OST must keep track of purged objects, not ranges within an existing
>> object.
>
> Objects are not removed, only their data. All metadata are kept.
>
>> If the MDT is tracking purged areas also, then there's a good potential
>> synergy here with a missing OST --
>> if the missing OST's objects are marked as purged, then we can
>> potentially recover them automatically from HSM...
>
> What do you call a "missing OST"? A corrupt one? An offline one?
> Unavailable?

Yes. All of the above. Obviously we need to distinguish between "permanently gone" and "temporarily gone".

> Where will you copy back the object data? To another OST object?

Yes. Some kind of recovery will take place to generate a new object on a different OST and we can restore the data there.

> With the purged window on each OST object and the MDT, and the file
> striping info, we could easily restore the missing parts.

Exactly. This is why I say we should think about this now, to allow for this possibility.

>> 4.2 How is a purge request recovered? For example, MDT says purge obj1
>> from ost1, ost1 replies "ok", but then dies before it actually
>> does the purge. It reboots, doesn't know anything about the purge
>> request now, but the MDT has marked it as purged.
>
> The OST asynchronously acknowledges the purge when it is done. The MDT
> marks it purged only when it is really done. I will clarify this.
>
>> V2.1 How long does the OST wait for completion?
>> Is there a timeout? We
>> probably need a "no timeout if progress is being
>> made" kind of function - clients currently do this kind of thing with
>> OSTs.
>
> I'm sure Lustre already has similar mechanisms for optimized timeouts in
> this kind of situation that we could reuse here.
> What you describe is a good approach I think.
>
>> V2.2 No need to copy-in purged data on full-object-size writes.
>
> True. We could add such an optimization. But this is only useful for
> small files or very widely striped ones, isn't it?

No, we very frequently write entire stripes (objects). Lustre clients can optimize for this.

> Thanks for your comments.
Hi,

On Seg, 2008-02-11 at 11:18 -0700, Andreas Dilger wrote:
> On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> > But the touch could be problematic. Lustre gurus, is there another time
> > field we could use instead? Should we add a
> > "last-modification-field-which-ignores-touch"? Is it really a problem
> > if we display a "touched" time? In that case, we display what the
> > user set on the file; we suppose he did it on purpose.
>
> (snip)
>
> In ZFS I believe there is also a "last modified transaction group" (txg)
> number stored with each dnode that could be used in a similar manner.

Hmm... I think ZFS only has zp_gen in the dnode/znode, which is the txg of the file creation. We also cannot use the txg birth time of the block where the dnode is stored, because a metadnode block holds several dnodes.

I may be missing something here, but isn't "ctime" the appropriate value to use here?

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
On Feb 11, 2008 21:11 +0000, Ricardo Correia wrote:
> On Seg, 2008-02-11 at 11:18 -0700, Andreas Dilger wrote:
> > On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> > > But the touch could be problematic. Lustre gurus, is there another
> > > time field we could use instead? Should we add a
> > > "last-modification-field-which-ignores-touch"? Is it really a problem
> > > if we display a "touched" time? In that case, we display what the
> > > user set on the file; we suppose he did it on purpose.
> >
> > (snip)
> >
> > In ZFS I believe there is also a "last modified transaction group" (txg)
> > number stored with each dnode that could be used in a similar manner.
>
> Hmm... I think ZFS only has zp_gen in the dnode/znode, which is the txg
> of the file creation. We also cannot use the txg birth time of the block
> where the dnode is stored, because a metadnode block holds several
> dnodes.
>
> I may be missing something here, but isn't "ctime" the appropriate
> value to use here?

The problem with ctime (on Linux as well) is that it is possible for the system clock to go backward, whether due to ntp, or because the hardware clock is incorrect/reset, so it cannot be depended upon to be monotonically increasing for the life of the Lustre filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
> The problem with ctime (on Linux as well) is that it is possible for the
> system clock to go backward, whether due to ntp, or because the hardware
> clock is incorrect/reset, so it cannot be depended upon to be
> monotonically increasing for the life of the Lustre filesystem.

Ok. In that case, we could either add a new 64-bit version field to the dnode (or znode) similar to the one in ldiskfs, or we could look at the birth time (txg nr) of all the block pointers in the dnode.

Using txg numbers might not be very useful if an object is migrated from one storage device to another, but I have not read the HSM HLD so I'm not sure if this is a problem or not.

Cheers,
Ricardo
Ricardo M. Correia wrote:
> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>> The problem with ctime (on Linux as well) is that it is possible for the
>> system clock to go backward, whether due to ntp, or because the hardware
>> clock is incorrect/reset, so it cannot be depended upon to be
>> monotonically increasing for the life of the Lustre filesystem.
>
> Ok. In that case, we could either add a new 64-bit version field to
> the dnode (or znode) similar to the one in ldiskfs, or we could look
> at the birth time (txg nr) of all the block pointers in the dnode.
> Using txg numbers might not be very useful if an object is migrated
> from one storage device to another, but I have not read the HSM HLD so
> I'm not sure if this is a problem or not.

I'm missing the point of this discussion. Clearly we shouldn't/can't use ctime/mtime for anything internal to Lustre; that is what object versions are all about. Why are we talking about adding new fields or anything else?
I'm probably responsible for opening this can of worms. I inferred from the HSM HLD that mtime was proposed to be used for state change, or version, of the file/object. As the discussion bears out, mtime for this purpose would be a bad idea. A reliable way of detecting change is needed, and if it already exists within Lustre, great!

What I think is far more significant is the involvement of the community on issues such as this. The more folks examining (and critiquing) the details, the better. Nice to see such an active community.

Nathaniel Rutman wrote:
> Ricardo M. Correia wrote:
>> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>>> The problem with ctime (on Linux as well) is that it is possible for
>>> the system clock to go backward, whether due to ntp, or because the
>>> hardware clock is incorrect/reset, so it cannot be depended upon to be
>>> monotonically increasing for the life of the Lustre filesystem.
>>
>> Ok. In that case, we could either add a new 64-bit version field to
>> the dnode (or znode) similar to the one in ldiskfs, or we could look
>> at the birth time (txg nr) of all the block pointers in the dnode.
>> Using txg numbers might not be very useful if an object is migrated
>> from one storage device to another, but I have not read the HSM HLD
>> so I'm not sure if this is a problem or not.
>
> I'm missing the point of this discussion. Clearly we shouldn't/can't
> use ctime/mtime for anything internal to Lustre; that is what object
> versions are all about. Why are we talking about adding new fields or
> anything else?
On Seg, 2008-02-11 at 14:32 -0800, Nathaniel Rutman wrote:
> I'm missing the point of this discussion. Clearly we shouldn't/can't
> use ctime/mtime for anything internal to Lustre; that is what object
> versions are all about. Why are we talking about adding new fields or
> anything else?

If by object versions you are referring to the version field in the ldiskfs inodes that Andreas mentioned, then we need to add a similar field/attribute in ZFS. It seems that Andreas has already filed bug 14865 for this.

Cheers,
Ricardo
On Feb 11, 2008 12:33 -0800, Nathaniel Rutman wrote:
> Aurelien Degremont wrote:
> > Nathaniel Rutman wrote:
> >> IV2 - why not multiple purged windows? Seems like if you're going to
> >> purge 1 object out of a file, you might want to purge more.
> >> Specifically, it will probably be a common case to purge every object
> >> of a file from a particular OST. This is not contiguous in a
> >> striped file.
> >> I don't see any reason to purge anything smaller than an entire object
> >> on an OST - is there good reason for this?
> >
> > Multiple purged windows are subtle. If you permit this feature, you
> > could technically have, in the worst case, one purged window per byte,
> > and this could be very huge to store. Do you think you will make
> > several holes in the same file? In which cases?

One issue is that if you are purging individual objects from a file, your windows will be quite disjoint at the file level. That may not be a serious problem for applications that only look at the first and last chunks of a file. I can imagine use cases for extremely large files and limited-size caches where there is a need to access only subsets of the file (i.e. the entire file cannot be resident at one time). That said, it may be that this is too complex for the initial implementation.

> Like I said, I don't see any reason to purge anything smaller than a
> full object; I would in fact disallow purging of an arbitrary byte range,
> and only allow purging on full-object boundaries.

That is impractical, for the reasons that Aurelien mentioned - we want to avoid file re-staging for tools like "file" and GUIs that read the start/end of files to determine file type and icons.

> > In fact, the more common case is to totally purge a file which has been
> > migrated to the HSM, and it is only an optimisation to keep the start
> > and the end of the file on disk, to avoid triggering tons of cache
> > misses with commands like "file foo/*" or a tool like Nautilus or
> > Windows Explorer browsing the directory.
>
> Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
> think it helps much to keep less than that in the beginning/end objects,
> so I would say just keep the first and last blocks instead.

What if the file is N*1MB + 1 byte? We need to be able to keep something like 64kB for a Windows icon, so having some arbitrary byte range seems reasonable.

> > The purged window is stored per object, OST object and MDT object.
> > So, if several objects are purged, each object will store its own
> > purged window. But the MDT object describing this file will store a
> > special purged window which starts at the smallest unavailable byte
> > and ends at the first available one.

I think this should read "ends at the highest range contiguous to the end of the file" or similar, or it will be misleading in the multi-object case.

> >> the OST must keep track of purged objects, not ranges within an
> >> existing object.
> >
> > Objects are not removed, only their data. All metadata are kept.

The one drawback of this approach is that it is not possible to HSM copy-in objects to a different OST than where they were originally stored. BUT... in conjunction with the migration tool it should be able to migrate an (empty) object from one OST to another before the copy-in from HSM, so long as there is no OST-specific data in the HSM identifier (i.e.
the HSM label is truly opaque).

> >> If the MDT is tracking purged areas also, then there's a good
> >> potential synergy here with a missing OST --
> >> if the missing OST's objects are marked as purged, then we can
> >> potentially recover them automatically from HSM...
> >
> > What do you call a "missing OST"? A corrupt one? An offline one?
> > Unavailable?
>
> Yes. All of the above. Obviously we need to distinguish between
> "permanently gone" and "temporarily gone".

I suppose this leads to a requirement to store the object in HSM so that it can be accessed just by the object FID+version. That would allow the OST to be restored from HSM even if the entire OST filesystem is lost, potentially modifying the FLDB to relocate the FID to a different OST.

> > Where will you copy back the object data? To another OST object?
>
> Yes. Some kind of recovery will take place to generate a new object on
> a different OST and we can restore the data there.
>
> > With the purged window on each OST object and the MDT, and the file
> > striping info, we could easily restore the missing parts.
>
> Exactly. This is why I say we should think about this now, to allow for
> this possibility.

Right.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
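To illustrate Andreas's point about keeping an arbitrary head and tail rather than rounding to 1MB chunks, here is a small sketch that chooses the single file-level purged window for a file of arbitrary size. The 64kB keep sizes come from the example in the discussion; the names are assumptions.

    /* Hypothetical sketch of choosing the single file-level purged
     * window: keep a small head and tail (enough for "file"/icon
     * readers) and purge everything in between. Handles the
     * "N*1MB + 1 byte" case naturally since nothing is rounded to
     * stripe boundaries. Names and keep sizes are assumptions. */
    #include <stdint.h>

    #define HSM_KEEP_HEAD (64 * 1024)
    #define HSM_KEEP_TAIL (64 * 1024)

    struct purge_range {
        uint64_t start; /* first purged byte          */
        uint64_t end;   /* first byte after the purge */
    };

    /* Returns 0 and fills *pr if the file is worth purging, -1 otherwise. */
    static int choose_purge_range(uint64_t file_size, struct purge_range *pr)
    {
        if (file_size <= HSM_KEEP_HEAD + HSM_KEEP_TAIL)
            return -1; /* too small: purging would save nothing */
        pr->start = HSM_KEEP_HEAD;
        pr->end = file_size - HSM_KEEP_TAIL;
        return 0;
    }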
Hi,

Sorry if these questions duplicate previous debate.

Have I understood correctly that the design allows individual objects within a Lustre file (i.e. stripes?) to be purged independently?

If so, why is this needed?

I would have thought that when you purge a file, you need only record the purged extent as an attribute of the whole Lustre file and punch its stripes to free the space. Am I missing a use case?

--
Cheers,
Eric
Eric Barton wrote:
> Hi,
>
> Sorry if these questions duplicate previous debate.
>
> Have I understood correctly that the design allows individual objects
> within a Lustre file (i.e. stripes?) to be purged independently?
>
> If so, why is this needed?
>
> I would have thought that when you purge a file, you need only record
> the purged extent as an attribute of the whole Lustre file and punch
> its stripes to free the space. Am I missing a use case?

Since the beginning, CFS required this feature. It seems a lab asked for it; I do not know who. Unfortunately we have no use case for what they want to do with this. I'm wondering if their need could not be met with other features, like the internal Lustre migration...

--
Aurelien Degremont
CEA
It is important to note that all comparisons and modifications are done at the Lustre-object level: OST stripe object or MDT file object; each of those objects already has a version field, in the FID. It is this version inside the FID that we will use for all treatments. All purges are always requested for a specific FID.

The mtime is stored only for information, for the users. It is simpler to display to the user:

user$ list_hsm_copies ./foo
Date
===========
Feb  2 2006
Jun 18 2007
Jun 19 2007

than:

user$ list_hsm_copies ./foo
Version
===========
0x0012356
0x001a250
0x001a011

If the user "touched" the file at some time, he knows what he has done. Just the output will be different; internally, we manipulate Lustre FIDs and so we do not care about mtime. So the "version" in the backend is not a problem. We do not rely on the ldiskfs/zfs inode versioning.

Aurelien Degremont

Rick Matthews wrote:
> I'm probably responsible for opening this can of worms. I inferred from
> the HSM HLD that mtime was proposed to be used for state change, or
> version, of the file/object. As the discussion bears out, mtime for this
> purpose would be a bad idea. A reliable way of detecting change is
> needed, and if it already exists within Lustre, great!
>
> What I think is far more significant is the involvement of the community
> on issues such as this. The more folks examining (and critiquing) the
> details, the better. Nice to see such an active community.

--
Aurelien Degremont
CEA
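For readers unfamiliar with Lustre FIDs, here is a sketch of the identifier being discussed and of the version comparison Aurelien describes. The field names follow Lustre's struct lu_fid; treat the comparison helper as illustrative, not the HLD's normative logic.

    /* Sketch of the Lustre FID with its version field, as used for the
     * comparisons described above. Field names follow Lustre's struct
     * lu_fid; the staleness check is an illustrative assumption. */
    #include <stdbool.h>
    #include <stdint.h>

    struct lu_fid {
        uint64_t f_seq; /* sequence: groups of objects           */
        uint32_t f_oid; /* object id within the sequence         */
        uint32_t f_ver; /* version: bumped when the data change  */
    };

    /* An HSM copy is stale if its recorded FID version is older than
     * the live object's version; mtime never enters the comparison. */
    static bool hsm_copy_is_stale(const struct lu_fid *live,
                                  const struct lu_fid *archived)
    {
        return archived->f_ver < live->f_ver;
    }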
On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
> Eric Barton wrote:
> > Sorry if these questions duplicate previous debate.
> >
> > Have I understood correctly that the design allows individual objects
> > within a Lustre file (i.e. stripes?) to be purged independently?
> >
> > If so, why is this needed?
> >
> > I would have thought that when you purge a file, you need only record
> > the purged extent as an attribute of the whole Lustre file and punch
> > its stripes to free the space. Am I missing a use case?
>
> Since the beginning, CFS required this feature. It seems a lab asked for
> it; I do not know who. Unfortunately we have no use case for what they
> want to do with this.
> I'm wondering if their need could not be met with other features, like
> the internal Lustre migration...

That is my understanding also - I believe one of the Labs wanted this (to be able to do HSM on a per-stripe basis instead of a per-file basis).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas,

Is this requirement documented? I'd appreciate any pointers...

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM]
> On Behalf Of Andreas Dilger
> Sent: 12 February 2008 5:23 PM
> To: Aurelien Degremont
> Cc: Eric Barton; lustre-devel at lists.lustre.org
> Subject: Re: [Lustre-devel] Lustre HSM HLD draft
>
> On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
> > Eric Barton wrote:
> > > Sorry if these questions duplicate previous debate.
> > >
> > > Have I understood correctly that the design allows individual objects
> > > within a Lustre file (i.e. stripes?) to be purged independently?
> > >
> > > If so, why is this needed?
> > >
> > > I would have thought that when you purge a file, you need only record
> > > the purged extent as an attribute of the whole Lustre file and punch
> > > its stripes to free the space. Am I missing a use case?
> >
> > Since the beginning, CFS required this feature. It seems a lab asked
> > for it; I do not know who. Unfortunately we have no use case for what
> > they want to do with this.
> > I'm wondering if their need could not be met with other features, like
> > the internal Lustre migration...
>
> That is my understanding also - I believe one of the Labs wanted this
> (to be able to do HSM on a per-stripe basis instead of a per-file basis).
>
> Cheers, Andreas
Andreas Dilger wrote:
> On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
>> Eric Barton wrote:
>>> Sorry if these questions duplicate previous debate.
>>>
>>> Have I understood correctly that the design allows individual objects
>>> within a Lustre file (i.e. stripes?) to be purged independently?
>>>
>>> If so, why is this needed?
>>>
>>> I would have thought that when you purge a file, you need only record
>>> the purged extent as an attribute of the whole Lustre file and punch
>>> its stripes to free the space. Am I missing a use case?
>>
>> Since the beginning, CFS required this feature. It seems a lab asked for
>> it; I do not know who. Unfortunately we have no use case for what they
>> want to do with this.
>> I'm wondering if their need could not be met with other features, like
>> the internal Lustre migration...
>
> That is my understanding also - I believe one of the Labs wanted this
> (to be able to do HSM on a per-stripe basis instead of a per-file basis).

This doesn't make any sense to me. Layouts may change; a stripe on one filesystem may not correspond to a stripe on a replica of the filesystem; exposing stripes to user apps is a bad idea.

I'm going to propose what I think we need (see the sketch after this list):
1. Punch a single, arbitrary byte range from the middle of a file (thus leaving beginning and end for file type, icons, filesize).
2. No other arbitrary punch patterns.
3. The punched range is stored on the MDT alone.
4. Once punched, the OST may forget about any fully-punched stripes it used to hold.
5. Clients must take a layout lock (CR) when they retrieve the layout from the MDT. If the MDT punches from the middle, it revokes the layout lock, and clients must re-enqueue it for further read/write on the file. The MDT is the sole keeper of the layout, and it must be protected by a lock.
6. Client access within a punched range results in an RPC to the MDT. The MDT decides where to put the restored data, organizes the restoration (via the coordinator), and rewrites the layout (under lock, of course). The client gets the new layout, and can contact the appropriate OST.
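A hedged sketch of the client-side flow in points 5 and 6 above. Every function name here is hypothetical, chosen only to show the ordering of layout lock, punched-range check, and restore RPC; none of these are real Lustre APIs.

    /* Hypothetical sketch of the client read path under the proposal
     * above. All names are made up; only the lock/RPC ordering is the
     * point of this sketch. */

    struct layout; /* file layout, including the single punched range */

    extern struct layout *layout_lock_enqueue_cr(int fd);          /* point 5 */
    extern int layout_range_is_punched(const struct layout *lo,
                                       long off, long len);
    extern int mdt_restore_range_rpc(int fd, long off, long len);  /* point 6 */
    extern long ost_read(const struct layout *lo, int fd,
                         void *buf, long off, long len);
    extern void layout_put(struct layout *lo);

    long client_read(int fd, void *buf, long off, long len)
    {
        for (;;) {
            /* CR layout lock: revoked by the MDT if it punches/restores. */
            struct layout *lo = layout_lock_enqueue_cr(fd);

            if (!layout_range_is_punched(lo, off, len)) {
                long rc = ost_read(lo, fd, buf, off, len);
                layout_put(lo);
                return rc;
            }

            /* Ask the MDT to restore; it rewrites the layout under lock,
             * which revokes ours, so loop and re-enqueue for the new one. */
            layout_put(lo);
            if (mdt_restore_range_rpc(fd, off, len) < 0)
                return -1;
        }
    }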
Aurelien and JC,

Sorry that my feedback is late. Here are my questions/remarks.

General
* Any thought on how quotas will be handled?

Coordinator
* 3.4 - I was curious what the precise use case was that was driving this. I don't disagree with it, but I was curious for more background.
* 3.7.1 - The coordinator could become a scaling bottleneck. We should think about how this will be scaled in the future.
* 4.1 - Does the coordinator store the external object id, or does the agent?
* 4.3 item 2 - This looks like the coordinator could become a bottleneck for unlinks and slow down performance. Could this be put in some type of async queue to be processed later (or some type of attic space)?

Use Cases
* 2.3 (Use cases) - I'm really keen on this feature. I think it is very important in order to make small-file performance work well. Unfortunately, it isn't clear how the file list gets communicated to the archive tool. The coordinator and agent seem to only take one file at a time. So how would this work exactly?
* 2.4 - The copy tool should be allowed to preemptively restage files. I think this will work with the design, but we should make sure of this. This would be useful for restaging a whole tar file versus doing things piecemeal.

Part IV
2 EAs - I'm worried that the EA list could get huge for holes.
3.2 - item 3 - Who ensures a file is archived before punches are made?
3.3 - Another use case... The user checks to see if a file has been archived.

Also, someone earlier made the point about the archive tool being able to reorder requests. This is really important, since an archival system wants to know all the files being restaged in order to order tape mounts and reads.

Thanks for taking the lead on this. It looks like there is a lot of interest in it.

--Shane

-----Original Message-----
From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of DEGREMONT Aurelien
Sent: Thursday, February 07, 2008 5:53 AM
To: lustre-devel at lists.lustre.org
Subject: [Lustre-devel] Lustre HSM HLD draft

Hello

Here is a first draft for comments of the Lustre HSM HLD. It is intended to be a support for further analysis and comments from CFS/Sun. The document covers the main parts of the HSM features, but some elements are still lacking. The policy management and the space manager will be described later. Let us know your comments and ideas about it.

Regards,
Aurelien Degremont
CEA
Canon, Richard Shane wrote:
> General
> * Any thought on how quotas will be handled?

That's a very good question. I think this point should be discussed. The purge possibility introduces two values which could be under quota:
1 - File size (current case)
2 - The disk space actually used (migrated files free quota)

The first option is the simplest to implement and will need fewer modifications, but users could not free quota even if all their files are migrated. The second option could help users, but it will be problematic when they copy back some of their files, because this will trigger space issues and purge requests on their other files, and so on.

IMO, the best way is to take choice #1 and possibly add a "real disk use" quota value that could be tuned by admins. I'm not a Lustre quota specialist, and AFAIK this code is a bit touchy.

> Coordinator
> * 3.4 - I was curious what the precise use case was that was driving
> this. I don't disagree with it, but I was curious for more background.

The coordinator is designed to also manage internal Lustre migrations.

> * 3.7.1 - The coordinator could become a scaling bottleneck. We should
> think about how this will be scaled in the future.

I think we should be able to have several coordinators in the future, each of them dealing with different external storages.

> * 4.1 - Does the coordinator store the external object id, or does the
> agent?

The agent does not have a storage device; it stores nothing. The external IDs are on the MDT device.

> * 4.3 item 2 - This looks like the coordinator could become a bottleneck
> for unlinks and slow down performance. Could this be put in some type of
> async queue to be processed later (or some type of attic space)?

Yes, I think unlinks should be handled asynchronously.

> Use Cases
> * 2.3 (Use cases) - I'm really keen on this feature. I think it is very
> important in order to make small-file performance work well.
> Unfortunately, it isn't clear how the file list gets communicated to the
> archive tool. The coordinator and agent seem to only take one file at a
> time. So how would this work exactly?

In fact, we have presently designed the archiving tool, and only it, to support this feature, because the archiving tool could be developed by anyone and we want this API to be as stable as possible. The current Lustre component design does not handle it, but it will be added later, in a second step, and the copy tools developed in the meantime will already be compatible with it.

> * 2.4 - The copy tool should be allowed to preemptively restage files.
> I think this will work with the design, but we should make sure of this.
> This would be useful for restaging a whole tar file versus doing things
> piecemeal.

That's an interesting point. I think we could avoid it, but it is an interesting feature. I must think about how we should modify the design to permit it. (The tool should be able to warn the coordinator: oh, I'm staging this file also! Please note it.)

> 2 EAs - I'm worried that the EA list could get huge for holes.

This part has been redesigned. The data that were stored in EAs have been moved. It will be explained in the new document version.

> 3.2 - item 3 - Who ensures a file is archived before punches are made?

The space manager does. It is the only one which will make punch requests. Maybe the MDT could ensure it before dealing with them.

> 3.3 - Another use case... The user checks to see if a file has been
> archived.

Ok.

> Also, someone earlier made the point about the archive tool being able
> to reorder requests.
This is really important since an archival system > wants to know all the files being restaged in order to order tape mounts > and reads.I do not see any problem with this. I will add this point in the doc.> Thanks for taking the lead on this. It looks like there is a lot of > interest in it.Thanks you for your very interesting comments. -- Aurelien Degremont CEA
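To make the asynchronous unlink handling agreed to above more concrete, here is a minimal sketch of a coordinator-side unlink queue: the unlink path only enqueues a record, and a background worker removes the external objects later. All names here (hsm_unlink_rec, coord_queue_unlink, coord_unlink_worker) are hypothetical; the HLD does not define this interface.

    /* Minimal sketch of an asynchronous unlink queue for the coordinator.
     * All names are hypothetical; the HLD does not specify this interface. */
    #include <pthread.h>
    #include <stdlib.h>

    struct hsm_unlink_rec {
            unsigned long long      ur_fid;      /* FID of the unlinked file */
            struct hsm_unlink_rec  *ur_next;
    };

    static struct hsm_unlink_rec *unlink_head;
    static pthread_mutex_t        unlink_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t         unlink_cond = PTHREAD_COND_INITIALIZER;

    /* Called from the unlink path: enqueue and return immediately, so the
     * MDT never blocks on the external storage. */
    int coord_queue_unlink(unsigned long long fid)
    {
            struct hsm_unlink_rec *rec = malloc(sizeof(*rec));

            if (rec == NULL)
                    return -1;
            rec->ur_fid = fid;
            pthread_mutex_lock(&unlink_lock);
            rec->ur_next = unlink_head;
            unlink_head = rec;
            pthread_cond_signal(&unlink_cond);
            pthread_mutex_unlock(&unlink_lock);
            return 0;
    }

    /* Background thread: drain the queue and ask the archive to remove
     * the corresponding external objects at its own pace. */
    void *coord_unlink_worker(void *arg)
    {
            (void)arg;
            for (;;) {
                    struct hsm_unlink_rec *rec;

                    pthread_mutex_lock(&unlink_lock);
                    while (unlink_head == NULL)
                            pthread_cond_wait(&unlink_cond, &unlink_lock);
                    rec = unlink_head;
                    unlink_head = rec->ur_next;
                    pthread_mutex_unlock(&unlink_lock);
                    /* a call like archive_remove(rec->ur_fid) would go here */
                    free(rec);
            }
            return NULL;
    }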
Hello

I've got several questions about some specific points in the HSM implementation and I would like your opinion about them.

Coordinator:

This element will manage migration externally (HSM) and internally of Lustre (space balancing?). Is the current API acceptable (specific calls for external migration, and other ones for internal migration)? The best way could have been to have a generic call for migration, but we must also have generic objects to describe the migration sources and destinations, and those are not simple. We finally concluded with the API presented in the HLD document. Tell me if this is *really* a bad idea or if only adjustments are needed.

We presented two modes of migration, explicit and implicit. The first results from an administrative request, the second is triggered automatically (by a cache miss, for example). Is that ok? (See the doc for all details.)

Agent:

It seems, to support Lustre internal migration, you have planned to implement specific Agents which will reside on OSTs. HSM will need specific agents on clients. Are those two kinds of agent acceptable? The current API only describes HSM-based agents. Maybe we should think of a generic agent framework and add specialized implementations for ost, hsm, etc.?

--
Aurelien Degremont
CEA
Aurelien Degremont wrote:
> Hello
>
> I've got several questions about some specific points in the HSM
> implementation and I would like your opinion about them.
>
> Coordinator:
>
> This element will manage migration externally (HSM) and internally of
> Lustre (space balancing?). Is the current API acceptable (specific
> calls for external migration, and other ones for internal migration)?

I would like to see a parameter indicating what agent will be used and keep all other parameters the same.

> The best way could have been to have a generic call for migration, but
> we must also have generic objects to describe the migration sources
> and destinations, and those are not simple.

For migration to and from external sources, Lustre must already manage this data in an extended attribute (e.g. to describe the file on tape to which a Lustre file was migrated). This data is opaque to Lustre and can be passed as a blob.

> We finally concluded with the API presented in the HLD document. Tell
> me if this is *really* a bad idea or if only adjustments are needed.

I have not yet looked at these.

> We presented two modes of migration, explicit and implicit. The first
> results from an administrative request, the second is triggered
> automatically (by a cache miss, for example). Is that ok? (See the doc
> for all details.)

Yes, that seems ok.

> Agent:
>
> It seems, to support Lustre internal migration, you have planned to
> implement specific Agents which will reside on OSTs.

To avoid many complications involving locks, we decided that even the agents used for internal migrations will layer on the file system. The Lustre file system will be mounted on the OSTs and it will use the "LOLND" to transport the data efficiently between the OST process and the client file system cache. In the internal case both source and destination lie in Lustre; in the HSM case only one of them does.

As a result I believe these two cases are closer together than you may think, and should be one "type".

The key aspect we/you need to design is what an agent has to make sure happens, for example in terms of locking file extents and in terms of avoiding triggering a recursive cache miss (open by fid with a flag?).

- Peter -

> HSM will need specific agents on clients. Are those two kinds of agent
> acceptable? The current API only describes HSM-based agents. Maybe we
> should think of a generic agent framework and add specialized
> implementations for ost, hsm, etc.?
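Peter's suggestion above, one call signature with a parameter selecting the agent and the external descriptor passed as an opaque blob stored in the file's EA, could look roughly like the sketch below. Every identifier here is an assumption for illustration, not part of the HLD.

    /* Sketch of a single migration entry point with an agent selector
     * and an opaque descriptor, per the suggestion above. All names
     * are hypothetical. */
    #include <stddef.h>

    enum hsm_agent_type {
            AGENT_INTERNAL,         /* OST-to-OST restriping/migration */
            AGENT_EXTERNAL,         /* copy tool talking to an archive */
    };

    struct hsm_migrate_req {
            unsigned long long      mr_fid;      /* source file FID */
            enum hsm_agent_type     mr_agent;    /* which agent to use */
            /* Opaque to Lustre: for external migration this is the
             * archive's own description of the object (e.g. its tape
             * location), stored in and restored from the file's
             * extended attribute as a blob. */
            void                   *mr_desc;
            size_t                  mr_desc_len;
    };

    /* One call for both cases; only mr_agent and the blob differ. */
    int coord_migrate(const struct hsm_migrate_req *req);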
Just a few initial responses from me, I haven't read things systematically yet.

Canon, Richard Shane wrote:
> General
> * Any thoughts on how quotas will be handled?

This is very, very important and will require a lot of detail. Well spotted, Shane!

> Coordinator
> * 3.4 - I was curious what precise use case is driving this. I don't
> disagree with it, but I would like more background.

In internal migrations many objects will be restriped to another set of objects to move the data. The coordinator handles the completion and aborting of the agents accomplishing this.

> * 3.7.1 - The coordinator could become a scaling bottleneck. We should
> think about how it will be scaled in the future.

In my writings I was always anticipating a family of load-balancing coordinators.

> * 4.1 - Does the coordinator store the external object id, or does the
> agent?

The coordinator, I suggest, in view of the fact that many agents may be required to move one file.

> * 4.3 item 2 - This looks like the coordinator could become a
> bottleneck for unlinks and slow down performance. Could this be put in
> some type of async queue to be processed later (or some type of attic
> space)?

I agree with this.

> Part IV
> 2 EAs - I'm worried that the EA list could get huge for holes.

The EA merely points to an extent tree (similar to the allocation extent tree).

> 3.2 item 3 - Who ensures a file is archived before punches are made?

The coordinator.
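Peter's remark that the EA merely points to an extent tree implies the EA itself stays small no matter how many holes a file accumulates. A purely illustrative shape, under assumed names (the HLD defines neither struct):

    /* Illustrative only: the EA holds a fixed-size reference to an
     * extent tree root, so punching many holes never grows the EA. */
    #include <stdint.h>

    struct hsm_extent {                 /* one node of the on-disk extent tree */
            uint64_t        he_start;   /* first byte of the extent */
            uint64_t        he_len;     /* length in bytes */
            uint32_t        he_flags;   /* e.g. resident vs. punched/archived */
    };

    struct hsm_ea {                     /* what is actually stored in the EA */
            uint64_t        ea_tree_root;   /* block address of the tree root */
            uint32_t        ea_tree_depth;
            uint32_t        ea_flags;       /* file-level HSM state */
    };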
Peter J Braam wrote:
>> Coordinator:
>>
>> This element will manage migration externally (HSM) and internally of
>> Lustre (space balancing?). Is the current API acceptable (specific
>> calls for external migration, and other ones for internal migration)?
> I would like to see a parameter indicating what agent will be used and
> keep all other parameters the same.

Agreed.

>> The best way could have been to have a generic call for migration, but
>> we must also have generic objects to describe the migration sources
>> and destinations, and those are not simple.
> For migration to and from external sources, Lustre must already manage
> this data in an extended attribute (e.g. to describe the file on tape
> to which a Lustre file was migrated). This data is opaque to Lustre and
> can be passed as a blob.

>> It seems, to support Lustre internal migration, you have planned to
>> implement specific Agents which will reside on OSTs.
> To avoid many complications involving locks, we decided that even the
> agents used for internal migrations will layer on the file system. The
> Lustre file system will be mounted on the OSTs and it will use the
> "LOLND" to transport the data efficiently between the OST process and
> the client file system cache. In the internal case both source and
> destination lie in Lustre; in the HSM case only one of them does.
>
> As a result I believe these two cases are closer together than you may
> think, and should be one "type".

If we unify the API, we must have a way to request data movements like:

    copy elemA to placeP
    copy elemA, stored in placeP, back into Lustre
    copy elemA to placeC
    move elemA into elemB

The elem could be unified using a Lustre FID, but the place could be an external storage or a precise OST. If we want a unified API, the API calls should manipulate a generic object which could describe either a Lustre storage element (OST) or an external storage (HSM, ...), i.e.:

    struct storage_place {
            ...
    };

    copy(fid, struct storage_place *);
    move(fid, struct storage_place *);

with some specific cases to handle. The other possibility:

    ext_copyout(fid, external storage);
    ext_copyin(fid, external object);
    int_copy(fid, fid, ost);
    int_move(fid, fid, ost);

I think this one, even if the design is not the most beautiful, is the easiest. Unless we want to create new generic objects to manipulate Lustre object data and generic storage areas, the second option is the best one IMO.

--
Aurelien Degremont
CEA
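For comparison, the generic storage_place that Aurelien sketches above might amount to a tagged union covering both a precise OST and an opaque external descriptor. This is a hypothetical layout under assumed names; the HLD defines neither variant in detail.

    /* A possible shape for the generic storage_place discussed above:
     * a tagged union covering both an OST index and an external
     * archive descriptor. All names are hypothetical. */
    #include <stddef.h>

    enum place_type {
            PLACE_OST,              /* a precise OST inside Lustre */
            PLACE_EXTERNAL,         /* an external storage (HSM, ...) */
    };

    struct storage_place {
            enum place_type  sp_type;
            union {
                    unsigned int     sp_ost_idx;   /* PLACE_OST */
                    struct {                       /* PLACE_EXTERNAL */
                            const void *sp_desc;     /* opaque archive blob */
                            size_t      sp_desc_len;
                    } sp_ext;
            } u;
    };

    /* The unified calls then take a FID plus one generic place. */
    int hsm_copy(unsigned long long fid, const struct storage_place *dst);
    int hsm_move(unsigned long long fid, const struct storage_place *dst);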
The discussion below about the APIs is a standard element of data abstraction taught in advanced programming courses (see e.g. Abelson et al., Structure and Interpretation of Computer Programs (SICP)). From this one concludes that the coordinator and agents will use abstract data types and call abstract methods that accommodate multiple:

- source and destination descriptors for the data
- data movers implementing the methods to move the data

If you proceed along the lines you outline, you will get a big matrix of movers and data types to keep track of. If you follow my approach, you will encapsulate things much more cleanly. Think in terms of virtual classes of data movers acting on source and destination objects.

- peter -
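Peter's virtual classes of data movers can be expressed in C as a shared operations table, so the coordinator dispatches through one interface instead of a matrix of mover/type combinations. Again a sketch under assumed names, not the HLD's API.

    /* Sketch of the abstraction described above: every data mover
     * (internal OST-to-OST, external HSM, ...) implements the same
     * operations table over abstract source/destination descriptors.
     * Hypothetical names throughout. */
    struct data_source;             /* abstract source descriptor */
    struct data_sink;               /* abstract destination descriptor */

    struct data_mover_ops {
            int (*dm_copy)(struct data_source *src, struct data_sink *dst);
            int (*dm_move)(struct data_source *src, struct data_sink *dst);
            int (*dm_abort)(struct data_source *src, struct data_sink *dst);
    };

    struct data_mover {
            const char                   *dm_name;   /* "ost", "hsm", ... */
            const struct data_mover_ops  *dm_ops;
    };

    /* The coordinator dispatches through the table, never on the type: */
    static inline int mover_copy(const struct data_mover *m,
                                 struct data_source *src,
                                 struct data_sink *dst)
    {
            return m->dm_ops->dm_copy(src, dst);
    }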