thr3ads.net - Lustre devel - [Lustre-devel] changelog for whole filesystem? [Oct 2010]

Andreas Dilger

2010-Oct-27 13:06 UTC

[Lustre-devel] changelog for whole filesystem?

I had an interesting idea today during a discussion on HSM.  One of the issues
with enabling HSM today is that a full-filesystem scan must be done initially to
populate the policy engine database.

I was thinking that it would be useful to have a virtual changelog that provides
a feed from internally traversing the whole filesystem in an efficient manner. 
For bug 22741 we have implemented a virtual index in the OSD which will return
all of the in-use ldiskfs inodes in inode-number order (which is the fastest way
to read/stat all of the inodes).  It probably wouldn't be too hard to
hook this OSD callback into the changelog API, so that reading a special
changelog file would return all of the files in the filesystem.  It is possible
to generate the full pathnames of these inodes via the "link" xattr,
if that is needed.
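
A minimal sketch of the idea in Python, assuming an in-memory inode table as a stand-in for the OSD virtual index. The names `Inode`, `ScanRecord`, `path_of` and `virtual_changelog` are hypothetical and not part of any Lustre API, and the "link" xattr is modeled here as a plain list of (parent inode, name) pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

ROOT_INO = 1  # assumed inode number of the filesystem root in this toy model


@dataclass
class Inode:
    ino: int
    in_use: bool = True
    # Toy stand-in for the "link" xattr: (parent inode number, name) pairs.
    links: List[Tuple[int, str]] = field(default_factory=list)


@dataclass
class ScanRecord:
    index: int  # scan position; here simply the inode number
    ino: int
    path: str


def path_of(table: Dict[int, Inode], ino: int) -> str:
    """Rebuild one full pathname by following the first link up to the root."""
    parts: List[str] = []
    while ino != ROOT_INO:
        parent, name = table[ino].links[0]
        parts.append(name)
        ino = parent
    return "/" + "/".join(reversed(parts))


def virtual_changelog(table: Dict[int, Inode],
                      start_after: int = 0) -> Iterator[ScanRecord]:
    """Yield one synthetic record per in-use inode, in inode-number order."""
    for ino in sorted(table):
        if ino <= start_after or not table[ino].in_use:
            continue
        yield ScanRecord(index=ino, ino=ino, path=path_of(table, ino))
```

Because the record index is the inode number, a reader that stops partway could resume by passing the last index it saw as `start_after`.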

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

bzzz.tomas at gmail.com

2010-Oct-27 13:17 UTC

[Lustre-devel] changelog for whole filesystem?

On 10/27/10 5:06 PM, Andreas Dilger wrote:
> I had an interesting idea today during a discussion on HSM.  One of the
> issues with enabling HSM today is that a full-filesystem scan must be done
> initially to populate the policy engine database.
>
> I was thinking that it would be useful to have a virtual changelog that
> provides a feed from internally traversing the whole filesystem in an efficient
> manner.  For bug 22741 we have implemented a virtual index in the OSD which will
> return all of the in-use ldiskfs inodes in inode-number order (which is the
> fastest way to read/stat all of the inodes).  It probably wouldn't be
> too hard to hook this OSD callback into the changelog API, so that reading a
> special changelog file would return all of the files in the filesystem.  It is
> possible to generate the full pathnames of these inodes via the "link"
> xattr, if that is needed.

probably you meant to iterate over OI ?

thanks, z

LEIBOVICI Thomas

2010-Oct-27 15:28 UTC

[Lustre-devel] changelog for whole filesystem?

Would this special log have the same record structure as current 
changelogs, or a different structure with more information?
Depending on how this iterator works, maybe we can avoid RPCs (for stat, 
fid2path, get_stripe, hsm_state_get...) if this info is available when 
the log record is generated.
Anyhow, this feature sounds very interesting. We'll be glad to help if
you need manpower to implement such a feature.

Thomas

Andreas Dilger

2010-Oct-28 09:04 UTC

[Lustre-devel] changelog for whole filesystem?

On 2010-10-27, at 21:17, bzzz.tomas at gmail.com wrote:
> On 10/27/10 5:06 PM, Andreas Dilger wrote:
>>
>> I was thinking that it would be useful to have a virtual changelog that
>> provides a feed from internally traversing the whole filesystem in an efficient
>> manner.  For bug 22741 we have implemented a virtual index in the OSD which will
>> return all of the in-use ldiskfs inodes in inode-number order (which is the
>> fastest way to read/stat all of the inodes).  It probably wouldn't be
>> too hard to hook this OSD callback into the changelog API, so that reading a
>> special changelog file would return all of the files in the filesystem.  It is
>> possible to generate the full pathnames of these inodes via the "link"
>> xattr, if that is needed.
>
> probably you meant to iterate over OI ?

Yes, I was thinking of hooking the OI iteration into the ChangeLog API so that it is
easier to iterate over the filesystem from userspace.

Cheers, Andreas

Andreas Dilger

2010-Oct-28 09:15 UTC

[Lustre-devel] changelog for whole filesystem?

On 2010-10-27, at 23:28, LEIBOVICI Thomas <thomas.leibovici at cea.fr> wrote:
> Would this special log have the same record structure as current
> changelogs, or a different structure with more information?
> Depending on how this iterator works, maybe we can avoid RPCs (for stat,
> fid2path, get_stripe, hsm_state_get...) if this info is available when the log
> record is generated.

My thought was to use the same record format as the changelog, so that the same
API could be used to read the "whole filesystem" traversal log and then switch
over to the standard "changes only" changelog. In fact, it might make sense to
make this atomic: a flag on a regular changelog open, so that once the traversal
completes the reader continues into the changelog with any changes that
happened since the traversal started.
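
As a rough sketch of what "a flag on a regular changelog open" could look like from the reader's side, here is a toy Python model. The `FROM_EMPTY` flag, the `open_changelog` function and the two-field record format are all assumptions made for illustration; none of them exist in Lustre.

```python
from itertools import chain
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, int]  # (record type, inode number): a toy record format

FROM_EMPTY = 0x1  # hypothetical open flag, not an existing Lustre flag


def open_changelog(scan: Iterable[Record],
                   changes: Iterable[Record],
                   flags: int = 0) -> Iterator[Record]:
    """Return a single record stream for the consumer.

    With FROM_EMPTY set, the stream is the traversal records followed by the
    regular changelog records; otherwise it is the regular changelog only.
    """
    if flags & FROM_EMPTY:
        return chain(scan, changes)
    return iter(changes)
```

Since both phases use the same record type, the consumer's read loop does not need to know where the traversal ends and the regular changelog begins.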

LEIBOVICI Thomas

2010-Oct-28 13:43 UTC

[Lustre-devel] changelog for whole filesystem?

Andreas Dilger wrote:
> On 2010-10-27, at 23:28, LEIBOVICI Thomas <thomas.leibovici at cea.fr> wrote:
>
>> Would this special log have the same record structure as current
>> changelogs, or a different structure with more information?
>> Depending on how this iterator works, maybe we can avoid RPCs (for
>> stat, fid2path, get_stripe, hsm_state_get...) if this info is available when the
>> log record is generated.
>
> My thought was to use the same format for the changelog so that it would be
> easy to use the same API to use the "whole filesystem" traversal log
> and then transfer over to the standard "changes only" changelog. In
> fact, it might make sense to make this atomic so that this is a flag on a
> regular changelog open, and it will continue after the traversal is completed to
> the changelog for any changes that happened since the traversal started.

OK, I got it. So the idea is to have a switch in the policy engine that
would be:
- if it starts for the first time => open the changelog with a special
flag to get all entries + changes in the meanwhile
- else => open the changelog as usual

"any changes that happened since the traversal started"

A couple of comments about that:
- With the current implementation, the ChangeLog transaction management starts
after the "changelog_register" on the MDT;
the log records then accumulate on the MDT until they are read and
acknowledged by the consumer.
So, reporting only the "changes that happened since the traversal
started" implies deliberately discarding earlier records
that were still waiting to be read.
- If changes occur during the scan: do we skip/ignore records for entries that
have not been listed yet?
- If we want to make the "scan log" restartable from the last read
entry, the client should be able to reopen the log
by passing the last record id as an argument, and continue the scan and/or the
standard log records where it stopped.
So merging the 2 log streams (scan and standard changelog) may imply a common
record id management.

Distinguishing the two kinds of logs by open flag makes it possible
to manage the log record index and the scan record index separately, which would
simplify the implementation:
the record index for the "scan log" would be something like the
inode number (the scan order),
and the log consumer can use this index to restart an aborted scan.

Once the changelog consumer is registered on the MDT, we are sure not to miss any
change that occurs on the filesystem.
So, to initialize the HSM policy engine DB, we can proceed as follows:
1) register a changelog consumer on the MDT
2) open and process the "scan log"
3) open and process the standard changelog records that have accumulated since
step 1)
We are sure to know all entries in the filesystem after those 3 steps.
The policy engine can actually perform 3) at any time. The only constraint is
that step 1) happens before step 2).
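
The three steps above can be sketched with a toy in-memory model, assuming only create and unlink events (renames are left out). `ToyMDT`, `initialize_db` and the `churn` hook are hypothetical names for illustration, not Lustre calls; `churn` only exists to simulate filesystem activity in the middle of the scan.

```python
from typing import Callable, Dict, List, Optional, Tuple

LogRecord = Tuple[int, str, int, str]  # (record id, op, inode number, name)


class ToyMDT:
    """In-memory stand-in for an MDT; none of these methods are Lustre calls."""

    def __init__(self, files: Dict[int, str]) -> None:
        self.files = dict(files)  # inode number -> name
        self.log: List[LogRecord] = []
        self.registered = False

    def register(self) -> int:
        """Start recording changes; return the first record id to replay."""
        self.registered = True
        return len(self.log)

    def _record(self, op: str, ino: int, name: str = "") -> None:
        if self.registered:
            self.log.append((len(self.log), op, ino, name))

    def create(self, ino: int, name: str) -> None:
        self.files[ino] = name
        self._record("CREAT", ino, name)

    def unlink(self, ino: int) -> None:
        del self.files[ino]
        self._record("UNLNK", ino)

    def scan(self, start_after: int = 0, count: int = 2) -> List[Tuple[int, str]]:
        """Return the next batch of entries in inode-number order."""
        inos = [i for i in sorted(self.files) if i > start_after][:count]
        return [(i, self.files[i]) for i in inos]


def initialize_db(mdt: ToyMDT,
                  churn: Optional[Callable[[ToyMDT], None]] = None) -> Dict[int, str]:
    db: Dict[int, str] = {}
    first = mdt.register()                  # step 1: register the consumer
    last = 0
    while True:                             # step 2: process the "scan log"
        batch = mdt.scan(start_after=last)
        if not batch:
            break
        for ino, name in batch:
            db[ino] = name
            last = ino
        if churn is not None:               # simulate changes during the scan
            churn(mdt)
            churn = None
    for _rec_id, op, ino, name in mdt.log[first:]:  # step 3: replay the changelog
        if op == "CREAT":
            db[ino] = name
        elif op == "UNLNK":
            db.pop(ino, None)               # tolerate entries the scan never saw
    return db
```

The replay in step 3 is written to be idempotent, so a record about an entry the scan never listed, or listed already in its final state, is harmless.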

Thomas.

Eric Barton

2010-Oct-29 16:50 UTC

[Lustre-devel] changelog for whole filesystem?

Andreas, Thomas,

I _do_ like the idea of opening the changelog to see changes either
"from now" or "from empty".   But I think the idea needs to be worked
out fully to support multiple changelog consumers - e.g. how to keep
multiple placeholders in the object enumeration so that changes to
objects yet to be enumerated for a particular consumer are not queued
to that consumer.  As ever, I'm concerned that what looks like "low
hanging fruit" now turns into technical debt later.

          Cheers,
                   Eric

Andreas Dilger

2010-Nov-02 06:42 UTC

[Lustre-devel] changelog for whole filesystem?

On 2010-10-29, at 10:50, Eric Barton wrote:
> I _do_ like the idea of opening the changelog to see changes either
> "from now" or "from empty".   But I think the idea needs to be worked
> out fully to support multiple changelog consumers
Definitely.  Since the "from empty" iterator is a virtual iterator in
the first place, it seems relatively easy to have a separate iteration index for
each one.  The harder part is how to integrate the virtual filesystem iteration.
> - e.g. how to keep
> multiple placeholders in the object enumeration so that changes to
> objects yet to be enumerated for a particular consumer are not queued
> to that consumer.
I was initially thinking that all filesystem events would be queued for each
"from empty" iterator, for processing after the full filesystem
iteration has completed.  There would be some potential inconsistencies (e.g.
creation events for inodes that were iterated, or deletion events for inodes
that were not iterated).

This is no worse than doing the iteration with an external tool - if the
filesystem is "live" the tool would need to handle these
inconsistencies, and if it is offline for the initial iteration there are no
inconsistencies.

However, it would also seem possible to use the current inode iteration index as
a filter to only keep events for inodes beyond the current index (possibly in a
per-consumer log if there are multiple consumers).
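
A small sketch of per-consumer filtering by scan position, in Python. It takes the reading that matches Eric's description: an event is queued for a consumer only if that consumer's scan has already passed the inode, and events for inodes still ahead of the cursor are dropped on the assumption that the scan will report their current state when it gets there. `ScanFilter` and its methods are hypothetical names, not Lustre code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (operation, inode number): a toy event format


class ScanFilter:
    """Tracks one scan cursor and one event queue per 'from empty' consumer."""

    def __init__(self) -> None:
        self.cursor: Dict[str, int] = {}  # consumer -> last inode number scanned
        self.queued: Dict[str, List[Event]] = defaultdict(list)

    def add_consumer(self, name: str) -> None:
        self.cursor[name] = 0

    def advance(self, name: str, ino: int) -> None:
        """Record that this consumer's scan has reached inode `ino`."""
        self.cursor[name] = ino

    def on_event(self, event: Event) -> None:
        """Queue the event only for consumers whose scan already passed it."""
        _op, ino = event
        for name, pos in self.cursor.items():
            if ino <= pos:
                self.queued[name].append(event)
```

Note that a newly created file that reuses an inode number behind a consumer's cursor is still queued for that consumer, which is what is wanted since the scan will not come back to it.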
> As ever, I'm concerned that what looks like "low
> hanging fruit" now turns into technical debt later.
Potentially, yes, which is why I brought it up for discussion.  I definitely
think that having a single interface for filesystem iteration makes much more
sense than having to traverse the filesystem with an external tool and only then
start to use changelogs.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

Nathan Rutman

2010-Nov-11 04:15 UTC

[Lustre-devel] changelog for whole filesystem?

Hello all!
This same "initial population of a database from objects" problem
occurs when trying to replicate a Lustre filesystem using the changelog.
The problem is actually more complicated, even for a single changelog consumer:
since iterating through the virtual changelog takes non-zero time, you're
not sure if a virtual record was created before or after an actual changelog
entry.  E.g. you might resolve the name of a file for a virtual record
either before or after a rename, and then later you would see the rename in the
changelog, leading to an inconsistent view of the namespace.

If you don't care about having the exact right name, then it's
easy enough to ignore inapplicable changelog records.


Eric Barton

2010-Nov-11 18:10 UTC

[Lustre-devel] changelog for whole filesystem?

Nathan wrote...
> This same "initial population of a database from objects" problem
> occurs when trying to replicate a Lustre filesystem using the
> changelog.
That must be a great test case.  If we can reliably reconstruct an
exact replica using a "from empty" changelog, we must have got it
right :)
> The problem is actually more complicated even for a single changelog
> consumer: since iterating through the virtual changelog takes non-0
> time, you're not sure if a virtual record was created before or
> after an actual changelog entry.  E.g. you might try to resolve the
> name of a file for a virtual record either before or after a rename,
> and then later you would see the rename in the changelog, leading to
> an inconsistent view of the namespace.
> 
> If you don't care about having the exact right name, then it's easy
> enough to ignore inapplicable changelog records.
In my original post I was trying to hint at the need to filter changes for
individual consumers depending on how far through the initial "from empty"
iteration they are.  Andreas stated it more explicitly.  I think you just
need to enumerate the cases - e.g. for rename...

        Iterated over yet?
Source inode   Target inode      Record emitted
no             no                none
no             yes               delete target
yes            yes               rename source + delete target
yes            no                rename source

The final case just looks like a rename where the target name didn't
exist already.  In any case, the only real requirement on the stream
of changelog records constructed in the initial iteration is that
consistent filesystem state be reconstructed after the last record
is consumed.
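
The rename cases enumerated above could be written down as a small function, sketched here in Python. The function name and the string labels are placeholders for illustration; they are not actual changelog record types.

```python
from typing import List


def rename_records(source_iterated: bool, target_iterated: bool) -> List[str]:
    """Records to emit to one consumer for a rename seen during its scan.

    Each argument says whether that consumer's "from empty" iteration has
    already passed the corresponding inode.
    """
    records: List[str] = []
    if source_iterated:
        # The consumer already knows the source under its old name.
        records.append("rename source")
    if target_iterated:
        # The consumer already knows the overwritten target.
        records.append("delete target")
    return records
```

Each half of the rename is emitted only if the consumer has already seen the inode it affects, so the two flags can be handled independently.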

-- 

                Cheers,
                        Eric

Eric Barton
CTO Whamcloud, Inc.
Tel: +44 117 330 1575
Mob: +44 7920 797 273

Robert Read

2010-Nov-12 23:41 UTC

[Lustre-devel] changelog for whole filesystem?

It seems simpler to do the iteration on a snapshot, instead of a live
filesystem, and allow post-snapshot changes to accumulate in the regular
changelog for processing once the "from empty" iteration is complete.

robert


Andreas Dilger

2010-Nov-12 23:58 UTC

[Lustre-devel] changelog for whole filesystem?

On 2010-11-12, at 16:41, Robert Read wrote:
> It seems simpler to do the iteration on  a snapshot, instead of a live
> filesystem, and allow post-snapshot changes to accumulate on the regular
> changelog for processing once the "from empty" iteration was complete.

Just need to implement snapshot support for Lustre. :-)

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
