thr3ads.net - Lustre devel - [Lustre-devel] Doubly indexed tree / changelogs [Sep 2008]

If this information is useful, please help other people find it:
Share via:

Peter Braam

2008-Sep-21 23:40 UTC

[Lustre-devel] Doubly indexed tree / changelogs

Hi Nikita, Nathan -

After some pondering I have come to two conclusions.

To encode filesets, we need a tree that makes two iterations fast:

1. list all filesets that contain a certain object
2. list all objects in a certain fileset

Is there a doubly indexed tree for this?

Secondly, to make the changelogs useful and scalable for filesets we will
need to be able to list all changelog entries associated with a certain
inode efficiently.  I see two ways to do this ? one is an auxiliary
directory file mapping inodes to many changelog entries, the second is to
embed forward and backward pointers in the changelog entries to build a
linked list rooted at the inode (using an EA in the inode pointing to the
first and last element of the list).  Both have some overheads.  What are
your thoughts?

Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
http://lists.lustre.org/pipermail/lustre-devel/attachments/20080922/8c54f19d/attachment.html

Alex Zhuravlev

2008-Sep-22 05:52 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

can object migrate between filesets? if not, we probably
can use fid''s sequence as a record in that index?

thanks, Alex

Peter Braam wrote:> Hi Nikita, Nathan -
> 
> After some pondering I have come to two conclusions.
> 
> To encode filesets, we need a tree that makes two iterations fast:
> 
>    1. list all filesets that contain a certain object
>    2. list all objects in a certain fileset
> 
> 
> Is there a doubly indexed tree for this?
> 
> Secondly, to make the changelogs useful and scalable for filesets we 
> will need to be able to list all changelog entries associated with a 
> certain inode efficiently.  I see two ways to do this ? one is an 
> auxiliary directory file mapping inodes to many changelog entries, the 
> second is to embed forward and backward pointers in the changelog 
> entries to build a linked list rooted at the inode (using an EA in the 
> inode pointing to the first and last element of the list).  Both have 
> some overheads.  What are your thoughts?
> 
> Peter
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

Peter Braam

2008-Sep-22 06:58 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

Objects can be in many filesets, and be added to some, removed from others.

I think this is an almost arbitrary collection of pairs (FID, fileset-id)
and we need it indexed by both.

Peter


On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM>
wrote:
> can object migrate between filesets? if not, we probably
> can use fid''s sequence as a record in that index?
> 
> thanks, Alex
> 
> Peter Braam wrote:
>> Hi Nikita, Nathan -
>> 
>> After some pondering I have come to two conclusions.
>> 
>> To encode filesets, we need a tree that makes two iterations fast:
>> 
>>    1. list all filesets that contain a certain object
>>    2. list all objects in a certain fileset
>> 
>> 
>> Is there a doubly indexed tree for this?
>> 
>> Secondly, to make the changelogs useful and scalable for filesets we
>> will need to be able to list all changelog entries associated with a
>> certain inode efficiently.  I see two ways to do this ? one is an
>> auxiliary directory file mapping inodes to many changelog entries, the
>> second is to embed forward and backward pointers in the changelog
>> entries to build a linked list rooted at the inode (using an EA in the
>> inode pointing to the first and last element of the list).  Both have
>> some overheads.  What are your thoughts?
>> 
>> Peter
>> 
>> 
>>
------------------------------------------------------------------------
>> 
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>

Alex Zhuravlev

2008-Sep-22 07:05 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

IMHO, it''d be useful to insert aggregations into that index.
just to keep the index small. say, subtree?

thanks, Alex

Peter Braam wrote:> Objects can be in many filesets, and be added to some, removed from others.
> 
> I think this is an almost arbitrary collection of pairs (FID, fileset-id)
> and we need it indexed by both.
> 
> Peter
> 
> 
> On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev at
Sun.COM> wrote:
> 
>> can object migrate between filesets? if not, we probably
>> can use fid''s sequence as a record in that index?
>>
>> thanks, Alex
>>
>> Peter Braam wrote:
>>> Hi Nikita, Nathan -
>>>
>>> After some pondering I have come to two conclusions.
>>>
>>> To encode filesets, we need a tree that makes two iterations fast:
>>>
>>>    1. list all filesets that contain a certain object
>>>    2. list all objects in a certain fileset
>>>
>>>
>>> Is there a doubly indexed tree for this?
>>>
>>> Secondly, to make the changelogs useful and scalable for filesets
we
>>> will need to be able to list all changelog entries associated with
a
>>> certain inode efficiently.  I see two ways to do this ? one is an
>>> auxiliary directory file mapping inodes to many changelog entries,
the
>>> second is to embed forward and backward pointers in the changelog
>>> entries to build a linked list rooted at the inode (using an EA in
the
>>> inode pointing to the first and last element of the list).  Both
have
>>> some overheads.  What are your thoughts?
>>>
>>> Peter
>>>
>>>
>>>
------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lustre-devel mailing list
>>> Lustre-devel at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> 
>

Peter Braam

2008-Sep-22 07:13 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

Sure, when aggregations apply.  But they do not apply in general (e.g.
Filesets that are search results) and we need a doubly indexed tree for
that.

Hence my question, what doubly indexed trees exist?

Peter


On 9/22/08 3:05 PM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM>
wrote:
> IMHO, it''d be useful to insert aggregations into that index.
> just to keep the index small. say, subtree?
> 
> thanks, Alex
> 
> Peter Braam wrote:
>> Objects can be in many filesets, and be added to some, removed from
others.
>> 
>> I think this is an almost arbitrary collection of pairs (FID,
fileset-id)
>> and we need it indexed by both.
>> 
>> Peter
>> 
>> 
>> On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev at
Sun.COM> wrote:
>> 
>>> can object migrate between filesets? if not, we probably
>>> can use fid''s sequence as a record in that index?
>>> 
>>> thanks, Alex
>>> 
>>> Peter Braam wrote:
>>>> Hi Nikita, Nathan -
>>>> 
>>>> After some pondering I have come to two conclusions.
>>>> 
>>>> To encode filesets, we need a tree that makes two iterations
fast:
>>>> 
>>>>    1. list all filesets that contain a certain object
>>>>    2. list all objects in a certain fileset
>>>> 
>>>> 
>>>> Is there a doubly indexed tree for this?
>>>> 
>>>> Secondly, to make the changelogs useful and scalable for
filesets we
>>>> will need to be able to list all changelog entries associated
with a
>>>> certain inode efficiently.  I see two ways to do this ? one is
an
>>>> auxiliary directory file mapping inodes to many changelog
entries, the
>>>> second is to embed forward and backward pointers in the
changelog
>>>> entries to build a linked list rooted at the inode (using an EA
in the
>>>> inode pointing to the first and last element of the list). 
Both have
>>>> some overheads.  What are your thoughts?
>>>> 
>>>> Peter
>>>> 
>>>> 
>>>>
------------------------------------------------------------------------
>>>> 
>>>> _______________________________________________
>>>> Lustre-devel mailing list
>>>> Lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>> 
>> 
>

Alex Zhuravlev

2008-Sep-22 07:26 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

Peter Braam wrote:> Sure, when aggregations apply.  But they do not apply in general (e.g.
> Filesets that are search results) and we need a doubly indexed tree for
> that.
ah, clear enough
> Hence my question, what doubly indexed trees exist?
there is K-D tree, but I''m not sure it fits here.

if number of filesets is limited, then we probably could build a table of
all possible filesets ovelapping and put table''s index into inode? this
is
for reverse mapping to find all filesystems for given inode.

thanks, Alex

Nathaniel Rutman

2008-Sep-23 03:49 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

I actually added a "previous record" pointer in each changelog entry, 
but fill it in only where it is cheap -- when the metadata object is 
already in the cache I record the last changelog entry there. If it''s 
not in the cache, I don''t know where the last record associated with 
that fid is. We could store the last record number with the inode (EA?), 
but that would potentially be painful if we are recording e.g. file 
open/closes.
Forward pointers are also problematic, in that I don''t want to go back 
and modify the old record every time a new one is recorded (seems like 
this will make the disks very seek-y), and I think maybe we don''t need 
forward pointers anyhow (use case?). Anyhow, this effectively doubles 
the changelog write impact. Maybe that''s ok: Manoj''s
measurements put
the changelog overhead at only about 4% using mdsrate.

Peter Braam wrote:> Hi Nikita, Nathan -
>
> After some pondering I have come to two conclusions.
>
> To encode filesets, we need a tree that makes two iterations fast:
>
>    1. list all filesets that contain a certain object
>    2. list all objects in a certain fileset
>
>
> Is there a doubly indexed tree for this?
>
> Secondly, to make the changelogs useful and scalable for filesets we 
> will need to be able to list all changelog entries associated with a 
> certain inode efficiently. I see two ways to do this ? one is an 
> auxiliary directory file mapping inodes to many changelog entries, the 
> second is to embed forward and backward pointers in the changelog 
> entries to build a linked list rooted at the inode (using an EA in the 
> inode pointing to the first and last element of the list). Both have 
> some overheads. What are your thoughts?
>
> Peter
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>

Nikita Danilov

2008-Sep-23 07:38 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

Peter Braam writes:
 > Hi Nikita, Nathan -
 > 
 > After some pondering I have come to two conclusions.
 > 
 > To encode filesets, we need a tree that makes two iterations fast:
 > 
 > 1. list all filesets that contain a certain object
 > 2. list all objects in a certain fileset
 > 
 > Is there a doubly indexed tree for this?

I don''t know of a data-structure that provides ready solution for this.
As Alex pointed out k-d-trees do not fit there (they effectively use a
key composed of the alternating sequences of bits of the original keys),
and `geographical trees'' that data bases use to store 2-d data use very
special notion of proximity.

But, thinking about a fileset as a generalized directory, isn''t this
the
same problem as `list all files in directory'' and `find all parent
directories of a file''? It probably makes sense to use the same
mechanism to deal with directories and filesets.

Our current solution is to use EA to keep track of all parent
directories, do you see any problems with applying it to filesets?

 > 
 > Peter

Nikita.

Peter Braam

2008-Sep-23 09:20 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman at
Sun.COM> wrote:
> I actually added a "previous record" pointer in each changelog
entry,
> but fill it in only where it is cheap -- when the metadata object is
> already in the cache I record the last changelog entry there. If
it''s
> not in the cache, I don''t know where the last record associated
with
> that fid is. We could store the last record number with the inode (EA?),
> but that would potentially be painful if we are recording e.g. file
> open/closes.
Previous records are free - you get the previous one from the EA in the
inode, and replace the inode with the record info of the record you are
adding.  But for rename operations and others there are multiple pointers
like this needed.


> Forward pointers are also problematic, in that I don''t want to go
back
> and modify the old record every time a new one is recorded (seems like
> this will make the disks very seek-y), and I think maybe we don''t
need
> forward pointers anyhow (use case?). Anyhow, this effectively doubles
> the changelog write impact. Maybe that''s ok: Manoj''s
measurements put
> the changelog overhead at only about 4% using mdsrate.
Wow - that is amazingly low.

It is better to think about it before hacking it in I think.


Peter
> 
> Peter Braam wrote:
>> Hi Nikita, Nathan -
>> 
>> After some pondering I have come to two conclusions.
>> 
>> To encode filesets, we need a tree that makes two iterations fast:
>> 
>>    1. list all filesets that contain a certain object
>>    2. list all objects in a certain fileset
>> 
>> 
>> Is there a doubly indexed tree for this?
>> 
>> Secondly, to make the changelogs useful and scalable for filesets we
>> will need to be able to list all changelog entries associated with a
>> certain inode efficiently. I see two ways to do this ? one is an
>> auxiliary directory file mapping inodes to many changelog entries, the
>> second is to embed forward and backward pointers in the changelog
>> entries to build a linked list rooted at the inode (using an EA in the
>> inode pointing to the first and last element of the list). Both have
>> some overheads. What are your thoughts?
>> 
>> Peter
>>
------------------------------------------------------------------------
>> 
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>   
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

Nathaniel Rutman

2008-Sep-23 21:46 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

Peter Braam wrote:>
> On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman at
Sun.COM> wrote:
>
>   
>> I actually added a "previous record" pointer in each
changelog entry,
>> but fill it in only where it is cheap -- when the metadata object is
>> already in the cache I record the last changelog entry there. If
it''s
>> not in the cache, I don''t know where the last record
associated with
>> that fid is. We could store the last record number with the inode
(EA?),
>> but that would potentially be painful if we are recording e.g. file
>> open/closes.
>>     
>
> Previous records are free - you get the previous one from the EA in the
> inode, and replace the inode with the record info of the record you are
> adding.  But for rename operations and others there are multiple pointers
> like this needed.
>   Currently I''m not reading or writing any EA for the changelog. Yes, if 
you want to tie in the fwd/back ptrs to the inode itself we need to do 
this, but I thought we were specifically discussing alternatives to 
doing that here (e.g. "auxiliary directory file mapping inodes to many 
changelog entries".) If we are e.g. recording every open/close for a 
file, do we really want to read/write the EA on the MDT every time, in 
addition to the changelog llog entry?
> Secondly, to make the changelogs useful and scalable for filesets we 
> will need to be able to list all changelog entries associated with a 
> certain inode efficiently. I see two ways to do this ? one is an 
> auxiliary directory file mapping inodes to many changelog entries, the 
> second is to embed forward and backward pointers in the changelog 
> entries to build a linked list rooted at the inode (using an EA in the 
> inode pointing to the first and last element of the list). Both have 
> some overheads. What are your thoughts?
>
>

Peter Braam

2008-Sep-23 22:48 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

On 9/24/08 5:46 AM, "Nathaniel Rutman" <Nathan.Rutman at
Sun.COM> wrote:
> Peter Braam wrote:
>> 
>> On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman at
Sun.COM> wrote:
>> 
>>   
>>> I actually added a "previous record" pointer in each
changelog entry,
>>> but fill it in only where it is cheap -- when the metadata object
is
>>> already in the cache I record the last changelog entry there. If
it''s
>>> not in the cache, I don''t know where the last record
associated with
>>> that fid is. We could store the last record number with the inode
(EA?),
>>> but that would potentially be painful if we are recording e.g. file
>>> open/closes.
>>>     
>> 
>> Previous records are free - you get the previous one from the EA in the
>> inode, and replace the inode with the record info of the record you are
>> adding.  But for rename operations and others there are multiple
pointers
>> like this needed.
>>   
> Currently I''m not reading or writing any EA for the changelog.
Yes, if
> you want to tie in the fwd/back ptrs to the inode itself we need to do
> this, but I thought we were specifically discussing alternatives to
> doing that here (e.g. "auxiliary directory file mapping inodes to many
> changelog entries".)
Good point.
> If we are e.g. recording every open/close for a
> file, do we really want to read/write the EA on the MDT every time, in
> addition to the changelog llog entry?
You are writing that inode anyway, so it doesn''t cost more I/O if the
EA is
embedded.

Peter
> 
>> Secondly, to make the changelogs useful and scalable for filesets we
>> will need to be able to list all changelog entries associated with a
>> certain inode efficiently. I see two ways to do this ? one is an
>> auxiliary directory file mapping inodes to many changelog entries, the
>> second is to embed forward and backward pointers in the changelog
>> entries to build a linked list rooted at the inode (using an EA in the
>> inode pointing to the first and last element of the list). Both have
>> some overheads. What are your thoughts?
>> 
>> 
> 
> 
>

Peter Braam

2008-Sep-24 02:50 UTC

head link

[Lustre-devel] Doubly indexed tree / changelogs

The idea is ok, but scale may turn out to be different than expected.

There could be a huge amount of filesets containing the word "Bin
Laden",
say 100,000 of them.

Would two directories work?


Peter


On 9/23/08 3:38 PM, "Nikita Danilov" <Nikita.Danilov at Sun.COM>
wrote:
> Peter Braam writes:
>> Hi Nikita, Nathan -
>> 
>> After some pondering I have come to two conclusions.
>> 
>> To encode filesets, we need a tree that makes two iterations fast:
>> 
>> 1. list all filesets that contain a certain object
>> 2. list all objects in a certain fileset
>> 
>> Is there a doubly indexed tree for this?
> 
> I don''t know of a data-structure that provides ready solution for
this.
> As Alex pointed out k-d-trees do not fit there (they effectively use a
> key composed of the alternating sequences of bits of the original keys),
> and `geographical trees'' that data bases use to store 2-d data use
very
> special notion of proximity.
> 
> But, thinking about a fileset as a generalized directory, isn''t
this the
> same problem as `list all files in directory'' and `find all parent
> directories of a file''? It probably makes sense to use the same
> mechanism to deal with directories and filesets.
> 
> Our current solution is to use EA to keep track of all parent
> directories, do you see any problems with applying it to filesets?
> 
>> 
>> Peter
> 
> Nikita.

Lustre devel - Sep 2008 - Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs

[Lustre-devel] Doubly indexed tree / changelogs