Hi Nikita, Nathan - After some pondering I have come to two conclusions. To encode filesets, we need a tree that makes two iterations fast: 1. list all filesets that contain a certain object 2. list all objects in a certain fileset Is there a doubly indexed tree for this? Secondly, to make the changelogs useful and scalable for filesets we will need to be able to list all changelog entries associated with a certain inode efficiently. I see two ways to do this ? one is an auxiliary directory file mapping inodes to many changelog entries, the second is to embed forward and backward pointers in the changelog entries to build a linked list rooted at the inode (using an EA in the inode pointing to the first and last element of the list). Both have some overheads. What are your thoughts? Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-devel/attachments/20080922/8c54f19d/attachment.html
can object migrate between filesets? if not, we probably can use fid''s sequence as a record in that index? thanks, Alex Peter Braam wrote:> Hi Nikita, Nathan - > > After some pondering I have come to two conclusions. > > To encode filesets, we need a tree that makes two iterations fast: > > 1. list all filesets that contain a certain object > 2. list all objects in a certain fileset > > > Is there a doubly indexed tree for this? > > Secondly, to make the changelogs useful and scalable for filesets we > will need to be able to list all changelog entries associated with a > certain inode efficiently. I see two ways to do this ? one is an > auxiliary directory file mapping inodes to many changelog entries, the > second is to embed forward and backward pointers in the changelog > entries to build a linked list rooted at the inode (using an EA in the > inode pointing to the first and last element of the list). Both have > some overheads. What are your thoughts? > > Peter > > > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Objects can be in many filesets, and be added to some, removed from others. I think this is an almost arbitrary collection of pairs (FID, fileset-id) and we need it indexed by both. Peter On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM> wrote:> can object migrate between filesets? if not, we probably > can use fid''s sequence as a record in that index? > > thanks, Alex > > Peter Braam wrote: >> Hi Nikita, Nathan - >> >> After some pondering I have come to two conclusions. >> >> To encode filesets, we need a tree that makes two iterations fast: >> >> 1. list all filesets that contain a certain object >> 2. list all objects in a certain fileset >> >> >> Is there a doubly indexed tree for this? >> >> Secondly, to make the changelogs useful and scalable for filesets we >> will need to be able to list all changelog entries associated with a >> certain inode efficiently. I see two ways to do this ? one is an >> auxiliary directory file mapping inodes to many changelog entries, the >> second is to embed forward and backward pointers in the changelog >> entries to build a linked list rooted at the inode (using an EA in the >> inode pointing to the first and last element of the list). Both have >> some overheads. What are your thoughts? >> >> Peter >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel >
IMHO, it''d be useful to insert aggregations into that index. just to keep the index small. say, subtree? thanks, Alex Peter Braam wrote:> Objects can be in many filesets, and be added to some, removed from others. > > I think this is an almost arbitrary collection of pairs (FID, fileset-id) > and we need it indexed by both. > > Peter > > > On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM> wrote: > >> can object migrate between filesets? if not, we probably >> can use fid''s sequence as a record in that index? >> >> thanks, Alex >> >> Peter Braam wrote: >>> Hi Nikita, Nathan - >>> >>> After some pondering I have come to two conclusions. >>> >>> To encode filesets, we need a tree that makes two iterations fast: >>> >>> 1. list all filesets that contain a certain object >>> 2. list all objects in a certain fileset >>> >>> >>> Is there a doubly indexed tree for this? >>> >>> Secondly, to make the changelogs useful and scalable for filesets we >>> will need to be able to list all changelog entries associated with a >>> certain inode efficiently. I see two ways to do this ? one is an >>> auxiliary directory file mapping inodes to many changelog entries, the >>> second is to embed forward and backward pointers in the changelog >>> entries to build a linked list rooted at the inode (using an EA in the >>> inode pointing to the first and last element of the list). Both have >>> some overheads. What are your thoughts? >>> >>> Peter >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel > >
Sure, when aggregations apply. But they do not apply in general (e.g. Filesets that are search results) and we need a doubly indexed tree for that. Hence my question, what doubly indexed trees exist? Peter On 9/22/08 3:05 PM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM> wrote:> IMHO, it''d be useful to insert aggregations into that index. > just to keep the index small. say, subtree? > > thanks, Alex > > Peter Braam wrote: >> Objects can be in many filesets, and be added to some, removed from others. >> >> I think this is an almost arbitrary collection of pairs (FID, fileset-id) >> and we need it indexed by both. >> >> Peter >> >> >> On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM> wrote: >> >>> can object migrate between filesets? if not, we probably >>> can use fid''s sequence as a record in that index? >>> >>> thanks, Alex >>> >>> Peter Braam wrote: >>>> Hi Nikita, Nathan - >>>> >>>> After some pondering I have come to two conclusions. >>>> >>>> To encode filesets, we need a tree that makes two iterations fast: >>>> >>>> 1. list all filesets that contain a certain object >>>> 2. list all objects in a certain fileset >>>> >>>> >>>> Is there a doubly indexed tree for this? >>>> >>>> Secondly, to make the changelogs useful and scalable for filesets we >>>> will need to be able to list all changelog entries associated with a >>>> certain inode efficiently. I see two ways to do this ? one is an >>>> auxiliary directory file mapping inodes to many changelog entries, the >>>> second is to embed forward and backward pointers in the changelog >>>> entries to build a linked list rooted at the inode (using an EA in the >>>> inode pointing to the first and last element of the list). Both have >>>> some overheads. What are your thoughts? >>>> >>>> Peter >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> _______________________________________________ >>>> Lustre-devel mailing list >>>> Lustre-devel at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel >> >> >
Peter Braam wrote:> Sure, when aggregations apply. But they do not apply in general (e.g. > Filesets that are search results) and we need a doubly indexed tree for > that.ah, clear enough> Hence my question, what doubly indexed trees exist?there is K-D tree, but I''m not sure it fits here. if number of filesets is limited, then we probably could build a table of all possible filesets ovelapping and put table''s index into inode? this is for reverse mapping to find all filesystems for given inode. thanks, Alex
I actually added a "previous record" pointer in each changelog entry, but fill it in only where it is cheap -- when the metadata object is already in the cache I record the last changelog entry there. If it''s not in the cache, I don''t know where the last record associated with that fid is. We could store the last record number with the inode (EA?), but that would potentially be painful if we are recording e.g. file open/closes. Forward pointers are also problematic, in that I don''t want to go back and modify the old record every time a new one is recorded (seems like this will make the disks very seek-y), and I think maybe we don''t need forward pointers anyhow (use case?). Anyhow, this effectively doubles the changelog write impact. Maybe that''s ok: Manoj''s measurements put the changelog overhead at only about 4% using mdsrate. Peter Braam wrote:> Hi Nikita, Nathan - > > After some pondering I have come to two conclusions. > > To encode filesets, we need a tree that makes two iterations fast: > > 1. list all filesets that contain a certain object > 2. list all objects in a certain fileset > > > Is there a doubly indexed tree for this? > > Secondly, to make the changelogs useful and scalable for filesets we > will need to be able to list all changelog entries associated with a > certain inode efficiently. I see two ways to do this ? one is an > auxiliary directory file mapping inodes to many changelog entries, the > second is to embed forward and backward pointers in the changelog > entries to build a linked list rooted at the inode (using an EA in the > inode pointing to the first and last element of the list). Both have > some overheads. What are your thoughts? > > Peter > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel >
Peter Braam writes: > Hi Nikita, Nathan - > > After some pondering I have come to two conclusions. > > To encode filesets, we need a tree that makes two iterations fast: > > 1. list all filesets that contain a certain object > 2. list all objects in a certain fileset > > Is there a doubly indexed tree for this? I don''t know of a data-structure that provides ready solution for this. As Alex pointed out k-d-trees do not fit there (they effectively use a key composed of the alternating sequences of bits of the original keys), and `geographical trees'' that data bases use to store 2-d data use very special notion of proximity. But, thinking about a fileset as a generalized directory, isn''t this the same problem as `list all files in directory'' and `find all parent directories of a file''? It probably makes sense to use the same mechanism to deal with directories and filesets. Our current solution is to use EA to keep track of all parent directories, do you see any problems with applying it to filesets? > > Peter Nikita.
On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote:> I actually added a "previous record" pointer in each changelog entry, > but fill it in only where it is cheap -- when the metadata object is > already in the cache I record the last changelog entry there. If it''s > not in the cache, I don''t know where the last record associated with > that fid is. We could store the last record number with the inode (EA?), > but that would potentially be painful if we are recording e.g. file > open/closes.Previous records are free - you get the previous one from the EA in the inode, and replace the inode with the record info of the record you are adding. But for rename operations and others there are multiple pointers like this needed.> Forward pointers are also problematic, in that I don''t want to go back > and modify the old record every time a new one is recorded (seems like > this will make the disks very seek-y), and I think maybe we don''t need > forward pointers anyhow (use case?). Anyhow, this effectively doubles > the changelog write impact. Maybe that''s ok: Manoj''s measurements put > the changelog overhead at only about 4% using mdsrate.Wow - that is amazingly low. It is better to think about it before hacking it in I think. Peter> > Peter Braam wrote: >> Hi Nikita, Nathan - >> >> After some pondering I have come to two conclusions. >> >> To encode filesets, we need a tree that makes two iterations fast: >> >> 1. list all filesets that contain a certain object >> 2. list all objects in a certain fileset >> >> >> Is there a doubly indexed tree for this? >> >> Secondly, to make the changelogs useful and scalable for filesets we >> will need to be able to list all changelog entries associated with a >> certain inode efficiently. I see two ways to do this ? one is an >> auxiliary directory file mapping inodes to many changelog entries, the >> second is to embed forward and backward pointers in the changelog >> entries to build a linked list rooted at the inode (using an EA in the >> inode pointing to the first and last element of the list). Both have >> some overheads. What are your thoughts? >> >> Peter >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel >> > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Peter Braam wrote:> > On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote: > > >> I actually added a "previous record" pointer in each changelog entry, >> but fill it in only where it is cheap -- when the metadata object is >> already in the cache I record the last changelog entry there. If it''s >> not in the cache, I don''t know where the last record associated with >> that fid is. We could store the last record number with the inode (EA?), >> but that would potentially be painful if we are recording e.g. file >> open/closes. >> > > Previous records are free - you get the previous one from the EA in the > inode, and replace the inode with the record info of the record you are > adding. But for rename operations and others there are multiple pointers > like this needed. >Currently I''m not reading or writing any EA for the changelog. Yes, if you want to tie in the fwd/back ptrs to the inode itself we need to do this, but I thought we were specifically discussing alternatives to doing that here (e.g. "auxiliary directory file mapping inodes to many changelog entries".) If we are e.g. recording every open/close for a file, do we really want to read/write the EA on the MDT every time, in addition to the changelog llog entry?> Secondly, to make the changelogs useful and scalable for filesets we > will need to be able to list all changelog entries associated with a > certain inode efficiently. I see two ways to do this ? one is an > auxiliary directory file mapping inodes to many changelog entries, the > second is to embed forward and backward pointers in the changelog > entries to build a linked list rooted at the inode (using an EA in the > inode pointing to the first and last element of the list). Both have > some overheads. What are your thoughts? > >
On 9/24/08 5:46 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote:> Peter Braam wrote: >> >> On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote: >> >> >>> I actually added a "previous record" pointer in each changelog entry, >>> but fill it in only where it is cheap -- when the metadata object is >>> already in the cache I record the last changelog entry there. If it''s >>> not in the cache, I don''t know where the last record associated with >>> that fid is. We could store the last record number with the inode (EA?), >>> but that would potentially be painful if we are recording e.g. file >>> open/closes. >>> >> >> Previous records are free - you get the previous one from the EA in the >> inode, and replace the inode with the record info of the record you are >> adding. But for rename operations and others there are multiple pointers >> like this needed. >> > Currently I''m not reading or writing any EA for the changelog. Yes, if > you want to tie in the fwd/back ptrs to the inode itself we need to do > this, but I thought we were specifically discussing alternatives to > doing that here (e.g. "auxiliary directory file mapping inodes to many > changelog entries".)Good point.> If we are e.g. recording every open/close for a > file, do we really want to read/write the EA on the MDT every time, in > addition to the changelog llog entry?You are writing that inode anyway, so it doesn''t cost more I/O if the EA is embedded. Peter> >> Secondly, to make the changelogs useful and scalable for filesets we >> will need to be able to list all changelog entries associated with a >> certain inode efficiently. I see two ways to do this ? one is an >> auxiliary directory file mapping inodes to many changelog entries, the >> second is to embed forward and backward pointers in the changelog >> entries to build a linked list rooted at the inode (using an EA in the >> inode pointing to the first and last element of the list). Both have >> some overheads. What are your thoughts? >> >> > > >
The idea is ok, but scale may turn out to be different than expected. There could be a huge amount of filesets containing the word "Bin Laden", say 100,000 of them. Would two directories work? Peter On 9/23/08 3:38 PM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:> Peter Braam writes: >> Hi Nikita, Nathan - >> >> After some pondering I have come to two conclusions. >> >> To encode filesets, we need a tree that makes two iterations fast: >> >> 1. list all filesets that contain a certain object >> 2. list all objects in a certain fileset >> >> Is there a doubly indexed tree for this? > > I don''t know of a data-structure that provides ready solution for this. > As Alex pointed out k-d-trees do not fit there (they effectively use a > key composed of the alternating sequences of bits of the original keys), > and `geographical trees'' that data bases use to store 2-d data use very > special notion of proximity. > > But, thinking about a fileset as a generalized directory, isn''t this the > same problem as `list all files in directory'' and `find all parent > directories of a file''? It probably makes sense to use the same > mechanism to deal with directories and filesets. > > Our current solution is to use EA to keep track of all parent > directories, do you see any problems with applying it to filesets? > >> >> Peter > > Nikita.