Hi Nathan -

I talked through the design with Nikita. After he had understood our constraints and I had understood his issues it all narrowed down to one important improvement that Nikita suggests: we must get a fast way to compute the pathname of a FID. The scanning and searching I suggested without an index is not tenable.

We had a couple of suggestions, such as storing parent fid and a name in the EA, or storing similar information in a large directory file.

Can you connect with Nikita and do this?

Peter
On 5/6/08 11:43 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote:

> Peter Braam wrote:
>> Hi Nathan -
>>
>> I talked through the design with Nikita. After he had understood our
>> constraints and I had understood his issues it all narrowed down to
>> one important improvement that Nikita suggests: we must get a fast
>> way to compute the pathname of a FID. The scanning and searching I
>> suggested without an index is not tenable.
>>
>> We had a couple of suggestions, such as storing parent fid and a name
>> in the EA, or storing similar information in a large directory file.
>>
>> Can you connect with Nikita and do this?
>
> We talked yesterday afternoon.
> Nikita has three concerns:
>
> 1. Global lock on namespace during pathname reconstruction.
>    I think we can eliminate this the following way:
>    a. lookup full path from fid, parent fid (remember the list of fids for
>       the entire path also)
>    b. lookup last transno
>    c. verify traversing down the full path name results in the same branch
>       and leaf fids all the way back down
>       i. if they don't match, repeat from a
>       ii. if they do match, we can backtrack starting from the transno in b
>           to regenerate the original name
>
> 2. Directory name lookup given the parent fid - this may be inefficient
>    if we have to read the parent directory in order to get the name (parent
>    object is not likely to be cached at lookup time).
>
> 3. Someone deletes one of the parents of a hardlinked file. If we only
>    store one parent, there's no way to regenerate a pathname if that parent
>    is the one that gets removed.
>
> For 2 and 3, we could store the directory name for each directory in an
> EA, and all the fids for all the parents in some other manner.
> But it seems to make more sense at this point to put all this
> information (fid, name, parent list) in a database file stored on the
> MDT. Then we just look through this database to generate our full path
> information; no need to lookup info in the file objects or EAs.
> Generating this database should be no more time consuming than writing
> the changelogs themselves, assuming a reasonable database structure like
> IAM.

Yes I agree with all of this.

Peter
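The retry scheme in point 1 (steps a to c) can be sketched roughly as follows. This is a toy in-memory model in Python, not Lustre code: `Namespace`, `ROOT`, `walk_up`, `walk_down`, `fid2path` and the `between_steps` hook are all hypothetical names introduced for illustration, the model is single-threaded, and the changelog backtracking of step c.ii is not shown.

```python
ROOT = 1  # hypothetical fid of the filesystem root


class Namespace:
    """Toy stand-in for the MDT namespace (assumption, not Lustre code)."""

    def __init__(self):
        self.transno = 0        # last committed transaction number
        self.up = {}            # fid -> (parent fid, name)
        self.down = {ROOT: {}}  # directory fid -> {name: child fid}

    def link(self, parent, name, fid, is_dir=False):
        self.down[parent][name] = fid
        self.up[fid] = (parent, name)
        if is_dir:
            self.down[fid] = {}
        self.transno += 1

    def rename(self, fid, new_parent, new_name):
        old_parent, old_name = self.up[fid]
        del self.down[old_parent][old_name]
        self.down[new_parent][new_name] = fid
        self.up[fid] = (new_parent, new_name)
        self.transno += 1


def walk_up(ns, fid):
    """Step a: follow (parent fid, name) records up to the root."""
    names, fids = [], []
    while fid != ROOT:
        parent, name = ns.up[fid]
        names.append(name)
        fids.append(fid)
        fid = parent
    names.reverse()
    fids.reverse()
    return names, fids


def walk_down(ns, names):
    """Step c: resolve the names from the root and collect the fids seen."""
    fids, cur = [], ROOT
    for name in names:
        cur = ns.down.get(cur, {}).get(name)
        if cur is None:
            return None  # a path component vanished underneath us
        fids.append(cur)
    return fids


def fid2path(ns, fid, between_steps=None, max_tries=10):
    """Return (path, transno) without holding a namespace-wide lock."""
    for _ in range(max_tries):
        names, fids = walk_up(ns, fid)     # step a
        transno = ns.transno               # step b
        if between_steps is not None:
            between_steps()                # hook to simulate a concurrent change
        if walk_down(ns, names) == fids:   # step c
            return "/" + "/".join(names), transno
        # step c.i: mismatch, so repeat from a
    raise RuntimeError("namespace kept changing during reconstruction")
```

For example, after `ns.link(ROOT, "a", 2, is_dir=True)`, `ns.link(2, "b", 3, is_dir=True)` and `ns.link(3, "f", 4)`, calling `fid2path(ns, 4)` returns `("/a/b/f", 3)`. If a rename of one of the directories slips in between steps b and c, the downward walk no longer matches and the loop simply starts over.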
Peter Braam writes:
 > On 5/6/08 11:43 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote:

[...]

 > >
 > > For 2 and 3, we could store the directory name for each directory in an
 > > EA, and all the fids for all the parents in some other manner.
 > > But it seems to make more sense at this point to put all this
 > > information (fid, name, parent list) in a database file stored on the
 > > MDT. Then we just look through this database to generate our full path

One advantage EA has over global data-base is that the former is more
resilient against file system corruption. This becomes more important if
we ever plan to use (parent-fid, name) information for things like fsck.

 > > information; no need to lookup info in the file objects or EAs.
 > > Generating this database should be no more time consuming than writing
 > > the changelogs themselves, assuming a reasonable database structure like
 > > IAM.

On a lower level note, I think that changelogs and parent-database are
better to be implemented as a new layer separate from mdd:

    - mdd code is already complicated enough,

    - separate layer can be inserted into stack optionally, avoiding
      run-time cost if change-logs are not needed (currently there is no
      way to insert a layer after initial configuration completes though).

 > >
 >
 > Yes I agree with all of this.
 >
 > Peter
 >

Nikita.
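A minimal sketch of the separate-layer idea, combined with the (fid, name, parent list) records discussed earlier in the thread. Everything here is a hypothetical illustration in Python rather than the real mdd interface: `Mdd`, `ChangelogLayer` and `build_stack` are made-up names, and the stack is chosen once at configuration time, matching the remark that a layer cannot currently be inserted later.

```python
class Mdd:
    """Stand-in for the metadata layer: only tracks names per directory."""

    def __init__(self):
        self.dirs = {}  # parent fid -> {name: fid}

    def link(self, parent, name, fid):
        self.dirs.setdefault(parent, {})[name] = fid

    def unlink(self, parent, name):
        return self.dirs[parent].pop(name)


class ChangelogLayer:
    """Optional layer above Mdd: records a changelog and a parent database."""

    def __init__(self, lower):
        self.lower = lower
        self.log = []      # changelog records, in order
        self.parents = {}  # fid -> set of (parent fid, name), one per hard link

    def link(self, parent, name, fid):
        self.lower.link(parent, name, fid)
        self.parents.setdefault(fid, set()).add((parent, name))
        self.log.append(("link", parent, name, fid))

    def unlink(self, parent, name):
        fid = self.lower.unlink(parent, name)
        self.parents[fid].discard((parent, name))
        self.log.append(("unlink", parent, name, fid))
        return fid


def build_stack(with_changelog):
    """Decide at configuration time whether the extra layer is stacked."""
    mdd = Mdd()
    return ChangelogLayer(mdd) if with_changelog else mdd
```

With the layer enabled, a file linked under two parents keeps one record per link, so removing one name still leaves a usable (parent fid, name) pair for path reconstruction. With the layer disabled, calls go straight to the lower layer and nothing is logged.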
On 5/8/08 8:48 AM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> Peter Braam writes:
>> On 5/6/08 11:43 AM, "Nathaniel Rutman" <Nathan.Rutman at Sun.COM> wrote:
>
> [...]
>
>>>
>>> For 2 and 3, we could store the directory name for each directory in an
>>> EA, and all the fids for all the parents in some other manner.
>>> But it seems to make more sense at this point to put all this
>>> information (fid, name, parent list) in a database file stored on the
>>> MDT. Then we just look through this database to generate our full path
>
> One advantage EA has over global data-base is that the former is more
> resilient against file system corruption. This becomes more important if
> we ever plan to use (parent-fid, name) information for things like fsck.
>
>>> information; no need to lookup info in the file objects or EAs.
>>> Generating this database should be no more time consuming than writing
>>> the changelogs themselves, assuming a reasonable database structure like
>>> IAM.
>
> On a lower level note, I think that changelogs and parent-database are
> better to be implemented as a new layer separate from mdd:
>
> - mdd code is already complicated enough,
>
> - separate layer can be inserted into stack optionally, avoiding
>   run-time cost if change-logs are not needed (currently there is no
>   way to insert a layer after initial configuration completes though).

Yes, find a good place. Just remember that things like pNFS integrated with the Lustre servers also need to replicate. In fact having this log purely at the DMU / ZFS level would be a valuable feature - there are no good replication solutions even for laptops today!

Peter

>
>>>
>>
>> Yes I agree with all of this.
>>
>> Peter
>>
>
> Nikita.