If these dynamic layout changes are being considered, doing so for file data
first or in parallel might make sense.
I think that fixed directory striping patterns may in fact be fine for a
long time to come. Spreading directory entries over a particular pool, or
over any nodes using a certain width, may be all we need.
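
To make the fixed-pattern idea concrete, here is a minimal sketch of how an
entry name could be mapped to an MDT when the stripe set is chosen once, at
directory creation, and never changes. This is only an illustration under
stated assumptions: the function names, the use of CRC32 as the name hash,
and the representation of a layout as a plain list of MDT indices are all
hypothetical and are not taken from the CMD code.

```python
import zlib


def stripe_for_name(name: str, stripe_count: int) -> int:
    """Return the stripe index for a directory entry name.

    Hypothetical helper: CRC32 stands in for whatever name hash a real
    implementation would use.
    """
    if stripe_count < 1:
        raise ValueError("stripe_count must be >= 1")
    return zlib.crc32(name.encode("utf-8")) % stripe_count


def mdt_for_name(name: str, layout: list[int]) -> int:
    """Map an entry name to an MDT index.

    `layout` is the fixed list of MDT indices (for example, taken from a
    pool) that was assigned when the directory was created.
    """
    return layout[stripe_for_name(name, len(layout))]


if __name__ == "__main__":
    # A directory striped over four MDTs picked from some pool (assumed).
    layout = [3, 7, 9, 12]
    for entry in ("a.out", "results.dat", "checkpoint-0001"):
        print(entry, "-> MDT", mdt_for_name(entry, layout))
```

Because the layout never changes, every client computes the same placement
for a given name without extra coordination, and a width of 1 degenerates to
the ordinary single-MDT directory.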
Peter
On 9/16/08 5:05 AM, "Eric Barton" <eeb at sun.com> wrote:
> Guys,
>
> I'm cc-ing lustre-devel - it's of general interest.
>
> I definitely think the first CMD product releases should stick to
> static directory layouts with all directories contained within a
> single MDT by default - i.e. you have to do something special to
> create a striped dir. Meanwhile we should start ASAP on completing
> the design of automatic dir splitting to handle _all_ the recovery
> cases.
>
> IMHO, a one-time-only directory split seems a bit too all-or-nothing.
> What is the reasoning behind the assumption that further splitting is
> not required? Also, directory size doesn't necessarily seem like the
> only or even the best clue about when to split and how. So here are a
> couple of suggestions.
>
> 1. Consider the split from 1 MDT to several just as a special case of
> migration - i.e. allow arbitrary n->m re-layout over MDTs. We
> would like to support metadata migration in any case for space
> management. Andreas' ideas about migration by mirroring seem
> equally applicable to directories (furthermore mirrored directories
> seem like a valuable component of a namespace availability and
> resilience feature).
>
> 2. Keep the discussion on policy (when to split) separate from
> mechanism (how to split).
>
> Cheers,
> Eric
>
>>
>> -----Original Message-----
>> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of
>> Andreas Dilger
>> Sent: 16 September 2008 9:23 AM
>> To: Nikita Danilov
>> Cc: Yury Umanets; Eric Barton; Alex Zhuravlev
>> Subject: Re: CMD directory split
>>
>> On Sep 15, 2008 14:20 +0400, Nikita Danilov wrote:
>>> Andreas Dilger writes:
>>>> I think we can have a very simple directory split if we
>>>> consider the directory split like file IO instead of inserting
>>>> thousands of dirents.
>>>
>>> In some sense this is already done. Split (cmm/cmm_split.c) uses
>>> the same interface as readdir to construct pages full of directory
>>> entries and to send them to a slave mdt, where they are
>>> inserted. It is not a raw `write', because the actual directory page
>>> format is encapsulated within osd.
>>
>> Yes, I was thinking of something like this; I didn't know that is
>> how it is actually handled. The other issue is that the directory
>> creation and insertions should be done in a single transaction in
>> order to simplify recovery. In that case we don't have to worry
>> about the case where the directory is partially created and
>> populated.
>>
>>> It's my impression that another source of trouble with split is a
>>> complicated locking scheme that it requires to keep clients happy.
>>
>> How is the locking of the directory any different than the locking
>> on the LOV EA needed to restripe a file for migration? The locks on
>> the inodes themselves do not change, because the inodes are not
>> moving. The pages on the directory itself are revoked on a regular
>> basis whenever there is a new insertion in any case (i.e. when the
>> directory is split).
>>
>> It would seem that the only thing which needs to be changed is the
>> LMV EA on the clients.
>>
>>> It would be _much_ easier if directories were split at the time of
>>> creation like files are. That would also eliminate almost all
>>> recovery issues and page-shuffling mechanics.
>>
>> This might be possible for a simple initial implementation, but it
>> isn't a good long-term solution. Consider the problems we face even
>> today with widely-striped files - stat slowdowns to track the
>> size/mtime/ctime, readdir will always have to do RPCs to each MDT to
>> get the entries even if there are only a few entries, unlinks will
>> need multiple RPCs, etc.
>>
>> In contrast, we can tune the split threshold so that the majority of
>> small directories remain on a single MDT, and only large directories
>> are split (with some small overhead for the split).
>>
>> In the future we have to consider configurations with hundreds or
>> even thousands of MDTs, perhaps one on each OST, in order to scale
>> metadata and small file performance dramatically.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.