In CMD, one directory can be striped over several MDTs according to the
hash value of each entry (calculated from its name).  Suppose we have N
MDTs and the hash range is 0 to MAX_HASH.  The first server keeps the
entries with hashes [0 ... MAX_HASH / N - 1], the second those with
hashes [MAX_HASH / N ... 2 * MAX_HASH / N - 1], and so on.  Currently
this uses the same hash policy as the one used on disk (the ldiskfs/ext3
hash), so when reading a striped directory the entries from the
different stripe objects can be mapped into the client-side cache
simply; this page cache is maintained only in the llite layer.  But
tying the CMD split-dir protocol to the on-disk hash does not seem like
a good idea, and it brings more problems when porting the MDT to kdmu.

This dir-entry page cache should be moved to the mdc layer, and each
stripe object will have its own page cache.  Locating an entry in the
page cache will then require two lookups: first locate the stripe object
(ll_stripe_offset will be added to ll_file_data to record the stripe
offset), then get the page by offset (f_pos) inside the stripe object.
The entry page cache can also be organized to suit different purposes,
for example readdir-plus or dir-extent locks.  Ideally we could reuse
cl_page at the mdc layer, but that might need object layering on the
metadata stack.  As a first step we can probably register some page
callbacks for mdc to manage the page cache.
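For illustration, here is a minimal sketch of the hash-range split
described above, assuming a 32-bit hash range and evenly sized slices;
the names and constants are invented for this example and are not the
actual Lustre symbols.

#include <stdint.h>

#define MAX_HASH 0xFFFFFFFFULL          /* assumed 32-bit hash range */

/*
 * Map an entry's hash to the MDT stripe that stores it: the hash space
 * [0, MAX_HASH] is divided into N contiguous, equal-width slices.
 */
static unsigned int hash_to_stripe(uint64_t hash, unsigned int nstripes)
{
        uint64_t slice = (MAX_HASH + 1) / nstripes;  /* width of one slice */
        unsigned int idx = (unsigned int)(hash / slice);

        /* Guard against rounding when MAX_HASH + 1 is not divisible by N. */
        return idx < nstripes ? idx : nstripes - 1;
}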
On 2010-03-22, at 17:09, Tom.Wang wrote:
> In CMD, one directory can be striped over several MDTs according to
> the hash value of each entry (calculated from its name).  Suppose we
> have N MDTs and the hash range is 0 to MAX_HASH.  [...]

Tom, could you please explain the proposed mechanism for hashing in
this scheme?  Will there be one hash function at the LMV level to
select the dirstripe, and a second one at the MDC level?  Does this
imply that the client still needs to know the hashing scheme used by
the backing storage?  At least this allows a different hash per
dirstripe, which is important for DMU because the hash seed is
different for each directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hello,

Andreas Dilger wrote:
> Tom, could you please explain the proposed mechanism for hashing in
> this scheme?  Will there be one hash function at the LMV level to
> select the dirstripe, and a second one at the MDC level?  Does this
> imply that the client still needs to know the hashing scheme used by
> the backing storage?  At least this allows a different hash per
> dirstripe, which is important for DMU because the hash seed is
> different for each directory.

The client does not need to know the hash scheme of the backing
storage.  LMV will use a new hash function to select the stripe object
(mdc), which can be independent of the one used in the storage.  At the
mdc level we just need to map the entries of each dir stripe object
into the cache, and we can index the cache any way we want; hash order
(matching the server storage) is probably a good choice, because the
client can then easily find and cancel pages by hash for a later
dir-extent lock.  Note that even in this case the client does not need
to know the server hash scheme at all, since the server will set the
hash offset of these pages and the client just needs to insert them
into the cache by hash offset.

Currently, the cache will only be touched by readdir.  If the cache is
later used by readdir-plus, i.e. we need to locate an entry by name,
then the client must use the same hash as the server storage, but the
server will tell the client which hash function it uses.  Yes, a
different hash per dirstripe should not be a problem here.

Thanks
WangDi
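To make the two-level lookup concrete, here is a hedged sketch in C.
The structures, the FNV-based LMV hash, and the page-index scheme are
all assumptions made for illustration; they are not the real LMV/MDC
interfaces.

#include <stddef.h>
#include <stdint.h>

struct dir_page;                        /* one cached page of entries */

struct mdc_stripe {
        struct dir_page **ms_pages;     /* per-stripe page cache */
        size_t            ms_npages;
};

struct lmv_dir {
        struct mdc_stripe *ld_stripes;  /* one stripe object per MDT */
        unsigned int       ld_nstripes;
};

/* Level 1: an LMV hash, independent of the on-disk hash, picks the stripe. */
static unsigned int lmv_name_to_stripe(const char *name, size_t len,
                                       unsigned int nstripes)
{
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a, as an example */

        while (len--)
                h = (h ^ (unsigned char)*name++) * 1099511628211ULL;
        return (unsigned int)(h % nstripes);
}

/* Level 2: within a stripe, pages are found by the server-set hash offset. */
static struct dir_page *mdc_page_lookup(struct lmv_dir *dir,
                                        unsigned int stripe,
                                        uint64_t hash_offset)
{
        struct mdc_stripe *ms = &dir->ld_stripes[stripe];

        /* Toy indexing: the real cache could be a radix tree keyed by offset. */
        return ms->ms_pages[(size_t)(hash_offset % ms->ms_npages)];
}

The point of the split is that only the second level ever sees the
server's hash values, and it treats them purely as opaque offsets
supplied by the server.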
Nikita Danilov wrote:
> If I understand correctly, the scheme has the advantage of a cleaner
> solution for readdir scalability than the "hash adjust" hack of CMD3
> (see the comment in lmv/lmv_obd.c:lmv_readpage()).  [...]  The
> solution is to cyclically shift the hash space by an offset depending
> on the client, so that clients load the servers uniformly.  With
> 2-level hashing, this shifting can be done entirely within the new
> LMV hash function.

Yes, in this case these clients should start readdir from different
stripe offsets.

Thanks
WangDi
On 23 March 2010 14:29, Tom.Wang <Tom.Wang at sun.com> wrote:
> Hello,

Hello,

[...]

> LMV will use a new hash function to select the stripe object (mdc),
> which can be independent of the one used in the storage.  [...]
> Yes, a different hash per dirstripe should not be a problem here.

If I understand correctly, the scheme has the advantage of a cleaner
solution for readdir scalability than the "hash adjust" hack of CMD3
(see the comment in lmv/lmv_obd.c:lmv_readpage()).  The problem, to
remind, is that if a number of clients readdir the same split
directory, they all hammer the same servers one after another,
negating the advantages of metadata clustering.  The solution is to
cyclically shift the hash space by an offset depending on the client,
so that clients load the servers uniformly.  With 2-level hashing,
this shifting can be done entirely within the new LMV hash function.

> Thanks
> WangDi

Thank you,
Nikita.
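A minimal sketch of the client-dependent shift described above,
assuming the client NID is used as the rotation seed (an illustrative
choice, not a settled one).

#include <stdint.h>

/* Each client starts readdir at its own stripe... */
static unsigned int readdir_start_stripe(uint64_t client_nid,
                                         unsigned int nstripes)
{
        return (unsigned int)(client_nid % nstripes);
}

/* ...and walks the remaining stripes round-robin from that point. */
static unsigned int readdir_nth_stripe(unsigned int start, unsigned int i,
                                       unsigned int nstripes)
{
        return (start + i) % nstripes;
}

With many clients reading the same split directory, this spreads their
first readdir RPCs evenly over the MDTs instead of sending them all to
stripe 0.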
Hi Nikita,

On 2010-03-23, at 10:15, Nikita Danilov wrote:
> If I understand correctly, the scheme has the advantage of a cleaner
> solution for readdir scalability than the "hash adjust" hack of CMD3
> (see the comment in lmv/lmv_obd.c:lmv_readpage()).

Yes, I saw this for the first time just a few days ago and had to
shield my eyes :-).

> The problem, to remind, is that if a number of clients readdir the
> same split directory, they all hammer the same servers one after
> another, negating the advantages of metadata clustering.  The
> solution is to cyclically shift the hash space by an offset depending
> on the client, so that clients load the servers uniformly.  With
> 2-level hashing, this shifting can be done entirely within the new
> LMV hash function.

Sure.  Even without the 2-level hashing I wondered why the readdir
pages weren't simply pre-fetched from a random MDT index (N %
nstripes) by each client into its cache.

One question is whether it is important for applications to get
entries back from readdir in a consistent order between invocations,
which would imply that N should be persistent across calls (e.g. NID).
If it is important for the application to get the entries in the same
order, it would mean higher latency on some clients to return the
first page, but thereafter the clients would all be pre-fetching
round-robin.  In aggregate it would speed up performance, however, by
distributing the RPC traffic more evenly, and also by starting the
disk IO on all of the MDTs concurrently instead of in series.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On 23 March 2010 21:23, Andreas Dilger <adilger at sun.com> wrote:
> Hi Nikita,

Hello Andreas,

[...]

> Yes, I saw this for the first time just a few days ago and had to
> shield my eyes :-).

Hehe, wild days, happy memories.

> Sure.  Even without the 2-level hashing I wondered why the readdir
> pages weren't simply pre-fetched from a random MDT index (N %
> nstripes) by each client into its cache.

Do you mean that a client reads from servers in the order N, N + 1,
..., N - 1?  Or that all clients read pages from servers in the same
order 1, 2, ..., nstripes, and in addition every client pre-fetches
from servers N, N + 1, ..., N - 1?

As to the first case, a directory entry hash value is used as the
value of the ->d_off field in struct dirent.  Historically this field
means a byte offset in a directory file, and the hash adjustment tries
to maintain its monotonicity across readdir iterations.

> One question is whether it is important for applications to get
> entries back from readdir in a consistent order between invocations,
> which would imply that N should be persistent across calls (e.g.
> NID).  [...]

POSIX doesn't guarantee readdir repeatability; I am not sure about NT.

Thank you,
Nikita.
On 2010-03-23, at 13:14, Nikita Danilov wrote:
> Do you mean that a client reads from servers in the order N, N + 1,
> ..., N - 1?  Or that all clients read pages from servers in the same
> order 1, 2, ..., nstripes, and in addition every client pre-fetches
> from servers N, N + 1, ..., N - 1?

What I meant was that each client starts its read on a different
stripe (e.g. dir_stripe_index = client_nid % num_stripes), and reads
(-ahead, optionally) chunks in (approximately?) round-robin order from
the starting index, but it still returns the readdir data back to
userspace as if it was reading starting at dir_stripe_index = 0.  That
implies that, depending on the client NID, the client may buffer up to
(num_stripes - 1) reads (pages or MB, depending on how much is read
per RPC) until it gets to the 0th stripe index and can start returning
entries from that stripe to userspace.  The readahead of directory
stripes should always be offset from where it is currently processing,
so that the clients continue to distribute load across MDTs even when
they are working from cache.

> As to the first case, a directory entry hash value is used as the
> value of the ->d_off field in struct dirent.  Historically this field
> means a byte offset in a directory file, and the hash adjustment
> tries to maintain its monotonicity across readdir iterations.

Agreed, though in the case of DMU backing storage (or even ldiskfs
with a different hash function) there will be entries in each page
read from each MDT that may have overlapping values.  There will need
to be some way to disambiguate hash X that was read from MDT 0 from
hash X read from MDT 1.

It may be that the prefetching described above will help this.  If the
client is doing hash-ordered reads from each MDT, it could merge the
entries on the client and return them to userspace in strict hash
order, even though the client doesn't know the underlying hash
function used.  Presumably, with a well-behaved hash function, the
entries in each stripe are uniformly distributed, so progress will be
made through all stripes in a relatively uniform manner (i.e. reads
will be going to all MDTs at about the same rate).

> POSIX doesn't guarantee readdir repeatability; I am not sure about
> NT.

In SUSv2 I didn't see any mention of entry ordering, per se, though
telldir() and seekdir() should presumably be repeatable for NFS
re-export, which implies that the user-visible ordering can't change
randomly between invocations, and it shouldn't change between clients
or clustered NFS export would fail.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
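A hedged sketch of the client-side merge suggested above: each stripe
delivers its cached entries in hash order, and the client repeatedly
hands userspace the entry with the smallest (hash, stripe index) pair,
so equal hashes coming from different MDTs are disambiguated by stripe
index without the client knowing the server's hash function.  The types
and helper are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

struct stripe_cursor {
        uint64_t sc_next_hash;  /* hash of the next unread entry */
        bool     sc_eof;        /* this stripe has no more entries */
};

/*
 * Pick the stripe whose next entry should be returned to userspace,
 * or -1 when every stripe is exhausted.  Ties on the hash are broken
 * by stripe index.
 */
static int pick_next_stripe(const struct stripe_cursor *cur,
                            unsigned int nstripes)
{
        int best = -1;
        unsigned int i;

        for (i = 0; i < nstripes; i++) {
                if (cur[i].sc_eof)
                        continue;
                if (best < 0 || cur[i].sc_next_hash < cur[best].sc_next_hash)
                        best = (int)i;
        }
        return best;
}

Because the merge only compares hash values already sitting in the
per-stripe caches, progress through the stripes stays roughly uniform
whenever the hash distributes names evenly, as noted above.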
Hello!

On Mar 24, 2010, at 6:23 PM, Andreas Dilger wrote:
> In SUSv2 I didn't see any mention of entry ordering, per se, though
> telldir() and seekdir() should presumably be repeatable for NFS
> re-export, which implies that the user-visible ordering can't change
> randomly between invocations, and it shouldn't change between clients
> or clustered NFS export would fail.

We just need to remember the starting MDS in the cookie, and then the
cookie would be 100% repeatable (or derive it in some other 100%
repeatable, client-neutral way, but of course without having all
clients always choose the same order for new readdirs of the same
dir).  We need to do this anyway so that we don't end up returning
entries we already returned in the past in a clustered NFS re-export
environment.

Bye,
Oleg
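A toy sketch of the cookie idea above, assuming the starting MDT index
is packed into the top 8 bits of a 64-bit readdir cookie and the hash
offset into the remaining 56; the field widths are arbitrary
assumptions for illustration.

#include <stdint.h>

#define COOKIE_HASH_BITS 56     /* assumed: low 56 bits hold the hash offset */
#define COOKIE_HASH_MASK ((1ULL << COOKIE_HASH_BITS) - 1)

static uint64_t dir_cookie_pack(unsigned int start_mdt, uint64_t hash)
{
        return ((uint64_t)start_mdt << COOKIE_HASH_BITS) |
               (hash & COOKIE_HASH_MASK);
}

static void dir_cookie_unpack(uint64_t cookie, unsigned int *start_mdt,
                              uint64_t *hash)
{
        *start_mdt = (unsigned int)(cookie >> COOKIE_HASH_BITS);
        *hash = cookie & COOKIE_HASH_MASK;
}

Because the starting MDT travels with the cookie, telldir()/seekdir()
and NFS re-export resume at the same point no matter which client
issues the follow-up readdir.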