In CMD, one directory can be striped over several MDTs according to the
hash value of each entry (calculated from its name).  Suppose we have N
MDTs and the hash range is 0 to MAX_HASH.  The first server keeps the
entries with hashes [0 ... MAX_HASH / N - 1], the second those with
hashes [MAX_HASH / N ... 2 * MAX_HASH / N - 1], and so on.  Currently
this uses the same hash policy as the one used on disk (the ldiskfs/ext3
hash), so when reading a striped directory the entries from the
different stripe objects can be mapped into the client-side cache
simply; this page cache is maintained only in the llite layer.  But
tying the CMD split-dir protocol to the on-disk hash does not seem like
a good idea, and it brings more problems when porting the MDT to kdmu.

This dir-entry page cache should be moved to the mdc layer, and each
stripe object will have its own page cache.  Locating an entry in the
page cache will then require two lookups: first locate the stripe object
(ll_stripe_offset will be added to ll_file_data to record the stripe
offset), then get the page by offset (f_pos) inside the stripe object.
The entry page cache can also be organized to suit different purposes,
for example readdir-plus or dir-extent locks.  Ideally we could reuse
cl_page at the mdc layer, but that might need object layering on the
metadata stack.  As a first step we can probably register some page
callbacks for mdc to manage the page cache.
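For illustration, here is a minimal sketch of the hash-range split
described above, assuming a 32-bit hash range and evenly sized slices;
the names and constants are invented for this example and are not the
actual Lustre symbols.

#include <stdint.h>

#define MAX_HASH 0xFFFFFFFFULL          /* assumed 32-bit hash range */

/*
 * Map an entry's hash to the MDT stripe that stores it: the hash space
 * [0, MAX_HASH] is divided into N contiguous, equal-width slices.
 */
static unsigned int hash_to_stripe(uint64_t hash, unsigned int nstripes)
{
        uint64_t slice = (MAX_HASH + 1) / nstripes;  /* width of one slice */
        unsigned int idx = (unsigned int)(hash / slice);

        /* Guard against rounding when MAX_HASH + 1 is not divisible by N. */
        return idx < nstripes ? idx : nstripes - 1;
}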
On 2010-03-22, at 17:09, Tom.Wang wrote:
> In CMD, one directory can be striped over several MDTs according to
> the hash value of each entry (calculated from its name).  Suppose we
> have N MDTs and the hash range is 0 to MAX_HASH.  [...]

Tom, could you please explain the proposed mechanism for hashing in
this scheme?  Will there be one hash function at the LMV level to
select the dirstripe, and a second one at the MDC level?  Does this
imply that the client still needs to know the hashing scheme used by
the backing storage?  At least this allows a different hash per
dirstripe, which is important for DMU because the hash seed is
different for each directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hello,

Andreas Dilger wrote:
> Tom, could you please explain the proposed mechanism for hashing in
> this scheme?  Will there be one hash function at the LMV level to
> select the dirstripe, and a second one at the MDC level?  Does this
> imply that the client still needs to know the hashing scheme used by
> the backing storage?  At least this allows a different hash per
> dirstripe, which is important for DMU because the hash seed is
> different for each directory.

The client does not need to know the hash scheme of the backing
storage.  LMV will use a new hash function to select the stripe object
(mdc), which can be independent of the one used in the storage.  At the
mdc level we just need to map the entries of each dir stripe object
into the cache, and we can index the cache any way we want; hash order
(matching the server storage) is probably a good choice, because the
client can then easily find and cancel pages by hash for a later
dir-extent lock.  Note that even in this case the client does not need
to know the server hash scheme at all, since the server will set the
hash offset of these pages and the client just needs to insert them
into the cache by hash offset.

Currently, the cache will only be touched by readdir.  If the cache is
later used by readdir-plus, i.e. we need to locate an entry by name,
then the client must use the same hash as the server storage, but the
server will tell the client which hash function it uses.  Yes, a
different hash per dirstripe should not be a problem here.

Thanks
WangDi
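To make the two-level lookup concrete, here is a hedged sketch in C.
The structures, the FNV-based LMV hash, and the page-index scheme are
all assumptions made for illustration; they are not the real LMV/MDC
interfaces.

#include <stddef.h>
#include <stdint.h>

struct dir_page;                        /* one cached page of entries */

struct mdc_stripe {
        struct dir_page **ms_pages;     /* per-stripe page cache */
        size_t            ms_npages;
};

struct lmv_dir {
        struct mdc_stripe *ld_stripes;  /* one stripe object per MDT */
        unsigned int       ld_nstripes;
};

/* Level 1: an LMV hash, independent of the on-disk hash, picks the stripe. */
static unsigned int lmv_name_to_stripe(const char *name, size_t len,
                                       unsigned int nstripes)
{
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a, as an example */

        while (len--)
                h = (h ^ (unsigned char)*name++) * 1099511628211ULL;
        return (unsigned int)(h % nstripes);
}

/* Level 2: within a stripe, pages are found by the server-set hash offset. */
static struct dir_page *mdc_page_lookup(struct lmv_dir *dir,
                                        unsigned int stripe,
                                        uint64_t hash_offset)
{
        struct mdc_stripe *ms = &dir->ld_stripes[stripe];

        /* Toy indexing: the real cache could be a radix tree keyed by offset. */
        return ms->ms_pages[(size_t)(hash_offset % ms->ms_npages)];
}

The point of the split is that only the second level ever sees the
server's hash values, and it treats them purely as opaque offsets
supplied by the server.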
Nikita Danilov wrote:
> If I understand correctly, the scheme has the advantage of a cleaner
> solution for readdir scalability than the "hash adjust" hack of CMD3
> (see the comment in lmv/lmv_obd.c:lmv_readpage()).  [...]  The
> solution is to cyclically shift the hash space by an offset depending
> on the client, so that clients load the servers uniformly.  With
> 2-level hashing, this shifting can be done entirely within the new
> LMV hash function.

Yes, in this case these clients should start readdir from different
stripe offsets.

Thanks
WangDi
On 23 March 2010 14:29, Tom.Wang <Tom.Wang at sun.com> wrote:
> Hello,

Hello,

[...]

> LMV will use a new hash function to select the stripe object (mdc),
> which can be independent of the one used in the storage.  [...]
> Yes, a different hash per dirstripe should not be a problem here.

If I understand correctly, the scheme has the advantage of a cleaner
solution for readdir scalability than the "hash adjust" hack of CMD3
(see the comment in lmv/lmv_obd.c:lmv_readpage()).  The problem, to
remind, is that if a number of clients readdir the same split
directory, they all hammer the same servers one after another,
negating the advantages of metadata clustering.  The solution is to
cyclically shift the hash space by an offset depending on the client,
so that clients load the servers uniformly.  With 2-level hashing,
this shifting can be done entirely within the new LMV hash function.

> Thanks
> WangDi

Thank you,
Nikita.
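A minimal sketch of the client-dependent shift described above,
assuming the client NID is used as the rotation seed (an illustrative
choice, not a settled one).

#include <stdint.h>

/* Each client starts readdir at its own stripe... */
static unsigned int readdir_start_stripe(uint64_t client_nid,
                                         unsigned int nstripes)
{
        return (unsigned int)(client_nid % nstripes);
}

/* ...and walks the remaining stripes round-robin from that point. */
static unsigned int readdir_nth_stripe(unsigned int start, unsigned int i,
                                       unsigned int nstripes)
{
        return (start + i) % nstripes;
}

With many clients reading the same split directory, this spreads their
first readdir RPCs evenly over the MDTs instead of sending them all to
stripe 0.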
Hi Nikita,

On 2010-03-23, at 10:15, Nikita Danilov wrote:
> If I understand correctly, the scheme has the advantage of a cleaner
> solution for readdir scalability than the "hash adjust" hack of CMD3
> (see the comment in lmv/lmv_obd.c:lmv_readpage()).

Yes, I saw this for the first time just a few days ago and had to
shield my eyes :-).

> The problem, to remind, is that if a number of clients readdir the
> same split directory, they all hammer the same servers one after
> another, negating the advantages of metadata clustering.  The
> solution is to cyclically shift the hash space by an offset depending
> on the client, so that clients load the servers uniformly.  With
> 2-level hashing, this shifting can be done entirely within the new
> LMV hash function.

Sure.  Even without the 2-level hashing I wondered why the readdir
pages weren't simply pre-fetched from a random MDT index (N %
nstripes) by each client into its cache.

One question is whether it is important for applications to get
entries back from readdir in a consistent order between invocations,
which would imply that N should be persistent across calls (e.g. NID).
If it is important for the application to get the entries in the same
order, it would mean higher latency on some clients to return the
first page, but thereafter the clients would all be pre-fetching
round-robin.  In aggregate it would speed up performance, however, by
distributing the RPC traffic more evenly, and also by starting the
disk IO on all of the MDTs concurrently instead of in series.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On 23 March 2010 21:23, Andreas Dilger <adilger at sun.com> wrote:
> Hi Nikita,

Hello Andreas,

[...]

> Yes, I saw this for the first time just a few days ago and had to
> shield my eyes :-).

Hehe, wild days, happy memories.

> Sure.  Even without the 2-level hashing I wondered why the readdir
> pages weren't simply pre-fetched from a random MDT index (N %
> nstripes) by each client into its cache.

Do you mean that a client reads from servers in the order N, N + 1,
..., N - 1?  Or that all clients read pages from servers in the same
order 1, 2, ..., nstripes, and in addition every client pre-fetches
from servers N, N + 1, ..., N - 1?

As to the first case, a directory entry hash value is used as the
value of the ->d_off field in struct dirent.  Historically this field
means a byte offset in a directory file, and the hash adjustment tries
to maintain its monotonicity across readdir iterations.

> One question is whether it is important for applications to get
> entries back from readdir in a consistent order between invocations,
> which would imply that N should be persistent across calls (e.g.
> NID).  [...]

POSIX doesn't guarantee readdir repeatability; I am not sure about NT.

Thank you,
Nikita.
On 2010-03-23, at 13:14, Nikita Danilov wrote:
> Do you mean that a client reads from servers in the order N, N + 1,
> ..., N - 1?  Or that all clients read pages from servers in the same
> order 1, 2, ..., nstripes, and in addition every client pre-fetches
> from servers N, N + 1, ..., N - 1?

What I meant was that each client starts its read on a different
stripe (e.g. dir_stripe_index = client_nid % num_stripes), and reads
(-ahead, optionally) chunks in (approximately?) round-robin order from
the starting index, but it still returns the readdir data back to
userspace as if it was reading starting at dir_stripe_index = 0.  That
implies that, depending on the client NID, the client may buffer up to
(num_stripes - 1) reads (pages or MB, depending on how much is read
per RPC) until it gets to the 0th stripe index and can start returning
entries from that stripe to userspace.  The readahead of directory
stripes should always be offset from where it is currently processing,
so that the clients continue to distribute load across MDTs even when
they are working from cache.

> As to the first case, a directory entry hash value is used as the
> value of the ->d_off field in struct dirent.  Historically this field
> means a byte offset in a directory file, and the hash adjustment
> tries to maintain its monotonicity across readdir iterations.

Agreed, though in the case of DMU backing storage (or even ldiskfs
with a different hash function) there will be entries in each page
read from each MDT that may have overlapping values.  There will need
to be some way to disambiguate hash X that was read from MDT 0 from
hash X read from MDT 1.

It may be that the prefetching described above will help this.  If the
client is doing hash-ordered reads from each MDT, it could merge the
entries on the client and return them to userspace in strict hash
order, even though the client doesn't know the underlying hash
function used.  Presumably, with a well-behaved hash function, the
entries in each stripe are uniformly distributed, so progress will be
made through all stripes in a relatively uniform manner (i.e. reads
will be going to all MDTs at about the same rate).

> POSIX doesn't guarantee readdir repeatability; I am not sure about
> NT.

In SUSv2 I didn't see any mention of entry ordering, per se, though
telldir() and seekdir() should presumably be repeatable for NFS
re-export, which implies that the user-visible ordering can't change
randomly between invocations, and it shouldn't change between clients
or clustered NFS export would fail.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
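A hedged sketch of the client-side merge suggested above: each stripe
delivers its cached entries in hash order, and the client repeatedly
hands userspace the entry with the smallest (hash, stripe index) pair,
so equal hashes coming from different MDTs are disambiguated by stripe
index without the client knowing the server's hash function.  The types
and helper are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

struct stripe_cursor {
        uint64_t sc_next_hash;  /* hash of the next unread entry */
        bool     sc_eof;        /* this stripe has no more entries */
};

/*
 * Pick the stripe whose next entry should be returned to userspace,
 * or -1 when every stripe is exhausted.  Ties on the hash are broken
 * by stripe index.
 */
static int pick_next_stripe(const struct stripe_cursor *cur,
                            unsigned int nstripes)
{
        int best = -1;
        unsigned int i;

        for (i = 0; i < nstripes; i++) {
                if (cur[i].sc_eof)
                        continue;
                if (best < 0 || cur[i].sc_next_hash < cur[best].sc_next_hash)
                        best = (int)i;
        }
        return best;
}

Because the merge only compares hash values already sitting in the
per-stripe caches, progress through the stripes stays roughly uniform
whenever the hash distributes names evenly, as noted above.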
Hello!

On Mar 24, 2010, at 6:23 PM, Andreas Dilger wrote:
> In SUSv2 I didn't see any mention of entry ordering, per se, though
> telldir() and seekdir() should presumably be repeatable for NFS
> re-export, which implies that the user-visible ordering can't change
> randomly between invocations, and it shouldn't change between clients
> or clustered NFS export would fail.

We just need to remember the starting MDS in the cookie, and then the
cookie would be 100% repeatable (or derive it in some other 100%
repeatable, client-neutral way, but of course without having all
clients always choose the same order for new readdirs of the same
dir).  We need to do this anyway so that we don't end up returning
entries we already returned in the past in a clustered NFS re-export
environment.

Bye,
Oleg
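A toy sketch of the cookie idea above, assuming the starting MDT index
is packed into the top 8 bits of a 64-bit readdir cookie and the hash
offset into the remaining 56; the field widths are arbitrary
assumptions for illustration.

#include <stdint.h>

#define COOKIE_HASH_BITS 56     /* assumed: low 56 bits hold the hash offset */
#define COOKIE_HASH_MASK ((1ULL << COOKIE_HASH_BITS) - 1)

static uint64_t dir_cookie_pack(unsigned int start_mdt, uint64_t hash)
{
        return ((uint64_t)start_mdt << COOKIE_HASH_BITS) |
               (hash & COOKIE_HASH_MASK);
}

static void dir_cookie_unpack(uint64_t cookie, unsigned int *start_mdt,
                              uint64_t *hash)
{
        *start_mdt = (unsigned int)(cookie >> COOKIE_HASH_BITS);
        *hash = cookie & COOKIE_HASH_MASK;
}

Because the starting MDT travels with the cookie, telldir()/seekdir()
and NFS re-export resume at the same point no matter which client
issues the follow-up readdir.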