Raghavendra Gowdappa
2016-Nov-02 04:08 UTC
[Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
----- Original Message -----
> From: "Keiviw" <keiviw at 163.com>
> To: gluster-devel at gluster.org
> Sent: Tuesday, November 1, 2016 12:41:02 PM
> Subject: [Gluster-devel] A question of GlusterFS dentries!
>
> Hi,
> In GlusterFS distributed volumes, listing a non-empty directory was slow.
> Then I read the dht code and found the reason. But I was confused that
> GlusterFS dht traversed all the bricks (in the volume) sequentially. Why not
> use multiple threads to read dentries from multiple bricks simultaneously?
> That's a question that has always puzzled me. Could you please tell me
> something about this?

readdir across subvols is sequential mostly because we have to support rewinddir(3). We need to maintain the mapping between offset and dentry across multiple invocations of readdir. In other words, if someone did a rewinddir to an offset corresponding to an earlier dentry, subsequent readdirs should return the same set of dentries that the earlier invocation of readdir returned. For example, in a hypothetical scenario, readdir returned the following dentries:

1. a, off=10
2. b, off=2
3. c, off=5
4. d, off=15
5. e, off=17
6. f, off=13

Now if we did a rewinddir to off 5 and issued readdir again, we should get the following dentries:
(c, off=5), (d, off=15), (e, off=17), (f, off=13)

Within a subvol, the backend filesystem provides the rewinddir guarantee for the dentries present on that subvol. However, across subvols it is the responsibility of DHT to provide the above guarantee. This means we need a well-defined order in which we send readdir calls (note that the order is not well defined if we do a parallel readdir across all subvols). So DHT does a sequential readdir, which gives a well-defined order of reading dentries.

To give an example, suppose we have another subvol - subvol2 - (in addiction to the subvol above - say subvol1) with the following listing:

1. g, off=16
2. h, off=20
3. i, off=3
4. j, off=19

With parallel readdir we can have many orderings, like (a, b, g, h, i, c, d, e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now if we do (with readdir done in parallel):

1. A complete listing of the directory (which can be any one of 10P1 = 10 ways - I hope math is correct here).
2. Do rewinddir (20)

we cannot predict which set of dentries comes _after_ offset 20. However, if we do a readdir sequentially across subvols, there is only one directory listing, i.e. (a, b, c, d, e, f, g, h, i, j). So it's easier to support rewinddir.

If there were no POSIX requirement for rewinddir support, I think a parallel readdir could easily be implemented (which improves performance too). But unfortunately rewinddir is still a POSIX requirement. This also opens up another possibility of a "no-rewinddir-support" option in DHT, which if enabled results in parallel readdirs across subvols. What I am not sure about is how many users still use rewinddir. If there is a critical mass which wants performance with a tradeoff of no rewinddir support, this can be a good feature.

+gluster-users to get an opinion on this.

regards,
Raghavendra

> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
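
The argument above can be sketched as a small simulation. This is only a toy model, not DHT code: the two subvol listings are the ones from the message, while the function names (sequential_listing, parallel_listing, resume_from) are made up for illustration, and parallel readdir is assumed to interleave replies one entry at a time. It follows the message's convention that resuming at an offset returns the entry carrying that offset first.

```python
import random

# Per-subvol listings from the example above: (name, offset) in backend order.
SUBVOL1 = [("a", 10), ("b", 2), ("c", 5), ("d", 15), ("e", 17), ("f", 13)]
SUBVOL2 = [("g", 16), ("h", 20), ("i", 3), ("j", 19)]


def sequential_listing(subvols):
    """Read each subvol to the end before moving on to the next one."""
    return [entry for subvol in subvols for entry in subvol]


def parallel_listing(subvols, rng):
    """Toy model of parallel readdir: replies arrive in an arbitrary
    interleaving, but entries of one subvol keep their backend order."""
    cursors = [0] * len(subvols)
    listing = []
    while True:
        pending = [i for i, sv in enumerate(subvols) if cursors[i] < len(sv)]
        if not pending:
            return listing
        i = rng.choice(pending)  # whichever subvol happens to answer next
        listing.append(subvols[i][cursors[i]])
        cursors[i] += 1


def resume_from(listing, offset):
    """Names returned when reading resumes at `offset` (hypothetical helper;
    the entry carrying that offset is returned first)."""
    start = [off for _, off in listing].index(offset)
    return [name for name, _ in listing[start:]]


subvols = [SUBVOL1, SUBVOL2]

# Sequential: there is exactly one listing, so the answer is always the same.
print(resume_from(sequential_listing(subvols), 20))  # ['h', 'i', 'j']

# Parallel: the answer depends on how the replies happened to interleave.
for seed in range(3):
    print(resume_from(parallel_listing(subvols, random.Random(seed)), 20))
```

In the sequential case the entries after offset 20 are fixed; in the parallel case each run may report a different tail, which is the unpredictability the message describes.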
Raghavendra Gowdappa
2016-Nov-02 04:21 UTC
[Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Keiviw" <keiviw at 163.com>
> Cc: gluster-devel at gluster.org, "gluster-users" <gluster-users at gluster.org>
> Sent: Wednesday, November 2, 2016 9:38:46 AM
> Subject: Re: [Gluster-devel] A question of GlusterFS dentries!
>
> [...]
> To give an example if we have another subvol - subvol2 - (in addiction to the

s/addiction/addition/

> subvol above - say subvol1) with following listing:
> [...]
> 1. A complete listing of the directory (which can be any one of 10P1 = 10

I think it is 10P10 = 3628800. But again, it is not a completely random selection, as readdir on a single subvol still gives one ordering, so the value is much less. The point here is that there can be many possible listings with parallel readdir.

> ways - I hope math is correct here).
> [...]
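
The two counts in this correction can be checked with a few lines of arithmetic. The sketch below assumes an idealized model in which parallel readdir interleaves the two subvols one entry at a time while each subvol keeps its own order; count_interleavings is a hypothetical helper written for this note. Real readdir replies arrive in batches, so the number of listings seen in practice would be smaller still.

```python
from math import comb, factorial


def count_interleavings(n1, n2):
    """Count merges of two sequences of lengths n1 and n2 that preserve the
    internal order of each sequence (simple dynamic programming)."""
    ways = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            # The last entry of a merge comes from sequence 1 or sequence 2.
            ways[i][j] = ways[i - 1][j] + ways[i][j - 1]
    return ways[n1][n2]


# 10P10 = 10!: every permutation of the ten dentries.
print(factorial(10))              # 3628800

# subvol1 has 6 dentries and subvol2 has 4, each in a fixed order.
print(count_interleavings(6, 4))  # 210
print(comb(10, 4))                # 210, the closed form C(10, 4)
```

So under this model there are 210 possible listings rather than 3628800, which is still far more than the single listing produced by a sequential readdir.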
Serkan Çoban
2016-Nov-02 04:24 UTC
[Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
+1 for the "no-rewinddir-support" option in DHT. We are seeing very slow directory listings, especially on a volume with 1500+ bricks: 'ls' takes 20+ seconds with 1000+ files.

On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
> [...]
> This also opens up another possibility of a "no-rewinddir-support" option in DHT, which if enabled results in parallel readdirs across subvols.
> [...]
Keiviw
2016-Nov-02 11:54 UTC
[Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
What is rewinddir() used for? In other words, in which situations do we use rewinddir?

At 2016-11-02 12:08:46, "Raghavendra Gowdappa" <rgowdapp at redhat.com> wrote:
> [...]
> readdir across subvols is sequential mostly because we have to support rewinddir(3).
> [...]
Raghavendra G
2016-Nov-03 04:52 UTC
[Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
On Wed, Nov 2, 2016 at 9:38 AM, Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
> [...]
> readdir across subvols is sequential mostly because we have to support
> rewinddir(3).

Sorry, seekdir(3) is the more relevant function here. Since rewinddir resets the directory stream to the beginning, it's not much of a difficulty to support rewinddir with parallel readdirs across subvols.

> [...]

--
Raghavendra G
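
The difference between the two calls can be illustrated with the two parallel orderings quoted earlier in the thread. This is a toy model under stated assumptions, not GlusterFS behaviour: read_all and read_from are hypothetical helpers, the saved position is taken to be the offset of the next unread entry, and the stream is assumed to be served with a different interleaving after the application saves its position.

```python
# Offsets from the thread's example, and the two parallel-readdir orderings
# it mentions, written as (name, offset) pairs.
OFFSETS = {"a": 10, "b": 2, "c": 5, "d": 15, "e": 17, "f": 13,
           "g": 16, "h": 20, "i": 3, "j": 19}
ORDER_1 = [(n, OFFSETS[n]) for n in "abghicdefj"]
ORDER_2 = [(n, OFFSETS[n]) for n in "ghabcijdef"]


def read_all(listing):
    """rewinddir-style read: start at the beginning and read to the end."""
    return [name for name, _ in listing]


def read_from(listing, offset):
    """seekdir-style read: resume at the entry carrying `offset`."""
    start = [off for _, off in listing].index(offset)
    return [name for name, _ in listing[start:]]


# The application reads four entries of ORDER_1 and saves its position.
seen = read_all(ORDER_1)[:4]   # a, b, g, h
saved = ORDER_1[4][1]          # offset of the next unread entry, "i"

# Later the directory is served with a different interleaving, ORDER_2.
after_seek = seen + read_from(ORDER_2, saved)
after_rewind = read_all(ORDER_2)

print(sorted(set(OFFSETS) - set(after_seek)))    # ['c']: lost after the seek
print(sorted(set(OFFSETS) - set(after_rewind)))  # []: nothing lost on rewind
```

Restarting from the beginning always yields every dentry, whatever the interleaving, whereas resuming at a saved offset in a differently ordered listing skips "c" in this example. That is why seekdir, rather than rewinddir, is the call that needs a stable ordering.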