Hi,

The suggestion you gave was in fact considered at the time of writing the
shard translator. Here are some of the considerations for sticking with a
single directory, as opposed to a two-tier classification of shards based on
the initial characters of the uuid string:

i) Even for a 4TB disk with the smallest possible shard size of 4MB, there
will only be a maximum of 1048576 entries under /.shard in the worst case - a
number far smaller than the maximum number of inodes supported by most
backend file systems.

ii) Entry self-heal for a single directory, even in the simplest case of one
entry deleted/created while a replica is down, requires crawling the whole
sub-directory tree, figuring out which entries are present/absent between
source and sink, and then healing them to the sink. With granular entry
self-heal [1], we no longer have to live with this limitation.

iii) Resolving the original file name, as given by the application, to the
corresponding shard within a single directory (/.shard in the existing
scheme) means looking up the parent directory /.shard first, followed by a
lookup on the actual shard that is to be operated on. A two-tier
sub-directory structure means that we not only have to resolve (or look up)
/.shard first, but also the directories '/.shard/d2', '/.shard/d2/18', and
'/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally looking
up the shard, which is a lot of network operations. Yes, these are all
one-time operations and the results can be cached in the inode table, but
still, on account of having dynamic gfids (as opposed to just /.shard, which
has a fixed gfid - be318638-e8a0-4c6d-977d-7a937aa84806), it is trivial to
resolve the name of the shard to its gfid, or the parent name to the parent
gfid, _even_ in memory.

Are you unhappy with the performance? What's your typical VM image size,
shard block size and the capacity of individual bricks?

-Krutika

On Mon, Jul 18, 2016 at 2:43 PM, Gandalf Corvotempesta
<gandalf.corvotempesta at gmail.com> wrote:

> 2016-07-18 9:53 GMT+02:00 Oleksandr Natalenko <oleksandr at natalenko.name>:
> > I'd say, like this:
> >
> > /.shard/d2/18/D218CD1C-4BD9-40D7-9810-86B3F7932509.1
>
> Yes, something like this.
> I was on mobile when I wrote. Your suggestion is better than mine.
>
> Probably, using a directory for the whole sharded file is also better and
> keeps the directory structure clear:
>
> /.shard/d2/18/D218CD1C-4BD9-40D7-9810-86B3F7932509/D218CD1C-4BD9-40D7-9810-86B3F7932509.1
>
> The current shard directory structure doesn't scale at all.
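For readers skimming the archive, a rough back-of-the-envelope sketch of the
two numbers the message above relies on: the worst-case shard count per file,
and how many directory lookups each layout needs on a cold inode table before
the shard itself can be resolved. This is Python written purely for
illustration; it is not part of the original thread, and the function names
and layout labels are invented.

    # Illustrative only: reproduces the arithmetic from the message above.
    TB = 1024 ** 4
    MB = 1024 ** 2

    def shards_per_file(file_size, shard_size=4 * MB):
        """Worst-case number of shard entries one file contributes under /.shard."""
        return -(-file_size // shard_size)  # ceiling division

    def lookups_before_shard(layout):
        """Directory lookups needed (cold cache) before the shard file itself
        can be looked up."""
        if layout == "flat":      # /.shard/<gfid>.N -- /.shard has a fixed gfid
            return 1
        if layout == "two-tier":  # /.shard/d2/18/<gfid>/<gfid>.N -- dynamic gfids
            return 4
        raise ValueError(layout)

    print(shards_per_file(4 * TB))           # 1048576
    print(lookups_before_shard("flat"))      # 1
    print(lookups_before_shard("two-tier"))  # 4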
2016-07-18 12:25 GMT+02:00 Krutika Dhananjay <kdhananj at redhat.com>:

> Hi,
>
> The suggestion you gave was in fact considered at the time of writing the
> shard translator. Here are some of the considerations for sticking with a
> single directory, as opposed to a two-tier classification of shards based
> on the initial characters of the uuid string:
>
> i) Even for a 4TB disk with the smallest possible shard size of 4MB, there
> will only be a maximum of 1048576 entries under /.shard in the worst case -
> a number far smaller than the maximum number of inodes supported by most
> backend file systems.

That is with just one single file. What about thousands of huge sharded
files? In a petabyte-scale cluster, having thousands of huge files should be
considered normal.

> iii) Resolving the original file name, as given by the application, to the
> corresponding shard within a single directory (/.shard in the existing
> scheme) means looking up the parent directory /.shard first, followed by a
> lookup on the actual shard that is to be operated on. A two-tier
> sub-directory structure means that we not only have to resolve (or look up)
> /.shard first, but also the directories '/.shard/d2', '/.shard/d2/18', and
> '/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally looking
> up the shard, which is a lot of network operations. Yes, these are all
> one-time operations and the results can be cached in the inode table, but
> still, on account of having dynamic gfids (as opposed to just /.shard,
> which has a fixed gfid - be318638-e8a0-4c6d-977d-7a937aa84806), it is
> trivial to resolve the name of the shard to its gfid, or the parent name to
> the parent gfid, _even_ in memory.

What about just one single level?

/.shard/d218cd1c-4bd9-40d7-9810-86b3f7932509/d218cd1c-4bd9-40d7-9810-86b3f7932509.1

You have the GFID, so there is no need to crawl multiple levels: you get
direct access to the proper path.

With this solution, you have 1,048,576 entries in the directory of a 4TB
sharded file with a 4MB shard size. With the current implementation, you have
1,048,576 entries for each sharded file, all in the same directory: if I have
100 4TB files, I'll end up with 1,048,576 * 100 = 104,857,600 files in a
single directory.

> Are you unhappy with the performance? What's your typical VM image size,
> shard block size and the capacity of individual bricks?

No, I'm just thinking about this optimization.
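Again purely for illustration (Python, not from the thread; the helper names
are invented, and the per-file layout shown is only Gandalf's proposal, not
what the shard translator actually implements), the difference between the
two layouts being compared can be written out as path templates plus the
entry-count arithmetic:

    # Illustrative sketch of the two layouts under discussion.
    GFID = "d218cd1c-4bd9-40d7-9810-86b3f7932509"
    SHARDS_PER_4TB_FILE = (4 * 1024**4) // (4 * 1024**2)  # 1048576

    def shard_path_current(gfid, n):
        # Current layout: every shard of every file lands in one directory.
        return "/.shard/%s.%d" % (gfid, n)

    def shard_path_proposed(gfid, n):
        # Proposed layout: one directory per file, keyed directly by the
        # file's gfid, so it can be derived without intermediate levels.
        return "/.shard/%s/%s.%d" % (gfid, gfid, n)

    # Prints the proposed path for shard 1 of the example file.
    print(shard_path_proposed(GFID, 1))

    # Directory sizes for 100 files of 4TB each, with 4MB shards:
    print(100 * SHARDS_PER_4TB_FILE)  # 104857600 entries in /.shard (current)
    print(SHARDS_PER_4TB_FILE)        # 1048576 entries per directory (proposed)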
On Mon, Jul 18, 2016 at 3:55 PM, Krutika Dhananjay <kdhananj at redhat.com> wrote:

> Hi,
>
> The suggestion you gave was in fact considered at the time of writing the
> shard translator. Here are some of the considerations for sticking with a
> single directory, as opposed to a two-tier classification of shards based
> on the initial characters of the uuid string:
>
> i) Even for a 4TB disk with the smallest possible shard size of 4MB, there
> will only be a maximum of 1048576 entries under /.shard in the worst case -
> a number far smaller than the maximum number of inodes supported by most
> backend file systems.
>
> ii) Entry self-heal for a single directory, even in the simplest case of
> one entry deleted/created while a replica is down, requires crawling the
> whole sub-directory tree, figuring out which entries are present/absent
> between source and sink, and then healing them to the sink. With granular
> entry self-heal [1], we no longer have to live with this limitation.
>
> iii) Resolving the original file name, as given by the application, to the
> corresponding shard within a single directory (/.shard in the existing
> scheme) means looking up the parent directory /.shard first, followed by a
> lookup on the actual shard that is to be operated on. A two-tier
> sub-directory structure means that we not only have to resolve (or look up)
> /.shard first, but also the directories '/.shard/d2', '/.shard/d2/18', and
> '/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally looking
> up the shard, which is a lot of network operations. Yes, these are all
> one-time operations and the results can be cached in the inode table, but
> still, on account of having dynamic gfids (as opposed to just /.shard,
> which has a fixed gfid - be318638-e8a0-4c6d-977d-7a937aa84806), it is
> trivial to resolve the name of the shard to its gfid, or the parent name to
> the parent gfid, _even_ in memory.

s/trivial/non-trivial/ in the last sentence above.

Oh, and [1] -
https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.8/granular-entry-self-healing.md

-Krutika

> Are you unhappy with the performance? What's your typical VM image size,
> shard block size and the capacity of individual bricks?
>
> -Krutika