Hi,

The suggestion you gave was in fact considered at the time of writing the
shard translator. Here are some of the considerations for sticking with a
single directory, as opposed to a two-tier classification of shards based on
the initial characters of the uuid string:

i) Even for a 4TB disk with the smallest possible shard size of 4MB, there
will only be a maximum of 1048576 entries under /.shard in the worst case - a
number far smaller than the maximum number of inodes supported by most
backend file systems.

ii) Entry self-heal for a single directory, even in the simplest case of one
entry deleted/created while a replica is down, requires crawling the whole
sub-directory tree, figuring out which entries are present/absent between
source and sink, and then healing them to the sink. With granular entry
self-heal [1], we no longer have to live with this limitation.

iii) Resolving the original file name, as given by the application, to the
corresponding shard within a single directory (/.shard in the existing
scheme) means looking up the parent directory /.shard first, followed by a
lookup on the actual shard that is to be operated on. A two-tier
sub-directory structure means that we not only have to resolve (or look up)
/.shard first, but also the directories '/.shard/d2', '/.shard/d2/18', and
'/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally looking
up the shard, which is a lot of network operations. Yes, these are all
one-time operations and the results can be cached in the inode table, but
still, on account of having dynamic gfids (as opposed to just /.shard, which
has a fixed gfid - be318638-e8a0-4c6d-977d-7a937aa84806), it is trivial to
resolve the name of the shard to its gfid, or the parent name to the parent
gfid, _even_ in memory.

Are you unhappy with the performance? What's your typical VM image size,
shard block size and the capacity of individual bricks?

-Krutika

On Mon, Jul 18, 2016 at 2:43 PM, Gandalf Corvotempesta
<gandalf.corvotempesta at gmail.com> wrote:

> 2016-07-18 9:53 GMT+02:00 Oleksandr Natalenko <oleksandr at natalenko.name>:
> > I'd say, like this:
> >
> > /.shard/d2/18/D218CD1C-4BD9-40D7-9810-86B3F7932509.1
>
> Yes, something like this.
> I was on mobile when I wrote. Your suggestion is better than mine.
>
> Probably, using a directory for the whole sharded file is also better and
> keeps the directory structure clear:
>
> /.shard/d2/18/D218CD1C-4BD9-40D7-9810-86B3F7932509/D218CD1C-4BD9-40D7-9810-86B3F7932509.1
>
> The current shard directory structure doesn't scale at all.
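For readers skimming the archive, a rough back-of-the-envelope sketch of the
two numbers the message above relies on: the worst-case shard count per file,
and how many directory lookups each layout needs on a cold inode table before
the shard itself can be resolved. This is Python written purely for
illustration; it is not part of the original thread, and the function names
and layout labels are invented.

    # Illustrative only: reproduces the arithmetic from the message above.
    TB = 1024 ** 4
    MB = 1024 ** 2

    def shards_per_file(file_size, shard_size=4 * MB):
        """Worst-case number of shard entries one file contributes under /.shard."""
        return -(-file_size // shard_size)  # ceiling division

    def lookups_before_shard(layout):
        """Directory lookups needed (cold cache) before the shard file itself
        can be looked up."""
        if layout == "flat":      # /.shard/<gfid>.N -- /.shard has a fixed gfid
            return 1
        if layout == "two-tier":  # /.shard/d2/18/<gfid>/<gfid>.N -- dynamic gfids
            return 4
        raise ValueError(layout)

    print(shards_per_file(4 * TB))           # 1048576
    print(lookups_before_shard("flat"))      # 1
    print(lookups_before_shard("two-tier"))  # 4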
2016-07-18 12:25 GMT+02:00 Krutika Dhananjay <kdhananj at redhat.com>:

> Hi,
>
> The suggestion you gave was in fact considered at the time of writing the
> shard translator. Here are some of the considerations for sticking with a
> single directory, as opposed to a two-tier classification of shards based
> on the initial characters of the uuid string:
>
> i) Even for a 4TB disk with the smallest possible shard size of 4MB, there
> will only be a maximum of 1048576 entries under /.shard in the worst case -
> a number far smaller than the maximum number of inodes supported by most
> backend file systems.

That is with just one single file. What about thousands of huge sharded
files? In a petabyte-scale cluster, having thousands of huge files should be
considered normal.

> iii) Resolving the original file name, as given by the application, to the
> corresponding shard within a single directory (/.shard in the existing
> scheme) means looking up the parent directory /.shard first, followed by a
> lookup on the actual shard that is to be operated on. A two-tier
> sub-directory structure means that we not only have to resolve (or look up)
> /.shard first, but also the directories '/.shard/d2', '/.shard/d2/18', and
> '/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally looking
> up the shard, which is a lot of network operations. Yes, these are all
> one-time operations and the results can be cached in the inode table, but
> still, on account of having dynamic gfids (as opposed to just /.shard,
> which has a fixed gfid - be318638-e8a0-4c6d-977d-7a937aa84806), it is
> trivial to resolve the name of the shard to its gfid, or the parent name to
> the parent gfid, _even_ in memory.

What about just one single level?

/.shard/d218cd1c-4bd9-40d7-9810-86b3f7932509/d218cd1c-4bd9-40d7-9810-86b3f7932509.1

You have the GFID, so there is no need to crawl multiple levels: you get
direct access to the proper path.

With this solution, you have 1,048,576 entries in the directory of a 4TB
sharded file with a 4MB shard size. With the current implementation, you have
1,048,576 entries for each sharded file, all in the same directory: if I have
100 4TB files, I'll end up with 1,048,576 * 100 = 104,857,600 files in a
single directory.

> Are you unhappy with the performance? What's your typical VM image size,
> shard block size and the capacity of individual bricks?

No, I'm just thinking about this optimization.
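Again purely for illustration (Python, not from the thread; the helper names
are invented, and the per-file layout shown is only Gandalf's proposal, not
what the shard translator actually implements), the difference between the
two layouts being compared can be written out as path templates plus the
entry-count arithmetic:

    # Illustrative sketch of the two layouts under discussion.
    GFID = "d218cd1c-4bd9-40d7-9810-86b3f7932509"
    SHARDS_PER_4TB_FILE = (4 * 1024**4) // (4 * 1024**2)  # 1048576

    def shard_path_current(gfid, n):
        # Current layout: every shard of every file lands in one directory.
        return "/.shard/%s.%d" % (gfid, n)

    def shard_path_proposed(gfid, n):
        # Proposed layout: one directory per file, keyed directly by the
        # file's gfid, so it can be derived without intermediate levels.
        return "/.shard/%s/%s.%d" % (gfid, gfid, n)

    # Prints the proposed path for shard 1 of the example file.
    print(shard_path_proposed(GFID, 1))

    # Directory sizes for 100 files of 4TB each, with 4MB shards:
    print(100 * SHARDS_PER_4TB_FILE)  # 104857600 entries in /.shard (current)
    print(SHARDS_PER_4TB_FILE)        # 1048576 entries per directory (proposed)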
On Mon, Jul 18, 2016 at 3:55 PM, Krutika Dhananjay <kdhananj at redhat.com> wrote:

> Hi,
>
> The suggestion you gave was in fact considered at the time of writing the
> shard translator. Here are some of the considerations for sticking with a
> single directory, as opposed to a two-tier classification of shards based
> on the initial characters of the uuid string:
>
> i) Even for a 4TB disk with the smallest possible shard size of 4MB, there
> will only be a maximum of 1048576 entries under /.shard in the worst case -
> a number far smaller than the maximum number of inodes supported by most
> backend file systems.
>
> ii) Entry self-heal for a single directory, even in the simplest case of
> one entry deleted/created while a replica is down, requires crawling the
> whole sub-directory tree, figuring out which entries are present/absent
> between source and sink, and then healing them to the sink. With granular
> entry self-heal [1], we no longer have to live with this limitation.
>
> iii) Resolving the original file name, as given by the application, to the
> corresponding shard within a single directory (/.shard in the existing
> scheme) means looking up the parent directory /.shard first, followed by a
> lookup on the actual shard that is to be operated on. A two-tier
> sub-directory structure means that we not only have to resolve (or look up)
> /.shard first, but also the directories '/.shard/d2', '/.shard/d2/18', and
> '/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally looking
> up the shard, which is a lot of network operations. Yes, these are all
> one-time operations and the results can be cached in the inode table, but
> still, on account of having dynamic gfids (as opposed to just /.shard,
> which has a fixed gfid - be318638-e8a0-4c6d-977d-7a937aa84806), it is
> trivial to resolve the name of the shard to its gfid, or the parent name to
> the parent gfid, _even_ in memory.

s/trivial/non-trivial/ in the last sentence above.

Oh, and [1] -
https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.8/granular-entry-self-healing.md

-Krutika

> Are you unhappy with the performance? What's your typical VM image size,
> shard block size and the capacity of individual bricks?
>
> -Krutika