thr3ads.net - Gluster users - [Gluster-users] small files and cluster/stripe [May 2010]

Jeff Anderson-Lee

2010-May-13 19:36 UTC

[Gluster-users] small files and cluster/stripe

cluster/stripe will split large files across multiple volumes, but it seems to
always put the first part of the file on the first volume; if you have a
bunch of small files they all end up there, and one volume gets heavily
used by small files while the others are empty.

cluster/distribute spreads files across multiple volumes, but it puts 
the whole file on a single volume.

Some marriage of the two would be helpful for workloads which contain 
both large and small files, like adding an "option block-size ..." to 
cluster/distribute or "option distribute" to cluster/stripe; it would 
use the filename hash modulo nSubvolumes to determine which volume to 
start in for the first block, then rotate around the stripe for the rest.
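
A minimal sketch of the placement rule proposed above, not code from any Gluster translator. The function names are hypothetical, and zlib.crc32 merely stands in for whatever hash cluster/distribute really uses.

```python
import zlib


def start_subvolume(filename: str, n_subvolumes: int) -> int:
    """Pick the subvolume for block 0 from the filename hash.

    crc32 is an assumption; it only needs to be stable and well spread.
    """
    return zlib.crc32(filename.encode("utf-8")) % n_subvolumes


def block_layout(filename: str, file_size: int, block_size: int,
                 n_subvolumes: int) -> list:
    """Return the subvolume index that would hold each block of the file."""
    start = start_subvolume(filename, n_subvolumes)
    # Ceiling division; an empty file still occupies one (empty) block.
    n_blocks = max(1, -(-file_size // block_size))
    # Rotate around the stripe, starting from the hashed subvolume.
    return [(start + i) % n_subvolumes for i in range(n_blocks)]


if __name__ == "__main__":
    block = 128 * 1024
    print(block_layout("notes.txt", 4000, block, 4))       # one block only
    print(block_layout("matrix.bin", 6 * block, block, 4))  # six blocks
```

A file smaller than one block lands entirely on its hashed subvolume, as it would under cluster/distribute, while a larger file wraps around all subvolumes from that starting point.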

I suppose I can work around this by creating multiple volumes as
sub-directories of the same partition, then striping across those in
rotations, and distributing across the stripes.

Is there some other way?  Am I missing something?

Jeff Anderson-Lee

Craig Carl

2010-May-14 00:05 UTC

Jeff - 
Two comments/ideas. 

1. If you are limited to four pieces of hardware, the minimum for stripe, and
you want to stripe some of the data and just distribute the other files, there
is a way to do that. Ideally you would use your hardware RAID controllers to
create two LUNs on each host, one for distribute, the other for stripe. If you
don't have hardware RAID you could use LVM2 or ZFS to achieve the same
thing (or you could use folders).
1a. Once you have the two file systems created, use glusterfs-volgen to create
the vol files for the distribute export just like you normally would.
1b. Move the files you just created to the storage servers and clients.
1c. Re-run glusterfs-volgen, this time for the stripe, adding the -p option and
specifying a port (something above 1024, not 6996).
1d. Move the files you just created to the storage servers and clients.
1e. Start Gluster twice on all the servers, specifying the different vol files.
1f. You now have two GlusterFS exports, one distribute, the other stripe.
1g. You can mount one inside the other on the client if that makes management
easier.

There are advantages to this model: having two separate Gluster instances
significantly improves parallelism on the storage servers, and you can manage
the two instances as if they were on different iron.




2. The use case for stripe is vanishingly small. If you have very large files
(at least 2X the amount of memory in your storage servers and a minimum of 50GB)
with very limited writes and simultaneous access from hundreds of clients, then
stripe might be appropriate. Stripe was designed for a specific type of
HPC problem solving, not general file serving. Our video streaming users
don't use stripe, even though that is an obvious use; there are better ways
to configure Gluster for that. If you could share the type of content, access
methods, and IOPS, we could make some specific suggestions.

Thanks,

Craig

-- 
Craig Carl
Gluster, Inc.
Cell - (408) 829-9953 (California, USA)
Gtalk - craig.carl at gmail.com


Craig Carl

2010-May-14 01:24 UTC

Jeff - 
Thanks for your email. I think I've got a grasp of your environment now and
I understand the problem. If we create a "/gluster/small_files" and a
"/gluster/large_files", your users are unlikely to respect the distinction,
plus it is a management nightmare, right?
If you have time I'd like your help writing a feature request that would
implement what you need. Something like -

Gluster should provide the option of distributing files based on size to
different volumes.
This distribution should be transparent to users. 
This distribution only needs to happen the first time a file is written. 
The Gluster administrator should have the ability to provide a file size range
for each volume.
The different volumes could be different types; mirror, stripe, mirror &
distribute, etc.
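
A minimal sketch of what such a size-range rule table might look like, purely for illustration. The volume names, the thresholds, and pick_volume() are all hypothetical, and the sketch assumes the file's size is already known when the volume is chosen.

```python
from typing import List, Optional, Tuple

# (minimum size inclusive, maximum size exclusive or None, volume name).
# Every value here is a made-up example an administrator might configure.
SizeRule = Tuple[int, Optional[int], str]

RULES: List[SizeRule] = [
    (0, 1 << 20, "distribute-small"),                # under 1 MiB
    (1 << 20, 50 << 30, "mirror-distribute-medium"),  # 1 MiB up to 50 GiB
    (50 << 30, None, "stripe-huge"),                  # 50 GiB and larger
]


def pick_volume(size: int, rules: List[SizeRule] = RULES) -> str:
    """Return the name of the first volume whose size range contains size."""
    for low, high, volume in rules:
        if size >= low and (high is None or size < high):
            return volume
    raise ValueError(f"no volume configured for size {size}")


if __name__ == "__main__":
    for size in (4096, 3 << 30, 80 << 30):
        print(size, pick_volume(size))
```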

What have I missed? 

Craig 






----- Original Message ----- 
From: "Jeff Anderson-Lee" <jonah at eecs.berkeley.edu> 
To: "Craig Carl" <craig at gluster.com> 
Cc: gluster-users at gluster.org 
Sent: Thursday, May 13, 2010 5:27:45 PM GMT -08:00 US/Canada Pacific 
Subject: Re: [Gluster-users] small files and cluster/stripe 


We *are* a quasi-HPC environment. We have 100+ batch compute servers with 500+
cores, all with GbT interfaces, pounding on an old NAS storage server. We are
trying to replace the old shared staging area with new hardware. We've been
looking at an Isilon solution, which performs well for the task but costs 4x to
5x what a Gluster solution would price out at for similar-sized hardware/space.

Some of our users have millions of small files, some have thousands of large
files, some have one or two humongous files. If all the data were just one size
or another, all would be well. All files are currently stored in the same shared
staging area. Our users are not HPC programmers and tend to program in
high-level languages such as MATLAB, so we try to be as accommodating as
possible, rather than force them to manage the data distribution.

We'd love a solution that would (a) spread small files over multiple volumes
as well as (b) spread large files over multiple volumes. Cluster/distribute
would work for the former and cluster/stripe for the latter. A marriage of the
two would be great.

Right now I'm trying to patch together a temporary testbed using a bunch of
old machines with two 143GB drives each. The problem is that many files are
multi-GB and unless they are striped they could easily fill up a volume with
poor hash distributions. Likewise many small files could swamp the low-end disk
in a stripe volume.

I suppose we could create two pools and tell the predominantly small file users
to use one and the predominantly large file users to use the other, but somehow
I would not hold my breath on it working out.

Jeff

Jeff Darcy

2010-May-14 13:14 UTC

On 05/13/2010 03:36 PM, Jeff Anderson-Lee wrote:
> cluster/stripe will split large files across multiple volumes, but it
> seems to
> always put the first part of the file on the first volume; if you have a
> bunch of small files they all end up there, and one volume gets heavily
> used by small files while the others are empty.
>
> cluster/distribute spreads files across multiple volumes, but it puts
> the whole file on a single volume.
>
> Some marriage of the two would be helpful for workloads which contain
> both large and small files, like adding an "option block-size ..." to
> cluster/distribute or "option distribute" to cluster/stripe; it would
> use the filename hash modulo nSubvolumes to determine which volume to
> start in for the first block, then rotate around the stripe for the rest.
>
> I suppose I can work-around by creating multiple volumes as
> sub-directories of the same partition, then striping across those in
> rotations, and distributing across the stripes.
I have written such a distribute+stripe hybrid translator, but it's not 
ready to go beyond my desk and I'm distracted by another project right 
now.  Without that, distribute over stripe (as you suggest) would seem 
to be the natural choice, but I've generally had bad luck combining 
stripe with distribute or replicate.  In general, both distribute and 
stripe have a negative effect on reliability and per-node performance, 
though they make up for the latter by scaling horizontally, so you'd 
want to seriously consider adding replicate as well - but then you'll 
have even more complex interactions between complex translators and 
that's where bugs tend to creep in.  In fact, even though I'm from an 
HPC background myself, I've found little enough value in stripe that 
I've considered changing my translator to do distribute+replicate instead.
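
The reliability point can be illustrated with some simple arithmetic. This is a sketch under a strong assumption that nodes fail independently; the 0.99 per-node availability and the function name are made up for illustration.

```python
def all_data_available(p_node_up: float, n_sets: int, replicas: int = 1) -> float:
    """Probability that every replica set still has at least one node up.

    With plain distribute or stripe (replicas=1) every node must be up for
    all data to be reachable. Assumes independent node failures.
    """
    set_up = 1.0 - (1.0 - p_node_up) ** replicas
    return set_up ** n_sets


# Four nodes, no replication: all four must be up.
four_plain = all_data_available(0.99, n_sets=4)
# The same four nodes arranged as two mirrored pairs.
two_mirrored = all_data_available(0.99, n_sets=2, replicas=2)

if __name__ == "__main__":
    print(f"{four_plain:.4f} {two_mirrored:.4f}")
```

Under these assumptions, adding nodes without replication lowers the chance that all data is reachable, and replication more than recovers it at the cost of half the raw capacity.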

Craig Carl

2010-May-14 23:20 UTC

Jeff - 
I've paraphrased Tejas's response here - 
1. There is no way to know how big a file will be until the fclose() is
received.
2. What would we do about files that change sizes across the cutoff line? 
3. We could perhaps add a size parameter to the rebalance/defrag scripts we
have.

Would a process that redistributed files on some sort of schedule work?
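
A minimal sketch of what one scheduled pass could compute, assuming two pools and a single size cutoff. plan_moves(), the pool names, and the input format are hypothetical and are not part of the existing rebalance/defrag scripts.

```python
from typing import Dict, List, Tuple


def plan_moves(files: Dict[str, Tuple[int, str]], cutoff: int,
               small_pool: str = "small",
               large_pool: str = "large") -> List[Tuple[str, str, str]]:
    """List (path, current_pool, wanted_pool) for files in the wrong pool.

    files maps a path to (current size in bytes, pool it lives on now).
    Sizes are read at scan time, so a file that grew or shrank across the
    cutoff since it was written is picked up on the next pass.
    """
    moves = []
    for path, (size, pool) in sorted(files.items()):
        wanted = large_pool if size >= cutoff else small_pool
        if pool != wanted:
            moves.append((path, pool, wanted))
    return moves


if __name__ == "__main__":
    scan = {
        "/staging/a.txt": (10, "small"),
        "/staging/b.bin": (10 * 2**30, "small"),  # grew past the cutoff
        "/staging/c.log": (5, "large"),           # shrank below it
    }
    for move in plan_moves(scan, cutoff=2**30):
        print(move)
```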

Craig 




----- Original Message ----- 
From: "Jeff Anderson-Lee" <jonah at eecs.berkeley.edu> 
To: "Craig Carl" <craig at gluster.com> 
Cc: gluster-users at gluster.org 
Sent: Thursday, May 13, 2010 6:39:31 PM GMT -08:00 US/Canada Pacific 
Subject: Re: [Gluster-users] small files and cluster/stripe 

That would be one solution. I would target another that I suspect is probably
simpler:

Gluster should provide the option of pseudo-randomizing the distribution of file
stripes across volumes, so that all small files do not end up on the same
subvolume of a cluster/stripe.
This distribution should be transparent to users.
This distribution only needs to happen the first time a file is written and may
be based on the file name hash (a la cluster/distribute).

The net behavior could be such that small files (less than the block-size) would
have the same data distribution pattern as they would have with
cluster/distribute, while larger files (greater than the stripe block-size)
would have their later blocks distributed round-robin from that starting
place.
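
A small simulation of the effect described here, under stated assumptions: it is a toy model and not Gluster code, zlib.crc32 stands in for the real name hash, and the file names, sizes, and block size are invented.

```python
import random
import zlib


def bytes_per_subvolume(files, block_size, n_subvolumes, hashed_start):
    """Total bytes stored on each subvolume for a list of (name, size) files.

    hashed_start=False models a stripe whose first block always goes to
    subvolume 0; hashed_start=True starts at crc32(name) % n_subvolumes.
    """
    usage = [0] * n_subvolumes
    for name, size in files:
        start = zlib.crc32(name.encode("utf-8")) % n_subvolumes if hashed_start else 0
        offset = 0
        block = 0
        while offset < size:
            chunk = min(block_size, size - offset)
            usage[(start + block) % n_subvolumes] += chunk
            offset += chunk
            block += 1
    return usage


rng = random.Random(0)  # fixed seed so runs are repeatable
BLOCK = 128 * 1024
# Ten thousand files, each smaller than one stripe block.
small_files = [(f"small-{i}.dat", rng.randint(1, 64 * 1024)) for i in range(10000)]

fixed = bytes_per_subvolume(small_files, BLOCK, 4, hashed_start=False)
hashed = bytes_per_subvolume(small_files, BLOCK, 4, hashed_start=True)

if __name__ == "__main__":
    print("fixed start :", fixed)
    print("hashed start:", hashed)
```

With a fixed start every small file lands on subvolume 0; with a hashed start the same bytes are spread over all four subvolumes.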

Given that the code already exists for distributing files based on namehash in
cluster/distribute I think this could be an easier feature to add.

Jeff
