Xavier Trilla
2013-Aug-02 00:52 UTC
[Gluster-users] How to correctly distribute OpenStack VM files...
Hi,

We have been playing with GlusterFS for a while (now with version 3.4). We are running tests to check whether GlusterFS can really be used as the distributed storage for OpenStack Block Storage (Cinder), as new features in KVM, GlusterFS and OpenStack point to GlusterFS as the future of OpenStack open source block and object storage.

But we found a problem just as we started playing with GlusterFS: the way the distribute translator (DHT) balances the load. We understand and see the benefits of a metadata-less setup. Hashing filenames and assigning a hash range to each brick is clever, reliable and fast, but as we understand it there is a big problem when it comes to storing the VM images of an OpenStack deployment.

OpenStack Block Storage (Cinder) assigns a name (a GUID) to each volume it creates, so GlusterFS hashes the filename and decides which brick the file should be stored on. But since in this scenario we don't have many files (we would have just one big file per VM), we may end up with really unbalanced storage.

Let's say we have a 4-brick setup with DHT distribute and we want to store 100 VMs there. The ideal scenario would be:

Brick1: 25 VMs
Brick2: 25 VMs
Brick3: 25 VMs
Brick4: 25 VMs

As VMs are IO intensive, it's really important to balance the load correctly, since each brick has a limited amount of IOPS. But as DHT is based only on a filename hash, we could end up with something like the following scenario (or even worse):

Brick1: 50 VMs
Brick2: 10 VMs
Brick3: 35 VMs
Brick4: 5 VMs

And if we scale this out, things may get even worse: we may end up with almost all the VM files on one or two bricks and all the other bricks almost empty. And if we use growing VM disk image files like qcow2, the "min-free-disk" option will not prevent all the VM disk image files from being stored on the same brick. So, as we understand it, DHT works well for large numbers of small files, but for a few big IO-intensive files it doesn't seem to be a really good solution... (We are looking for a solution able to handle around 32 bricks and around 1500 VMs for the initial deployment, and able to scale up to 256 bricks and 12000 VMs :/ )

So, does anybody have a suggestion about how to handle this? So far we only see two options: either use the legacy unify translator with the ALU scheduler, or use the cluster/stripe translator with a big block-size so that at least the load gets balanced across all bricks in some way. But we don't like unify because it needs a namespace brick, and striping seems to have an impact on performance and really complicates backup/restore/recovery strategies.

So, suggestions? :)

Thanks!

Best regards,
Xavier Trilla P.
Silicon Hosting <https://siliconhosting.com/>

Did you know that at SiliconHosting we now answer your technical questions for free? More information at: siliconhosting.com/qa/ <https://siliconhosting.com/qa/>
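To make the imbalance concern concrete, here is a minimal sketch of hash-range placement. It is an illustration only: Python's md5 stands in for GlusterFS's own 32-bit hash, the "volume-<UUID>" naming is assumed, and the equal-range 4-brick layout is hypothetical.

    # Sketch: place 100 UUID-named volume files onto 4 bricks by hash range.
    # NOT GlusterFS's actual hash or layout logic -- md5 is a stand-in for a
    # uniform hash, and the equal-range split per brick is an assumption.
    import hashlib
    import uuid
    from collections import Counter

    BRICKS = 4          # hypothetical 4-brick distribute volume
    NUM_VOLUMES = 100   # one big image file per VM, as in the example above

    def brick_for(filename, bricks=BRICKS):
        """Map a filename to the equal-sized hash range (brick) it falls into."""
        h = int.from_bytes(hashlib.md5(filename.encode()).digest()[:4], "big")
        return h * bricks // 2**32

    names = ["volume-%s" % uuid.uuid4() for _ in range(NUM_VOLUMES)]
    counts = Counter(brick_for(n) for n in names)
    for b in range(BRICKS):
        print("Brick%d: %d VMs" % (b + 1, counts[b]))

Run repeatedly, the counts hover around 25 per brick but can deviate by several VMs either way with only 100 files; whether that level of skew matters depends on how IO-bound each image is.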
Joel Young
2013-Aug-02 13:53 UTC
[Gluster-users] How to correctly distribute OpenStack VM files...
Are you actually observing this? With cryptographic hashes being effectively uniform, the probability of such an extreme distribution is extraordinarily low.
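A quick simulation backs up this intuition. It is a rough check only: uniform random placement stands in for the filename hash, and the 100-file / 4-brick numbers come from the hypothetical example upthread.

    # Rough check: how skewed does uniform placement of 100 files on 4 bricks
    # get, and how often does a 50-file brick (the worst case sketched
    # upthread) appear? Uniform random choice stands in for the hash.
    import random
    from collections import Counter

    TRIALS, FILES, BRICKS = 20000, 100, 4

    fullest_total = 0
    extreme = 0
    for _ in range(TRIALS):
        counts = Counter(random.randrange(BRICKS) for _ in range(FILES))
        fullest = max(counts.values())
        fullest_total += fullest
        if fullest >= 50:
            extreme += 1

    print("average files on the fullest brick: %.1f (perfect balance is 25)"
          % (fullest_total / float(TRIALS)))
    print("trials where some brick held >= 50 of the 100 files: %d of %d"
          % (extreme, TRIALS))

Typically the fullest brick ends up holding around 30 of the 100 files, and a 50-file brick essentially never appears; changing FILES and BRICKS to 1500 and 32 (the initial deployment size mentioned upthread) gives a feel for how the spread behaves at scale.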
Ben Turner
2013-Aug-02 14:39 UTC
[Gluster-users] How to correctly distribute OpenStack VM files...
I ran into this in my testing as well. I think this is happening because Cinder names its image files in the format image-$UUID. My guess is that the similar filenames are hashing to the same brick, as I have seen similar behavior naming files like test1, test2, etc. Looking at this problem, the first/simplest thing I could think of was for Cinder to use a more unique filename and hope that they don't hash to the same brick. Maybe even just use $UUID, but I don't know enough about the elastic hashing algorithm (EHA) to make very educated suggestions.

Something completely experimental I have been toying with is the idea of running both gluster (with the NUFA xlator) and Cinder on all compute nodes. Gluster would group all of the local storage across compute nodes into a single namespace and replicate it. NUFA would prefer the local node for creating the image and hopefully reduce some of the overhead of going over the wire (although data would still need to be replicated). I doubt the folks in either the GlusterFS or OpenStack camps would suggest this but, on paper at least, I was thinking this could help several things that I have run into.

As far as striping goes, I haven't looked into that as an option, and to be honest I probably won't until it becomes supported on the Red Hat downstream bits. But it sounds like a possibility. The ALU scheduler isn't something I am familiar with, so I don't have a comment there.

I'll keep an eye on this thread, let me know what you come up with!

-b
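The co-located setup being described can be sketched as a toy placement policy. This is a hypothetical illustration only, not the actual cluster/nufa translator: the brick names, free-space numbers and threshold are all made up.

    # Toy model of NUFA-style placement next to plain hash-based placement.
    # Hypothetical sketch -- not the real cluster/nufa translator logic.
    import hashlib

    BRICKS = ["compute1", "compute2", "compute3", "compute4"]
    free_gb = {b: 500 for b in BRICKS}   # made-up free space on each local brick
    MIN_FREE_GB = 50                     # made-up minimum free-space threshold

    def hash_brick(filename):
        """Plain hash placement: pick a brick from the filename hash."""
        h = int.from_bytes(hashlib.md5(filename.encode()).digest()[:4], "big")
        return BRICKS[h * len(BRICKS) // 2**32]

    def nufa_brick(filename, local):
        """NUFA-style placement: prefer the local brick while it has room."""
        if free_gb[local] > MIN_FREE_GB:
            return local
        return hash_brick(filename)      # fall back to hash placement when full

    # An image created on compute2 lands on compute2's own brick, so its reads
    # stay local; replica traffic would still cross the wire.
    print(nufa_brick("volume-9b2c0c7e", local="compute2"))

The trade-off is the one described above: local reads get cheaper, but nodes that create more images can still end up holding more of them than their neighbours.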
Gowrishankar Rajaiyan
2013-Aug-05 13:01 UTC
[Gluster-users] How to correctly distribute OpenStack VM files...
Another suggestion you may want to try: have your GlusterFS nodes also serve as OpenStack Cinder nodes and use NUFA [1].

~shanks

[1] http://gluster.org/community/documentation/index.php/Translators/cluster/nufa