I'm migrating to ZFS and Solaris for cluster computing storage, and have some completely static data sets that need to be as fast as possible. One of the scenarios I'm testing is the addition of vdevs to a pool.

Starting out, I populated a pool that had 4 vdevs. Then, I added 3 more vdevs and would like to balance this data across the pool for performance. The data may be in subdirectories like this: /proxy_data/instrument_X/domain_Y. Because of the access pattern across the cluster, I need these subdirectories each spread across as many disks as possible. Simply putting the data evenly on all vdevs is suboptimal because different files within a single domain from a single instrument are likely to be used by 200 jobs at once.

Because this particular data is 100% static, I cannot count on reads/writes automatically balancing the pool.

Best,
Jesse Stroik
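A sketch of the setup being described; the vdev type, pool name, and device names are all hypothetical, since the post does not give them:

    # hypothetical: pool first created with 4 top-level vdevs
    zpool create proxy \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
        raidz c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
        raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
        raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0

    # later grown by striping 3 more top-level vdevs into it; note that
    # zpool add does not rewrite data already sitting on the first 4 vdevs
    zpool add proxy \
        raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 \
        raidz c3t4d0 c3t5d0 c3t6d0 c3t7d0 \
        raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0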
Buy a large, read-optimized SSD (or several) and add it as a cache device :-)
 -- richard

On Nov 20, 2009, at 8:44 AM, Jesse Stroik wrote:
> I'm migrating to ZFS and Solaris for cluster computing storage, and
> have some completely static data sets that need to be as fast as
> possible. One of the scenarios I'm testing is the addition of vdevs
> to a pool. [...]
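Adding an SSD as an L2ARC cache device is a one-liner; the pool and device names below are hypothetical:

    # attach a read-optimized SSD to the pool as an L2ARC cache device;
    # cache devices can be added to and removed from a pool at any time
    zpool add proxy cache c5t0d0

    # several SSDs can serve as cache devices for one pool
    zpool add proxy cache c5t1d0 c5t2d0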
Richard Elling wrote:
> Buy a large, read-optimized SSD (or several) and add it as a cache
> device :-)
> -- richard

OK, maybe I'm missing something here, but ZFS should spread ALL data across ALL vdevs, with the caveat that very small files (under the min stripe size) will only be found on a portion of the vdevs - that is, such small files will only take up a portion of a single stripe.

The directory structure is irrelevant to how the data is written. That is, the only determination as to how the file /foo/bar/baz is located on disk is the actual file size of baz itself, and the level of fragmentation of the zpool. For static write-once data like yours, fragmentation shouldn't be an issue.

In your case, where you had a 4 vdev stripe, and then added 3 vdevs, I would recommend re-copying the existing data to make sure it now covers all 7 vdevs. Thus, I'd do something like:

% cd /proxy_data
% for i in instrument_*
do
    mkdir ${i}.new
    rsync -a $i/ ${i}.new/
    rm -rf $i
    mv ${i}.new $i
done

Richard's suggestion, while tongue-in-cheek, has much merit. If you are only going to be doing work on a small portion of your total data set at once, but heavily hit that section, then you want to read cache (L2ARC) as much of that as possible. Which means, either buy lots of RAM, or get yourself an SSD. Good news is that you can likely use one of the "cheaper" SSDs - the Intel X25-M is a good fit here for a Readzilla.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Thanks for the suggestions thus far,

Erik:
> In your case, where you had a 4 vdev stripe, and then added 3 vdevs, I
> would recommend re-copying the existing data to make sure it now covers
> all 7 vdevs.

Yes, this was my initial reaction as well, but I am concerned with the fact that I do not know how zfs populates the vdevs. My naive guess is that it either fills the most empty, or (and more likely) fills them at a rate relative to their amount of free space -- that is, the new devices with more free space will get a disproportionate amount of some of the data.

> Richard's suggestion, while tongue-in-cheek, has much merit. If you are
> only going to be doing work on a small portion of your total data set at
> once, but heavily hit that section, then you want to read cache (L2ARC)
> as much of that as possible. Which means, either buy lots of RAM, or
> get yourself an SSD. Good news is that you can likely use one of the
> "cheaper" SSDs - the Intel X25-M is a good fit here for a Readzilla.

The problem is that caching the data may often not help: we're storing tens of terabytes of data for some instruments, and we may only need to read each job's worth of data once. So you could cache the data, but it simply wouldn't be read again.

There are, of course, job types where you use the same set of data for multiple jobs, but having even a small amount of extra memory seems to be very helpful in that case, as you'll have several nodes reading the same data at roughly the same time.

Best,
Jesse
On Fri, 20 Nov 2009, Richard Elling wrote:
> Buy a large, read-optimized SSD (or several) and add it as a cache device :-)

But first install as much RAM as the machine will accept. :-)

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
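A quick way on Solaris to see how much of that RAM the ARC is actually using (standard arcstats kstats; the values are of course machine-specific):

    prtconf | grep Memory            # total physical memory
    kstat -p zfs:0:arcstats:size     # bytes currently held by the ARC
    kstat -p zfs:0:arcstats:c_max    # current upper bound on ARC size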
On Fri, 20 Nov 2009, Jesse Stroik wrote:
> Yes, this was my initial reaction as well, but I am concerned with the fact
> that I do not know how zfs populates the vdevs. My naive guess is that it
> either fills the most empty, or (and more likely) fills them at a rate
> relative to their amount of free space -- that is, the new devices with more
> free space will get a disproportionate amount of some of the data.

You are right that you only get to be a virgin once. After the filesystem has been written to many times, it is not likely to perform quite as well as it originally did. With the size of your data, it seems inconvenient to restart the pool from scratch.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Nov 20, 2009, at 10:16 AM, Jesse Stroik wrote:
> Yes, this was my initial reaction as well, but I am concerned with
> the fact that I do not know how zfs populates the vdevs. My naive
> guess is that it either fills the most empty, or (and more likely)
> fills them at a rate relative to their amount of free space -- that
> is, the new devices with more free space will get a disproportionate
> amount of some of the data.

There is a bias towards empty vdevs during writes. However, that won't help data previously written. The often-requested block pointer rewrite feature could help rebalance, but do not expect it to be a trivial endeavor for very large pools.

> The problem is that caching the data may often not help: we're
> storing tens of terabytes of data for some instruments, and we may
> only need to read each job's worth of data once. So you could cache
> the data, but it simply wouldn't be read again.

Use the secondarycache property to manage those file systems that use read-once data.

> There are, of course, job types where you use the same set of data
> for multiple jobs, but having even a small amount of extra memory
> seems to be very helpful in that case, as you'll have several nodes
> reading the same data at roughly the same time.

Yep. More, faster memory closer to the consumer is always better. You could buy machines with TBs of RAM, but high-end x86 boxes top out at 512 GB.
 -- richard
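A minimal sketch of the secondarycache suggestion; the filesystem name is hypothetical:

    # keep read-once data out of the L2ARC so the SSD only caches data
    # that stands a chance of being re-read
    zfs set secondarycache=none proxy/instrument_X

    # or cache only metadata for such a filesystem
    zfs set secondarycache=metadata proxy/instrument_X

    # primarycache controls the in-RAM ARC the same way
    zfs get primarycache,secondarycache proxy/instrument_X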
>> There are, of course, job types where you use the same set of data for
>> multiple jobs, but having even a small amount of extra memory seems to
>> be very helpful in that case, as you'll have several nodes reading the
>> same data at roughly the same time.
>
> Yep. More, faster memory closer to the consumer is always better.
> You could buy machines with TBs of RAM, but high-end x86 boxes top
> out at 512 GB.

That was our previous approach. We're testing doing it with relatively cheap, consumer-level Sun hardware (i.e., machines with 64 or maybe 128 GB of memory today) that can be easily expanded as the pool's purpose changes.

I know what our options are for increasing performance if we want to increase the budget. My question isn't, "I have this data set, can you please tell me how to buy and configure a system." My question is, "how does ZFS balance pools during writes, and how can I force it to balance data I want balanced in the way I want it balanced?" And if the answer to that question is, "you can't reliably do this," then that is acceptable. It's something I would like to be able to plan around.

Right now, this storage node is very small (~100TB) and in testing. I want to know how I can solve problems like this as we scale it up into a full-fledged SAN that holds a lot more data and gets moved into production. Knowing the limitations of ZFS is a critical part of properly designing and expanding the system.

Best,
Jesse
Interesting, at least to me, the part where "this storage node is very small (~100TB)" :)

Anyway, how are you using your ZFS? Are you creating volumes and presenting them to end-nodes over iscsi/fiber, nfs, or other? Could it be helpful to use some sort of cluster filesystem to have some more control in the "balance" of writes across devices?

I'm just thinking out loud, but if your end-nodes access a zvol over iscsi, could you format that zvol in the end-node with LustreFS, and use the features of the LustreFS?

Bruno

Jesse Stroik wrote:
> I know what our options are for increasing performance if we want to
> increase the budget. My question isn't, "I have this data set, can
> you please tell me how to buy and configure a system." My question
> is, "how does ZFS balance pools during writes, and how can I force it
> to balance data I want balanced in the way I want it balanced?" [...]
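A sketch of Bruno's zvol-over-iscsi idea; the volume size, names, and the use of the legacy shareiscsi property (rather than COMSTAR) are assumptions:

    # carve a zvol out of the pool and export it as an iSCSI LUN
    zfs create -V 10T proxy/lustre_ost1
    zfs set shareiscsi=on proxy/lustre_ost1

    # the end-node would then format the LUN with its own filesystem
    # (e.g. as a Lustre OST) and handle striping at that layer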
Bruno,

Bruno Sousa wrote:
> Interesting, at least to me, the part where "this storage node is very
> small (~100TB)" :)

Well, that's only as big as two x4540s, and we have lots of those for a slightly different project.

> Anyway, how are you using your ZFS? Are you creating volumes and
> presenting them to end-nodes over iscsi/fiber, nfs, or other? Could it
> be helpful to use some sort of cluster filesystem to have some more
> control in the "balance" of writes across devices?

Yes -- right now I'm using Infiniband as our interconnect directly to each compute node, and am testing a variety of protocols for transferring data, all over IP, sadly. I seem to have run into some SDP performance issues, but that's a different issue altogether.

Most likely, we'll be moving toward Lustre, which has RDMA support and is expected to support ZFS as a back end next year, at which point we'll migrate to directly mounted storage on the nodes.

> I'm just thinking out loud, but if your end-nodes access a zvol over
> iscsi, could you format that zvol in the end-node with LustreFS, and
> use the features of the LustreFS?

We'd really like to use ZFS and wait for Lustre to support it as a back end next year. At that point, I realize this particular question will probably be moot. However, I don't want to stand still while waiting for a feature to be implemented -- which is why I'm looking to mitigate performance penalties associated with poorly organized data across vdevs.

Best,
Jesse
On Nov 20, 2009, at 12:14 PM, Jesse Stroik wrote:
> I know what our options are for increasing performance if we want to
> increase the budget. My question isn't, "I have this data set, can
> you please tell me how to buy and configure a system." My question
> is, "how does ZFS balance pools during writes, and how can I force
> it to balance data I want balanced in the way I want it balanced?"
> And if the answer to that question is, "you can't reliably do this,"
> then that is acceptable. It's something I would like to be able to
> plan around.

Writes (allocations) are biased towards the freer (in the percentage sense) of the fully functional vdevs. However, diversity for copies and affinity for gang blocks is preserved. The starting point for understanding this in the code is at:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c

> Right now, this storage node is very small (~100TB) and in testing.
> I want to know how I can solve problems like this as we scale it up
> into a full-fledged SAN that holds a lot more data and gets moved
> into production. Knowing the limitations of ZFS is a critical part
> of properly designing and expanding the system.

More work is done to try and level across metaslabs when the metaslabs have less than 30% free space. There may be a reasonable rule of thumb lurking here somewhere, but I'm not sure it can be general enough, as it will depend, to some degree, on the workload.

This is pretty far down in the weeds... do many people think it would be useful to describe this in human-grokable form? Sometimes, ignorance is bliss :-)
 -- richard
Something occurs to me: how full is your current 4 vdev pool? I'm assuming it's not over 70% or so.

Yes, by adding another 3 vdevs, any writes will be biased towards the "empty" vdevs, but that's for less-than-full-stripe-width writes (right, Richard?). That is, if I'm doing a write that would be full-stripe size, and I've got enough space on all vdevs (even if certain ones are much fuller than others), then it will write across all vdevs.

So, while you can't get a virgin pool out of this, I think you can get stuff reasonably well-balanced by recopying then deleting, say, 1TB (or less) of data at a time.

Richard Elling wrote:
> On Nov 20, 2009, at 12:14 PM, Jesse Stroik wrote:
>> My question is, "how does ZFS balance pools during writes, and how
>> can I force it to balance data I want balanced in the way I want it
>> balanced?" And if the answer to that question is, "you can't reliably
>> do this," then that is acceptable. It's something I would like to be
>> able to plan around.

From a user's standpoint, you can't "force" ZFS to do the block layout in a manner you specify. The best you can do is understand what ZFS does in a given situation. There's no ability to TELL ZFS what to do.

> Writes (allocations) are biased towards the freer (in the percentage
> sense) of the fully functional vdevs. However, diversity for copies
> and affinity for gang blocks is preserved.

>> Right now, this storage node is very small (~100TB) and in testing.
>> I want to know how I can solve problems like this as we scale it up
>> into a full-fledged SAN that holds a lot more data and gets moved
>> into production.

For a lot of reasons, I would consider creating NEW zpools when you add new disk space in large lots, rather than adding vdevs to existing zpools. It should prove no harder to manage, and it allows you to get a virgin zpool, which will provide the best performance.

> Sometimes, ignorance is bliss :-)
> -- richard

oooh, then I must be ecstatically happy!

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
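A sketch of that chunked recopy, working at the domain level so only a small slice of data is in flight at any one time; paths are hypothetical, and it assumes nothing else is writing to these directories meanwhile:

    cd /proxy_data
    for i in instrument_*/domain_*
    do
        # rewriting the data allocates new blocks across all 7 vdevs
        rsync -a $i/ ${i}.new/
        rm -rf $i
        mv ${i}.new $i
    done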
On Nov 20, 2009, at 11:22 PM, Erik Trimble wrote:
> Yes, by adding another 3 vdevs, any writes will be biased towards the
> "empty" vdevs, but that's for less-than-full-stripe-width writes
> (right, Richard?). That is, if I'm doing a write that would be
> full-stripe size, and I've got enough space on all vdevs (even if
> certain ones are much fuller than others), then it will write across
> all vdevs.

"vdev" is a generic term. In this case, we're looking at dynamically striping data across "top-level vdevs." Block devices are "leaf vdevs."
 -- richard
Erik and Richard: thanks for the information -- this is all very good stuff.

Erik Trimble wrote:
> Something occurs to me: how full is your current 4 vdev pool? I'm
> assuming it's not over 70% or so.
>
> So, while you can't get a virgin pool out of this, I think you can get
> stuff reasonably well-balanced by recopying then deleting, say, 1TB (or
> less) of data at a time.

I'm giving this a shot now. The four top-level vdevs were just under 50% full, so this should give us good distribution, and in fact it looks like the data is being spread across the top-level vdevs almost equally. Of course, this makes perfect sense, since the disks may often be less than 70% used.

> For a lot of reasons, I would consider creating NEW zpools when you add
> new disk space in large lots, rather than adding vdevs to existing
> zpools. It should prove no harder to manage, and it allows you to get a
> virgin zpool, which will provide the best performance.

Yes, that's what I will strive for, although money flows in interesting ways sometimes, and occasionally we will be in a situation where we'll be building a new pool in two or even three expansions. Depending on the pool's usage, the write patterns of zpool/zfs are good for me to understand.

Best,
Jesse
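For anyone wanting to check the same thing: per-vdev space usage can be read from zpool iostat; after the recopy, the capacity figures should be roughly even across all seven top-level vdevs (pool name hypothetical):

    # the -v flag breaks capacity and I/O out per top-level vdev
    zpool iostat -v proxy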