I'm migrating to ZFS and Solaris for cluster computing storage, and have some completely static data sets that need to be as fast as possible. One of the scenarios I'm testing is the addition of vdevs to a pool.

Starting out, I populated a pool that had 4 vdevs. Then, I added 3 more vdevs and would like to balance this data across the pool for performance. The data may be in subdirectories like this: /proxy_data/instrument_X/domain_Y. Because of the access pattern across the cluster, I need these subdirectories each spread across as many disks as possible. Simply putting the data evenly on all vdevs is suboptimal because different files within a single domain from a single instrument are likely to be used by 200 jobs at once.

Because this particular data is 100% static, I cannot count on reads/writes automatically balancing the pool.

Best,
Jesse Stroik
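A sketch of the setup being described; the vdev type, pool name, and device names are all hypothetical, since the post does not give them:

    # hypothetical: pool first created with 4 top-level vdevs
    zpool create proxy \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
        raidz c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
        raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
        raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0

    # later grown by striping 3 more top-level vdevs into it; note that
    # zpool add does not rewrite data already sitting on the first 4 vdevs
    zpool add proxy \
        raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 \
        raidz c3t4d0 c3t5d0 c3t6d0 c3t7d0 \
        raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0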
Buy a large, read-optimized SSD (or several) and add it as a cache device :-)
 -- richard

On Nov 20, 2009, at 8:44 AM, Jesse Stroik wrote:
> I'm migrating to ZFS and Solaris for cluster computing storage, and
> have some completely static data sets that need to be as fast as
> possible. One of the scenarios I'm testing is the addition of vdevs
> to a pool. [...]
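Adding an SSD as an L2ARC cache device is a one-liner; the pool and device names below are hypothetical:

    # attach a read-optimized SSD to the pool as an L2ARC cache device;
    # cache devices can be added to and removed from a pool at any time
    zpool add proxy cache c5t0d0

    # several SSDs can serve as cache devices for one pool
    zpool add proxy cache c5t1d0 c5t2d0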
Richard Elling wrote:
> Buy a large, read-optimized SSD (or several) and add it as a cache
> device :-)
> -- richard

OK, maybe I'm missing something here, but ZFS should spread ALL data across ALL vdevs, with the caveat that very small files (under the min stripe size) will only be found on a portion of the vdevs - that is, such small files will only take up a portion of a single stripe.

The directory structure is irrelevant to how the data is written. That is, the only determination as to how the file /foo/bar/baz is located on disk is the actual file size of baz itself, and the level of fragmentation of the zpool. For static write-once data like yours, fragmentation shouldn't be an issue.

In your case, where you had a 4 vdev stripe, and then added 3 vdevs, I would recommend re-copying the existing data to make sure it now covers all 7 vdevs. Thus, I'd do something like:

% cd /proxy_data
% for i in instrument_*
do
    mkdir ${i}.new
    rsync -a $i/ ${i}.new/
    rm -rf $i
    mv ${i}.new $i
done

Richard's suggestion, while tongue-in-cheek, has much merit. If you are only going to be doing work on a small portion of your total data set at once, but heavily hit that section, then you want to read cache (L2ARC) as much of that as possible. Which means, either buy lots of RAM, or get yourself an SSD. Good news is that you can likely use one of the "cheaper" SSDs - the Intel X25-M is a good fit here for a Readzilla.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Thanks for the suggestions thus far,

Erik:
> In your case, where you had a 4 vdev stripe, and then added 3 vdevs, I
> would recommend re-copying the existing data to make sure it now covers
> all 7 vdevs.

Yes, this was my initial reaction as well, but I am concerned with the fact that I do not know how zfs populates the vdevs. My naive guess is that it either fills the most empty, or (and more likely) fills them at a rate relative to their amount of free space -- that is, the new devices with more free space will get a disproportionate amount of some of the data.

> Richard's suggestion, while tongue-in-cheek, has much merit. If you are
> only going to be doing work on a small portion of your total data set at
> once, but heavily hit that section, then you want to read cache (L2ARC)
> as much of that as possible. Which means, either buy lots of RAM, or
> get yourself an SSD. Good news is that you can likely use one of the
> "cheaper" SSDs - the Intel X25-M is a good fit here for a Readzilla.

The problem is that caching the data may often not help: we're storing tens of terabytes of data for some instruments, and we may only need to read each job's worth of data once. So you could cache the data, but it simply wouldn't be read again.

There are, of course, job types where you use the same set of data for multiple jobs, but having even a small amount of extra memory seems to be very helpful in that case, as you'll have several nodes reading the same data at roughly the same time.

Best,
Jesse
On Fri, 20 Nov 2009, Richard Elling wrote:
> Buy a large, read-optimized SSD (or several) and add it as a cache device :-)

But first install as much RAM as the machine will accept. :-)

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
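A quick way on Solaris to see how much of that RAM the ARC is actually using (standard arcstats kstats; the values are of course machine-specific):

    prtconf | grep Memory            # total physical memory
    kstat -p zfs:0:arcstats:size     # bytes currently held by the ARC
    kstat -p zfs:0:arcstats:c_max    # current upper bound on ARC size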
On Fri, 20 Nov 2009, Jesse Stroik wrote:
> Yes, this was my initial reaction as well, but I am concerned with the fact
> that I do not know how zfs populates the vdevs. My naive guess is that it
> either fills the most empty, or (and more likely) fills them at a rate
> relative to their amount of free space -- that is, the new devices with more
> free space will get a disproportionate amount of some of the data.

You are right that you only get to be a virgin once. After the filesystem has been written to many times, it is not likely to perform quite as well as it originally did. With the size of your data, it seems inconvenient to restart the pool from scratch.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Nov 20, 2009, at 10:16 AM, Jesse Stroik wrote:
> Yes, this was my initial reaction as well, but I am concerned with
> the fact that I do not know how zfs populates the vdevs. My naive
> guess is that it either fills the most empty, or (and more likely)
> fills them at a rate relative to their amount of free space -- that
> is, the new devices with more free space will get a disproportionate
> amount of some of the data.

There is a bias towards empty vdevs during writes. However, that won't help data previously written. The often-requested block pointer rewrite feature could help rebalance, but do not expect it to be a trivial endeavor for very large pools.

> The problem is that caching the data may often not help: we're
> storing tens of terabytes of data for some instruments, and we may
> only need to read each job's worth of data once. So you could cache
> the data, but it simply wouldn't be read again.

Use the secondarycache property to manage those file systems that use read-once data.

> There are, of course, job types where you use the same set of data
> for multiple jobs, but having even a small amount of extra memory
> seems to be very helpful in that case, as you'll have several nodes
> reading the same data at roughly the same time.

Yep. More, faster memory closer to the consumer is always better. You could buy machines with TBs of RAM, but high-end x86 boxes top out at 512 GB.
 -- richard
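A minimal sketch of the secondarycache suggestion; the filesystem name is hypothetical:

    # keep read-once data out of the L2ARC so the SSD only caches data
    # that stands a chance of being re-read
    zfs set secondarycache=none proxy/instrument_X

    # or cache only metadata for such a filesystem
    zfs set secondarycache=metadata proxy/instrument_X

    # primarycache controls the in-RAM ARC the same way
    zfs get primarycache,secondarycache proxy/instrument_X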
>> There are, of course, job types where you use the same set of data for
>> multiple jobs, but having even a small amount of extra memory seems to
>> be very helpful in that case, as you'll have several nodes reading the
>> same data at roughly the same time.
>
> Yep. More, faster memory closer to the consumer is always better.
> You could buy machines with TBs of RAM, but high-end x86 boxes top
> out at 512 GB.

That was our previous approach. We're testing doing it with relatively cheap, consumer-level Sun hardware (i.e., machines with 64 or maybe 128 GB of memory today) that can be easily expanded as the pool's purpose changes.

I know what our options are for increasing performance if we want to increase the budget. My question isn't, "I have this data set, can you please tell me how to buy and configure a system." My question is, "how does ZFS balance pools during writes, and how can I force it to balance data I want balanced in the way I want it balanced?" And if the answer to that question is, "you can't reliably do this," then that is acceptable. It's something I would like to be able to plan around.

Right now, this storage node is very small (~100TB) and in testing. I want to know how I can solve problems like this as we scale it up into a full-fledged SAN that holds a lot more data and gets moved into production. Knowing the limitations of ZFS is a critical part of properly designing and expanding the system.

Best,
Jesse
Interesting, at least to me, the part where "this storage node is very small (~100TB)" :)

Anyway, how are you using your ZFS? Are you creating volumes and presenting them to end-nodes over iscsi/fiber, nfs, or other? Could it be helpful to use some sort of cluster filesystem to have some more control in the "balance" of writes across devices?

I'm just thinking out loud, but if your end-nodes access a zvol over iscsi, could you format that zvol in the end-node with LustreFS, and use the features of the LustreFS?

Bruno

Jesse Stroik wrote:
> I know what our options are for increasing performance if we want to
> increase the budget. My question isn't, "I have this data set, can
> you please tell me how to buy and configure a system." My question
> is, "how does ZFS balance pools during writes, and how can I force it
> to balance data I want balanced in the way I want it balanced?" [...]
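A sketch of Bruno's zvol-over-iscsi idea; the volume size, names, and the use of the legacy shareiscsi property (rather than COMSTAR) are assumptions:

    # carve a zvol out of the pool and export it as an iSCSI LUN
    zfs create -V 10T proxy/lustre_ost1
    zfs set shareiscsi=on proxy/lustre_ost1

    # the end-node would then format the LUN with its own filesystem
    # (e.g. as a Lustre OST) and handle striping at that layer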
Bruno,

Bruno Sousa wrote:
> Interesting, at least to me, the part where "this storage node is very
> small (~100TB)" :)

Well, that's only as big as two x4540s, and we have lots of those for a slightly different project.

> Anyway, how are you using your ZFS? Are you creating volumes and
> presenting them to end-nodes over iscsi/fiber, nfs, or other? Could it
> be helpful to use some sort of cluster filesystem to have some more
> control in the "balance" of writes across devices?

Yes -- right now I'm using Infiniband as our interconnect directly to each compute node, and am testing a variety of protocols for transferring data, all over IP, sadly. I seem to have run into some SDP performance issues, but that's a different issue altogether.

Most likely, we'll be moving toward Lustre, which has RDMA support and is expected to support ZFS as a back end next year, at which point we'll migrate to directly mounted storage on the nodes.

> I'm just thinking out loud, but if your end-nodes access a zvol over
> iscsi, could you format that zvol in the end-node with LustreFS, and
> use the features of the LustreFS?

We'd really like to use ZFS and wait for Lustre to support it as a back end next year. At that point, I realize this particular question will probably be moot. However, I don't want to stand still while waiting for a feature to be implemented -- which is why I'm looking to mitigate performance penalties associated with poorly organized data across vdevs.

Best,
Jesse
On Nov 20, 2009, at 12:14 PM, Jesse Stroik wrote:
> I know what our options are for increasing performance if we want to
> increase the budget. My question isn't, "I have this data set, can
> you please tell me how to buy and configure a system." My question
> is, "how does ZFS balance pools during writes, and how can I force
> it to balance data I want balanced in the way I want it balanced?"
> And if the answer to that question is, "you can't reliably do this,"
> then that is acceptable. It's something I would like to be able to
> plan around.

Writes (allocations) are biased towards the freer (in the percentage sense) of the fully functional vdevs. However, diversity for copies and affinity for gang blocks is preserved. The starting point for understanding this in the code is at:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c

> Right now, this storage node is very small (~100TB) and in testing.
> I want to know how I can solve problems like this as we scale it up
> into a full-fledged SAN that holds a lot more data and gets moved
> into production. Knowing the limitations of ZFS is a critical part
> of properly designing and expanding the system.

More work is done to try and level across metaslabs when the metaslabs have less than 30% free space. There may be a reasonable rule of thumb lurking here somewhere, but I'm not sure it can be general enough, as it will depend, to some degree, on the workload.

This is pretty far down in the weeds... do many people think it would be useful to describe this in human-grokable form? Sometimes, ignorance is bliss :-)
 -- richard
Something occurs to me: how full is your current 4 vdev pool? I'm assuming it's not over 70% or so.

Yes, by adding another 3 vdevs, any writes will be biased towards the "empty" vdevs, but that's for less-than-full-stripe-width writes (right, Richard?). That is, if I'm doing a write that would be full-stripe size, and I've got enough space on all vdevs (even if certain ones are much fuller than others), then it will write across all vdevs.

So, while you can't get a virgin pool out of this, I think you can get stuff reasonably well-balanced by recopying then deleting, say, 1TB (or less) of data at a time.

Richard Elling wrote:
> On Nov 20, 2009, at 12:14 PM, Jesse Stroik wrote:
>> My question is, "how does ZFS balance pools during writes, and how
>> can I force it to balance data I want balanced in the way I want it
>> balanced?" And if the answer to that question is, "you can't reliably
>> do this," then that is acceptable. It's something I would like to be
>> able to plan around.

From a user's standpoint, you can't "force" ZFS to do the block layout in a manner you specify. The best you can do is understand what ZFS does in a given situation. There's no ability to TELL ZFS what to do.

> Writes (allocations) are biased towards the freer (in the percentage
> sense) of the fully functional vdevs. However, diversity for copies
> and affinity for gang blocks is preserved.

>> Right now, this storage node is very small (~100TB) and in testing.
>> I want to know how I can solve problems like this as we scale it up
>> into a full-fledged SAN that holds a lot more data and gets moved
>> into production.

For a lot of reasons, I would consider creating NEW zpools when you add new disk space in large lots, rather than adding vdevs to existing zpools. It should prove no harder to manage, and it allows you to get a virgin zpool, which will provide the best performance.

> Sometimes, ignorance is bliss :-)
> -- richard

oooh, then I must be ecstatically happy!

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
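A sketch of that chunked recopy, working at the domain level so only a small slice of data is in flight at any one time; paths are hypothetical, and it assumes nothing else is writing to these directories meanwhile:

    cd /proxy_data
    for i in instrument_*/domain_*
    do
        # rewriting the data allocates new blocks across all 7 vdevs
        rsync -a $i/ ${i}.new/
        rm -rf $i
        mv ${i}.new $i
    done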
On Nov 20, 2009, at 11:22 PM, Erik Trimble wrote:
> Yes, by adding another 3 vdevs, any writes will be biased towards the
> "empty" vdevs, but that's for less-than-full-stripe-width writes
> (right, Richard?). That is, if I'm doing a write that would be
> full-stripe size, and I've got enough space on all vdevs (even if
> certain ones are much fuller than others), then it will write across
> all vdevs.

"vdev" is a generic term. In this case, we're looking at dynamically striping data across "top-level vdevs." Block devices are "leaf vdevs."
 -- richard
Erik and Richard: thanks for the information -- this is all very good stuff.

Erik Trimble wrote:
> Something occurs to me: how full is your current 4 vdev pool? I'm
> assuming it's not over 70% or so.
>
> So, while you can't get a virgin pool out of this, I think you can get
> stuff reasonably well-balanced by recopying then deleting, say, 1TB (or
> less) of data at a time.

I'm giving this a shot now. The four top-level vdevs were just under 50% full, so this should give us good distribution, and in fact it looks like the data is being spread across the top-level vdevs almost equally. Of course, this makes perfect sense, since the disks may often be less than 70% used.

> For a lot of reasons, I would consider creating NEW zpools when you add
> new disk space in large lots, rather than adding vdevs to existing
> zpools. It should prove no harder to manage, and it allows you to get a
> virgin zpool, which will provide the best performance.

Yes, that's what I will strive for, although money flows in interesting ways sometimes, and occasionally we will be in a situation where we'll be building a new pool in two or even three expansions. Depending on the pool's usage, the write patterns of zpool/zfs are good for me to understand.

Best,
Jesse
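For anyone wanting to check the same thing: per-vdev space usage can be read from zpool iostat; after the recopy, the capacity figures should be roughly even across all seven top-level vdevs (pool name hypothetical):

    # the -v flag breaks capacity and I/O out per top-level vdev
    zpool iostat -v proxy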