Hi all

I have this server with some 50TB of disk space. It originally had 30TB
on WD Greens, was filled quite full, and another storage chassis was
added. Now the space problem is gone, fine, but what about speed? Three
of the VDEVs are quite full, as indicated below. VDEV #3 (the one with
the spare active) just spent some 72 hours resilvering a 2TB drive. Now,
those green drives suck quite hard, but not _that_ hard. I'm guessing
the reason for this slowdown is how full those first three VDEVs are.

Now, is there a way, manually or automatically, to somehow balance the
data across these LVOLs? My first guess is that doing this
_automatically_ will require block pointer rewrite, but then, is there a
way to hack this thing by hand?

PS: Yeah, I know, there are more disks on the fourth VDEV than on the
first three, but this was how we chose to do it.
PPS: c10d1s1 and c11d0s1 are the SLOG mirror, even though zpool iostat
doesn't show it (zpool status does).

root at urd:~# zpool iostat -v
                 capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
dpool         38.7T  20.9T    302     39  17.3M  2.20M
  raidz2      12.1T   552G     81      3  4.72M   205K
    c7t2d0        -      -     34      2  1.18M  41.5K
    c7t3d0        -      -     34      2  1.18M  41.5K
    c7t4d0        -      -     33      2  1.18M  41.5K
    c7t5d0        -      -     33      2  1.18M  41.5K
    c7t6d0        -      -     34      2  1.18M  41.5K
    c7t7d0        -      -     33      2  1.18M  41.5K
    c8t0d0        -      -     35      2  1.18M  41.5K
  raidz2      12.4T   277G     84      5  4.81M   278K
    c8t1d0        -      -     35      3  1.22M  56.1K
    c8t2d0        -      -     34      3  1.22M  56.1K
    c8t3d0        -      -     35      3  1.22M  56.1K
    c8t4d0        -      -     35      3  1.22M  56.1K
    c8t5d0        -      -     34      3  1.22M  56.1K
    c8t6d0        -      -     35      3  1.22M  56.1K
    c8t7d0        -      -     34      3  1.22M  56.1K
  raidz2      12.0T   631G    101      7  6.50M   294K
    c9t0d0        -      -     39      3  1.56M  58.7K
    c9t1d0        -      -     39      3  1.56M  58.7K
    c9t2d0        -      -     39      3  1.56M  58.7K
    c9t3d0        -      -     39      3  1.56M  58.7K
    spare         -      -    472     42  7.16M  83.9K
      c9t4d0      -      -     39      3  1.56M  58.7K
      c9t7d0      -      -      0    259      2  6.85M
    c9t5d0        -      -     39      3  1.56M  58.7K
    c9t6d0        -      -     38      3  1.56M  58.7K
  mirror      11.8M  4.96G      0     14      0  1.03M
    c10d1s0       -      -      0     14      0  1.03M
    c11d0s0       -      -      0     14      0  1.03M
  raidz2      2.24T  19.5T     33      8  1.23M   417K
    c14t9d0       -      -     11      2   212K  42.3K
    c14t10d0      -      -     11      2   208K  42.3K
    c14t11d0      -      -     12      2   211K  42.3K
    c14t12d0      -      -     11      2   211K  42.3K
    c14t13d0      -      -     11      2   207K  42.3K
    c14t14d0      -      -     12      2   211K  42.3K
    c14t15d0      -      -     11      2   211K  42.3K
    c14t16d0      -      -     11      2   208K  42.3K
    c14t17d0      -      -     12      2   212K  42.3K
    c14t18d0      -      -     11      2   212K  42.3K
    c14t19d0      -      -     11      2   209K  42.3K
    c14t20d0      -      -     12      2   211K  42.3K
cache               -      -      -      -      -      -
  c10d1s1     69.5G  7.88M      6      6   314K   765K
  c11d0s1     69.5G  6.59M      6      6   313K   766K

--
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt.
Det er et elementært imperativ for alle pedagoger å unngå eksessiv
anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller
eksisterer adekvate og relevante synonymer på norsk.
Obviously, I meant VDEVs, not LVOLs... It's been a long day...

----- Original Message -----
> Now, is there a way, manually or automatically, to somehow balance the
> data across these LVOLs? My first guess is that doing this
> _automatically_ will require block pointer rewrite, but then, is there a
> way to hack this thing by hand?
[snip]

--
Vennlige hilsener / Best regards

roy
On Tue, Oct 19, 2010 at 7:13 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> I have this server with some 50TB of disk space. It originally had 30TB
> on WD Greens, was filled quite full, and another storage chassis was
> added. Now the space problem is gone, fine, but what about speed? Three
> of the VDEVs are quite full, as indicated below. VDEV #3 (the one with
> the spare active) just spent some 72 hours resilvering a 2TB drive. Now,
> those green drives suck quite hard, but not _that_ hard. I'm guessing
> the reason for this slowdown is how full those first three VDEVs are.
>
> Now, is there a way, manually or automatically, to somehow balance the
> data across these LVOLs? My first guess is that doing this
> _automatically_ will require block pointer rewrite, but then, is there a
> way to hack this thing by hand?

I described a similar issue in
http://opensolaris.org/jive/thread.jspa?threadID=134581&tstart=30. My
solution was to copy some datasets over to a new directory, delete the
old ones and destroy any snapshots that retain them. Data is read from
the old devices and written across all of them, causing large chunks of
space to be freed on the old devices.

I wished for a more aggressive write balancer, but that may be too much
to ask for.
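In rough commands - the names are made up and this is only a sketch of
the idea, not exactly what I typed - it boils down to:

  # tank/data stands in for a dataset that currently sits mostly on the
  # old, full vdevs; copying rewrites every block, and the new writes
  # are spread across all vdevs, favouring the emptier one
  zfs create tank/data.new
  rsync -a /tank/data/ /tank/data.new/

  # the old blocks are only freed once nothing references them any more
  zfs list -t snapshot -r tank/data     # review what still holds them
  zfs destroy tank/data@some-snapshot   # repeat for each old snapshot
  zfs destroy tank/data
  zfs rename tank/data.new tank/data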
On Oct 20, 2010, at 2:24 AM, Tuomas Leikola wrote:
> On Tue, Oct 19, 2010 at 7:13 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
>> I have this server with some 50TB of disk space. It originally had 30TB
>> on WD Greens, was filled quite full, and another storage chassis was
>> added. Now the space problem is gone, fine, but what about speed? Three
>> of the VDEVs are quite full, as indicated below. VDEV #3 (the one with
>> the spare active) just spent some 72 hours resilvering a 2TB drive. Now,
>> those green drives suck quite hard, but not _that_ hard. I'm guessing
>> the reason for this slowdown is how full those first three VDEVs are.
>>
>> Now, is there a way, manually or automatically, to somehow balance the
>> data across these LVOLs? My first guess is that doing this
>> _automatically_ will require block pointer rewrite, but then, is there a
>> way to hack this thing by hand?
>
> I described a similar issue in
> http://opensolaris.org/jive/thread.jspa?threadID=134581&tstart=30. My
> solution was to copy some datasets over to a new directory, delete the
> old ones and destroy any snapshots that retain them. Data is read from
> the old devices and written across all of them, causing large chunks of
> space to be freed on the old devices.
>
> I wished for a more aggressive write balancer, but that may be too much
> to ask for.

This can, of course, be tuned. Would you be interested in characterizing
the benefits and costs of a variety of such tunings?
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
On Wed, Oct 20, 2010 at 5:00 PM, Richard Elling <richard.elling at gmail.com> wrote:
>>> Now, is there a way, manually or automatically, to somehow balance the
>>> data across these LVOLs? My first guess is that doing this
>>> _automatically_ will require block pointer rewrite, but then, is there a
>>> way to hack this thing by hand?
>>
>> I described a similar issue in
>> http://opensolaris.org/jive/thread.jspa?threadID=134581&tstart=30. My
>> solution was to copy some datasets over to a new directory, delete the
>> old ones and destroy any snapshots that retain them. Data is read from
>> the old devices and written across all of them, causing large chunks of
>> space to be freed on the old devices.
>>
>> I wished for a more aggressive write balancer, but that may be too much
>> to ask for.
>
> This can, of course, be tuned. Would you be interested in characterizing
> the benefits and costs of a variety of such tunings?

If you're asking whether I wish to test and document my findings with
such tunables, then yes, I'm interested, though this is a home file
server, so it's not exactly a laboratory environment. I also think I can
produce enough spare parts to do synthetic tests (maybe in a VM
environment).

I was not aware of such tunables, though it appeared there might be some
emergency mode when a vdev has only a few percent of space left. My
server is currently running OI_147, but I haven't yet upgraded the pool,
so it's still version 14. I also have 111b and 134 boot environments
standing by.

--
- Tuomas
On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:

> I wished for a more aggressive write balancer but that may be too much
> to ask for.

I don't think it can be too much to ask for. Storage servers have long
enough lives that adding disks to them is a routine operation; to the
extent that that's a problem, that really needs to be fixed.

However, it's not the sort of thing one should hold one's breath waiting
for!

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On 2010-Oct-21 01:28:46 +0800, David Dyer-Bennet <dd-b at dd-b.net> wrote:
> On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:
>
>> I wished for a more aggressive write balancer but that may be too much
>> to ask for.
>
> I don't think it can be too much to ask for. Storage servers have long
> enough lives that adding disks to them is a routine operation; to the
> extent that that's a problem, that really needs to be fixed.

It will (should) arrive as part of the mythical block pointer rewrite
project.

--
Peter Jeremy
On Thu, Oct 21, 2010 at 12:06 AM, Peter Jeremy
<peter.jeremy at alcatel-lucent.com> wrote:
> On 2010-Oct-21 01:28:46 +0800, David Dyer-Bennet <dd-b at dd-b.net> wrote:
>> On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:
>>
>>> I wished for a more aggressive write balancer but that may be too much
>>> to ask for.
>>
>> I don't think it can be too much to ask for. Storage servers have long
>> enough lives that adding disks to them is a routine operation; to the
>> extent that that's a problem, that really needs to be fixed.
>
> It will (should) arrive as part of the mythical block pointer rewrite
> project.

Actually, BP rewrite would be needed for rebalancing data after the
fact, whereas I was referring to write balancing that tries to mitigate
the problem before it occurs.

I was thinking of a tunable like "writebalance=conservative|aggressive",
where conservative would be the current mode and aggressive would be
something like aiming for all devices to reach 90% full at exactly the
same time, and avoiding writes to devices over 90% altogether. The 90%
limit is of course arbitrary, but it commonly seems to be some kind of
tipping point.

The downside of aggressive balancing would of course be lower write
bandwidth, and the data written would not be striped across all vdevs,
so subsequent reads might also take a hit. The impact would depend
heavily on the usage pattern, obviously, but I expect most use cases
would not suffer much from this, and it is arguable whether somewhat
reduced bandwidth now is worse than a serious write slowdown further
down the road - the difference seems to be orders of magnitude, anyway.
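To make that concrete, here is a back-of-the-envelope reading of the
alloc/free numbers from Roy's iostat output (rounded, just to illustrate
the idea):

  raidz2 #1:  12.1T alloc /  552G free  ->  roughly 96% full
  raidz2 #2:  12.4T alloc /  277G free  ->  roughly 98% full
  raidz2 #3:  12.0T alloc /  631G free  ->  roughly 95% full
  raidz2 #4:  2.24T alloc / 19.5T free  ->  roughly 10% full

With a 90% cutoff, the aggressive policy would send essentially all new
writes to the fourth vdev until it, too, approached 90%, after which
allocation would spread out again.

--
- Tuomas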
On Tue, Oct 19, 2010 at 6:13 AM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> Now, is there a way, manually or automatically, to somehow balance the
> data across these LVOLs? My first guess is that doing this
> _automatically_ will require block pointer rewrite, but then, is there a
> way to hack this thing by hand?

You could send | receive some datasets to the same system, then destroy
the original and rename the new copy back to the original location
(rough sketch below). Or send a dataset to a different system, destroy
the original, and then send it back again. Most of the new copy should
end up on the new vdev, which will help balance things some.

Of course, since the new copy is still mostly on one vdev, it may not
have better read performance, but future writes will be able to spread
across all the vdevs.

You could continue to do this until you feel that the datasets have been
balanced out. Imagine mixing two fluids by pouring them back and forth
between two glasses - after a few times it'll be a homogeneous solution.

This makes me wonder if a 'ghetto bp_rewrite' would be possible by
simply preventing future writes to one vdev and duplicating (even via
send | receive) all the blocks that are stored on that vdev.
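Something like this - the dataset names are placeholders, and make sure
no clones or other snapshots still need the original before destroying
it:

  # send the dataset to a new name on the same pool
  zfs snapshot tank/data@rebalance
  zfs send tank/data@rebalance | zfs receive tank/data.copy

  # drop the old copy (-r also removes its snapshots), then move the
  # new one into place and clean up the transfer snapshot
  zfs destroy -r tank/data
  zfs rename tank/data.copy tank/data
  zfs destroy tank/data@rebalance

zfs send -R can carry child datasets and existing snapshots along if you
want to keep them, at the cost of copying those too.

-B

--
Brandon High : bhigh at freaks.com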