I need a bit of a sanity check here.

1) I have a RAIDZ2 of 8 1TB drives, so 6TB usable, running on an ancient
version of OpenSolaris (snv_134 I think). On that zpool (miniraid) I have
a zvol (RichRAID) that's using almost the whole FS. It's shared out via
COMSTAR Fibre Channel target mode. I'd like to move that zvol to a newer
server with a larger zpool. Sounds like a job for ZFS send/receive, right?

2) Since ZFS send/receive is snapshot-based I need to create a snapshot.
Unfortunately I did not realize that zvols require disk space sufficient
to duplicate the zvol, and my zpool wasn't big enough. After a false start
(zpool add is dangerous when low on sleep) I added a 250GB mirror and a
pair of 3TB mirrors to miniraid and was able to successfully snapshot the
zvol: miniraid/RichRAID@exportable (I ended up booting off an OI 151a5 USB
stick to make that work, since I don't believe snv_134 could handle a 3TB
disk).

3) Now it's easy, right? I enabled root login via SSH on the new host,
which is running a zpool "archive1" consisting of a single RAIDZ2 of 3TB
drives using ashift=12, and did a ZFS send:

    zfs send miniraid/RichRAID@exportable | ssh root@newhost zfs receive archive1/RichRAID

It asked for the root password, I gave it that password, and it was off
and running. GigE ain't super fast, but I've got time.

The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
That... doesn't look right. (Comparing zfs list -t snapshot and looking at
the 5.34 ref for the snapshot vs zfs list on the new system and looking at
space used.)

Is this a problem? Should I be panicking yet?
--
Dave Pooser
Manager of Information Services
Alford Media  http://www.alfordmedia.com
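For reference, a minimal sketch of the workflow described above, using the
pool and dataset names from this thread (miniraid/RichRAID on the old host,
archive1 on the new one); exact flags depend on the ZFS version on each side:

    # on the old host: snapshot the zvol, then stream it to the new host
    zfs snapshot miniraid/RichRAID@exportable
    zfs send miniraid/RichRAID@exportable | ssh root@newhost zfs receive archive1/RichRAID

    # compare what the source snapshot references with what has landed so far
    zfs list -t snapshot -o name,used,referenced miniraid/RichRAID@exportable   # old host
    zfs list -o name,used,referenced archive1/RichRAID                          # new host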
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-15 05:39 UTC
[zfs-discuss] Zvol vs zfs send/zfs receive
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave Pooser
>
> Unfortunately I did not realize that zvols require disk space sufficient
> to duplicate the zvol, and my zpool wasn't big enough. After a false start
> (zpool add is dangerous when low on sleep) I added a 250GB mirror and a
> pair of 3TB mirrors to miniraid and was able to successfully snapshot the
> zvol: miniraid/RichRAID@exportable

This doesn't make any sense to me.  The snapshot should not take up any
(significant) space on the sending side.  It's only on the receiving side,
trying to receive a snapshot, that you require space, because it won't
clobber the existing zvol on the receiving side until the complete new zvol
has been received to clobber it with.

But simply creating the snapshot on the sending side should be no problem.

> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
> That... doesn't look right.

I don't know why that happens, but sometimes it happens.  So far, I've
always waited it out, and so far it's always succeeded for me.
On 09/15/12 04:46 PM, Dave Pooser wrote:
> I need a bit of a sanity check here.
>
> 1) I have a RAIDZ2 of 8 1TB drives, so 6TB usable, running on an ancient
> version of OpenSolaris (snv_134 I think). On that zpool (miniraid) I have
> a zvol (RichRAID) that's using almost the whole FS. It's shared out via
> COMSTAR Fibre Channel target mode. I'd like to move that zvol to a newer
> server with a larger zpool. Sounds like a job for ZFS send/receive, right?
>
> 2) Since ZFS send/receive is snapshot-based I need to create a snapshot.
> Unfortunately I did not realize that zvols require disk space sufficient
> to duplicate the zvol, and my zpool wasn't big enough.

To do what?  A snapshot only starts to consume space when data in the
filesystem/volume changes.

> After a false start
> (zpool add is dangerous when low on sleep) I added a 250GB mirror and a
> pair of 3TB mirrors to miniraid and was able to successfully snapshot the
> zvol: miniraid/RichRAID@exportable (I ended up booting off an OI 151a5 USB
> stick to make that work, since I don't believe snv_134 could handle a 3TB
> disk).
>
> 3) Now it's easy, right? I enabled root login via SSH on the new host,
> which is running a zpool "archive1" consisting of a single RAIDZ2 of 3TB
> drives using ashift=12, and did a ZFS send:
>
>     zfs send miniraid/RichRAID@exportable | ssh root@newhost zfs receive archive1/RichRAID
>
> It asked for the root password, I gave it that password, and it was off
> and running. GigE ain't super fast, but I've got time.
>
> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
> That... doesn't look right. (Comparing zfs list -t snapshot and looking at
> the 5.34 ref for the snapshot vs zfs list on the new system and looking at
> space used.)
>
> Is this a problem? Should I be panicking yet?

No.  Do you have compression on on one side but not the other?

Either way, let things run to completion.

--
Ian.
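A quick way to answer the compression question above would be to compare
the property on both ends; this uses the dataset names from this thread:

    # old host
    zfs get compression,compressratio miniraid/RichRAID
    # new host
    zfs get compression,compressratio archive1/RichRAID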
On 09/14/12 22:39, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Dave Pooser
>>
>> Unfortunately I did not realize that zvols require disk space sufficient
>> to duplicate the zvol, and my zpool wasn't big enough. After a false start
>> (zpool add is dangerous when low on sleep) I added a 250GB mirror and a
>> pair of 3TB mirrors to miniraid and was able to successfully snapshot the
>> zvol: miniraid/RichRAID@exportable
>
> This doesn't make any sense to me.  The snapshot should not take up any
> (significant) space on the sending side.  It's only on the receiving side,
> trying to receive a snapshot, that you require space, because it won't
> clobber the existing zvol on the receiving side until the complete new zvol
> has been received to clobber it with.
>
> But simply creating the snapshot on the sending side should be no problem.

By default, zvols have reservations equal to their size (so that writes
don't fail due to the pool being out of space).

Creating a snapshot in the presence of a reservation requires reserving
enough space to overwrite every block on the device.

You can remove or shrink the reservation if you know that the entire
device won't be overwritten.
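If snapshot space is the obstacle, a sketch of checking and (at your own
risk) dropping the per-zvol reservation might look like this, using the
names from this thread; on most releases it is the refreservation that is
set by default:

    # see what is currently reserved for the zvol
    zfs get volsize,reservation,refreservation miniraid/RichRAID

    # drop the refreservation so a snapshot no longer needs room for a
    # full overwrite (risk: writes to the zvol can fail if the pool fills)
    zfs set refreservation=none miniraid/RichRAID
    zfs snapshot miniraid/RichRAID@exportable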
> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
> That... doesn't look right. (Comparing zfs list -t snapshot and looking at
> the 5.34 ref for the snapshot vs zfs list on the new system and looking at
> space used.)
>
> Is this a problem? Should I be panicking yet?

Well, the zfs send/receive finally finished, at a size of 9.56TB (apologies
for the HTML, it was the only way I could make the columns readable):

root@archive:/home/admin# zfs get all archive1/RichRAID
NAME               PROPERTY              VALUE                  SOURCE
archive1/RichRAID  type                  volume                 -
archive1/RichRAID  creation              Fri Sep 14  4:17 2012  -
archive1/RichRAID  used                  9.56T                  -
archive1/RichRAID  available             1.10T                  -
archive1/RichRAID  referenced            9.56T                  -
archive1/RichRAID  compressratio         1.00x                  -
archive1/RichRAID  reservation           none                   default
archive1/RichRAID  volsize               5.08T                  local
archive1/RichRAID  volblocksize          8K                     -
archive1/RichRAID  checksum              on                     default
archive1/RichRAID  compression           off                    default
archive1/RichRAID  readonly              off                    default
archive1/RichRAID  copies                1                      default
archive1/RichRAID  refreservation        none                   default
archive1/RichRAID  primarycache          all                    default
archive1/RichRAID  secondarycache        all                    default
archive1/RichRAID  usedbysnapshots       0                      -
archive1/RichRAID  usedbydataset         9.56T                  -
archive1/RichRAID  usedbychildren        0                      -
archive1/RichRAID  usedbyrefreservation  0                      -
archive1/RichRAID  logbias               latency                default
archive1/RichRAID  dedup                 off                    default
archive1/RichRAID  mlslabel              none                   default
archive1/RichRAID  sync                  standard               default
archive1/RichRAID  refcompressratio      1.00x                  -
archive1/RichRAID  written               9.56T                  -

So used is 9.56TB, volsize is 5.08TB (which is the amount of data used on
the volume). The Mac connected to the FC target sees a 5.6TB volume with
5.1TB used, so that makes sense-- but where did the other 4TB go?

(I'm about at the point where I'm just going to create and export another
volume on a second zpool and then let the Mac copy from one zvol to the
other-- this is starting to feel like voodoo here.)
--
Dave Pooser
Manager of Information Services
Alford Media  http://www.alfordmedia.com
On Sat, 15 Sep 2012, Dave Pooser wrote:

> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
> That... doesn't look right. (Comparing zfs list -t snapshot and looking at
> the 5.34 ref for the snapshot vs zfs list on the new system and looking at
> space used.)
>
> Is this a problem? Should I be panicking yet?

Does the old pool use 512 byte sectors while the new pool uses 4K sectors?
Is there any change to compression settings?

With volblocksize of 8k on disks with 4K sectors one might expect very poor
space utilization because metadata chunks will use/waste a minimum of 4k.
There might be more space consumed by the metadata than the actual data.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
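A rough way to check the sector-size and block-size question on each host
(zdb output formats vary between releases, so treat this as a sketch):

    # pool ashift: 9 means 512-byte sectors, 12 means 4K sectors
    zdb | grep ashift

    # block size and compression of the zvol itself
    zfs get volblocksize,compression miniraid/RichRAID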
On Fri, Sep 14, 2012 at 11:07 PM, Bill Sommerfeld <sommerfeld at hamachi.org> wrote:

> On 09/14/12 22:39, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>>> bounces at opensolaris.org] On Behalf Of Dave Pooser
>>>
>>> Unfortunately I did not realize that zvols require disk space sufficient
>>> to duplicate the zvol, and my zpool wasn't big enough. After a false start
>>> (zpool add is dangerous when low on sleep) I added a 250GB mirror and a
>>> pair of 3TB mirrors to miniraid and was able to successfully snapshot the
>>> zvol: miniraid/RichRAID@exportable
>>
>> This doesn't make any sense to me.  The snapshot should not take up any
>> (significant) space on the sending side.  It's only on the receiving side,
>> trying to receive a snapshot, that you require space, because it won't
>> clobber the existing zvol on the receiving side until the complete new zvol
>> has been received to clobber it with.
>>
>> But simply creating the snapshot on the sending side should be no problem.
>
> By default, zvols have reservations equal to their size (so that writes
> don't fail due to the pool being out of space).
>
> Creating a snapshot in the presence of a reservation requires reserving
> enough space to overwrite every block on the device.
>
> You can remove or shrink the reservation if you know that the entire
> device won't be overwritten.

This is the right idea, but it's actually the refreservation (reservation
on referenced space) that has this behavior, and is set by default on
zvols.  The reservation (on "used" space) covers the space consumed by
snapshots, so taking a snapshot doesn't affect it (at first, but the
reservation will be consumed as you overwrite space and the snapshot
"grows").

--matt
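One way to see this accounting for yourself is with a throwaway zvol; the
pool name "tank" below is a placeholder, and the behavior described assumes
a reasonably recent pool version:

    # create a small test zvol; by default refreservation equals volsize
    zfs create -V 1G tank/testvol
    zfs get volsize,reservation,refreservation tank/testvol

    # snapshot it and watch how the space is attributed
    zfs snapshot tank/testvol@s1
    zfs get usedbydataset,usedbysnapshots,usedbyrefreservation tank/testvol

    # clean up
    zfs destroy tank/testvol@s1
    zfs destroy tank/testvol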
On Sat, Sep 15, 2012 at 2:07 PM, Dave Pooser <dave.zfs at alfordmedia.com> wrote:

> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
> That... doesn't look right. (Comparing zfs list -t snapshot and looking at
> the 5.34 ref for the snapshot vs zfs list on the new system and looking at
> space used.)
>
> Is this a problem? Should I be panicking yet?
>
> Well, the zfs send/receive finally finished, at a size of 9.56TB
> (apologies for the HTML, it was the only way I could make the columns
> readable):
>
> root@archive:/home/admin# zfs get all archive1/RichRAID
> NAME               PROPERTY              VALUE                  SOURCE
> archive1/RichRAID  type                  volume                 -
> archive1/RichRAID  creation              Fri Sep 14  4:17 2012  -
> archive1/RichRAID  used                  9.56T                  -
> archive1/RichRAID  available             1.10T                  -
> archive1/RichRAID  referenced            9.56T                  -
> archive1/RichRAID  compressratio         1.00x                  -
> archive1/RichRAID  reservation           none                   default
> archive1/RichRAID  volsize               5.08T                  local
> archive1/RichRAID  volblocksize          8K                     -
> archive1/RichRAID  checksum              on                     default
> archive1/RichRAID  compression           off                    default
> archive1/RichRAID  readonly              off                    default
> archive1/RichRAID  copies                1                      default
> archive1/RichRAID  refreservation        none                   default
> archive1/RichRAID  primarycache          all                    default
> archive1/RichRAID  secondarycache        all                    default
> archive1/RichRAID  usedbysnapshots       0                      -
> archive1/RichRAID  usedbydataset         9.56T                  -
> archive1/RichRAID  usedbychildren        0                      -
> archive1/RichRAID  usedbyrefreservation  0                      -
> archive1/RichRAID  logbias               latency                default
> archive1/RichRAID  dedup                 off                    default
> archive1/RichRAID  mlslabel              none                   default
> archive1/RichRAID  sync                  standard               default
> archive1/RichRAID  refcompressratio      1.00x                  -
> archive1/RichRAID  written               9.56T                  -
>
> So used is 9.56TB, volsize is 5.08TB (which is the amount of data used on
> the volume). The Mac connected to the FC target sees a 5.6TB volume with
> 5.1TB used, so that makes sense-- but where did the other 4TB go?

I'm not sure.  The output of "zdb -bbb archive1" might help diagnose it.

--matt
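For reference, that diagnostic walks every block pointer in the pool and
prints a per-type space breakdown, so it can take a long time on a
multi-terabyte pool and is best run while the pool is quiet:

    # count blocks and report space usage by object type (read-only)
    zdb -bbb archive1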
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-16 12:43 UTC
[zfs-discuss] Zvol vs zfs send/zfs receive
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bill Sommerfeld
>
> > But simply creating the snapshot on the sending side should be no
> > problem.
>
> By default, zvols have reservations equal to their size (so that writes
> don't fail due to the pool being out of space).
>
> Creating a snapshot in the presence of a reservation requires reserving
> enough space to overwrite every block on the device.

This is surprising, because it's not like the "normal" zfs behavior.
Normal ZFS does not reserve snapshot space to guarantee you can always
completely overwrite every single used block of every single file in the
system.  It just starts consuming space for changed blocks, and if you
fill up the zpool, further writes are denied until you delete some snaps.

But you're saying it handles zvols differently - that when you create a
zvol, it reserves enough space for it, and when you snapshot it, it
reserves enough space to completely overwrite it and keep both the
snapshot and the current live version without running out of storage
space.  I never heard that before - and I can see some good reasons to do
it this way - but it's surprising.

Based on what I'm hearing now, it also seems: upon zvol creation, you
create a reservation.  Upon the first snapshot, you double the
reservation, but upon subsequent snapshots, you don't need to increase
the reservation each time.  Because snapshots are read-only, the system
is able to account for all the used space of the snapshots, plus a
reservation for the "live" current version.  Total space reserved will be
2x the size of the zvol, plus the actual COW consumed space for all the
snapshots.

The point is to guarantee that writes to a zvol will never be denied,
presumably because there's an assumption zvols are being used by things
like VMs and iSCSI shares, which behave very poorly if a write is denied.
Unlike "normal" files, where a denied write is generally an annoyance but
doesn't cause deeper harm, such as virtual servers crashing.

There's another lesson to be learned here.

As mentioned by Matthew, you can tweak your reservation (or
refreservation) on the zvol, but you do so at your own risk, possibly
putting yourself into a situation where writes to the zvol might get
denied.

But the important implied meaning is the converse - if you have guest VMs
in the filesystem (for example, if you're sharing NFS to ESX, or if
you're running VirtualBox) then you might want to set the reservation (or
refreservation) for those filesystems, modeled after the zvol behavior.
In other words, you might want to guarantee that ESX or VirtualBox can
always write.  It's probably a smart thing to do in a lot of situations.
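A sketch of what that might look like for a filesystem backing VM storage;
the dataset name and size here are purely illustrative:

    # guarantee 500G of writable space for the dataset that ESX or
    # VirtualBox writes to, regardless of snapshot growth elsewhere
    zfs set refreservation=500G tank/vmstore

    # verify the setting
    zfs get reservation,refreservation tank/vmstore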
On Sun, Sep 16, 2012 at 7:43 PM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris)
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> There's another lesson to be learned here.
>
> As mentioned by Matthew, you can tweak your reservation (or refreservation)
> on the zvol, but you do so at your own risk, possibly putting yourself into
> a situation where writes to the zvol might get denied.
>
> But the important implied meaning is the converse - if you have guest VMs
> in the filesystem (for example, if you're sharing NFS to ESX, or if you're
> running VirtualBox) then you might want to set the reservation (or
> refreservation) for those filesystems, modeled after the zvol behavior. In
> other words, you might want to guarantee that ESX or VirtualBox can always
> write. It's probably a smart thing to do in a lot of situations.

I'd say just do what you normally do. In my case, I use sparse files or
dynamic disk images anyway, so when I use zvols I use "zfs create -s".
That single switch sets reservation and refreservation to "none".

--
Fajar
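For reference, a sparse ("thin") zvol of the kind described above might be
created like this; the volume name and size are illustrative:

    # -s skips the refreservation, so snapshots need no extra room for a
    # full overwrite, but writes can fail if the pool runs out of space
    zfs create -s -V 5T archive1/thinvol
    zfs get volsize,reservation,refreservation archive1/thinvol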
On Sep 15, 2012, at 6:03 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Sat, 15 Sep 2012, Dave Pooser wrote:
>
>> The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB.
>> That... doesn't look right. (Comparing zfs list -t snapshot and looking at
>> the 5.34 ref for the snapshot vs zfs list on the new system and looking at
>> space used.)
>>
>> Is this a problem? Should I be panicking yet?
>
> Does the old pool use 512 byte sectors while the new pool uses 4K sectors?
> Is there any change to compression settings?
>
> With volblocksize of 8k on disks with 4K sectors one might expect very poor
> space utilization because metadata chunks will use/waste a minimum of 4k.
> There might be more space consumed by the metadata than the actual data.

With a zvol of 8K blocksize, 4K sector disks, and raidz you will get 12K
(data plus parity) written for every block, regardless of how many disks
are in the set.  There will also be some metadata overhead, but I don't
know of a metadata sizing formula for the general case.

So the bad news is, 4K sector disks with small blocksize zvols tend to
have space utilization more like mirroring.  The good news is that
performance is also more like mirroring.
 -- richard

--
illumos Day & ZFS Day, Oct 1-2, 2012 San Francisco  www.zfsday.com
Richard.Elling at RichardElling.com
+1-760-896-4422
On 9/16/12 10:40 AM, "Richard Elling" <richard.elling at gmail.com> wrote:

> With a zvol of 8K blocksize, 4K sector disks, and raidz you will get 12K
> (data plus parity) written for every block, regardless of how many disks
> are in the set.  There will also be some metadata overhead, but I don't
> know of a metadata sizing formula for the general case.
>
> So the bad news is, 4K sector disks with small blocksize zvols tend to
> have space utilization more like mirroring.  The good news is that
> performance is also more like mirroring.
>  -- richard

Ok, that makes sense. And since there's no way to change the blocksize of
a zvol after creation (AFAIK) I can either live with the size, find 3TB
drives with 512-byte sectors (I think Seagate Constellations would work)
and do yet another send/receive, or create a new zvol with a larger
blocksize and copy the files from one zvol to the other. (Leaning toward
option 3 because the files are mostly largish graphics files and the
like.) Thanks for the help!
--
Dave Pooser
Manager of Information Services
Alford Media  http://www.alfordmedia.com
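A sketch of what option 3 might look like on the new pool; the 64K block
size and the volume name are illustrative, and volblocksize can only be
chosen at creation time. COMSTAR setup details vary by release:

    # create the replacement zvol with a larger block size, sparse so no
    # refreservation is needed while both copies exist
    zfs create -s -V 6T -o volblocksize=64K archive1/RichRAID2

    # expose it as a COMSTAR logical unit, then map a view and let the Mac
    # copy files from the old volume to the new one over FC
    sbdadm create-lu /dev/zvol/rdsk/archive1/RichRAID2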