Hi.

What do you guys think about implementing a 'zfs/zpool rewrite' command?
It would read every block older than the date when the command was executed
and write it again (using the standard ZFS COW mechanism, similar to how
resilvering works, but the data is read from the same disk it is written to).

I see a few situations where it might be useful:

1. My file system is almost full (or not) and I'd like to enable
   compression on it. Unfortunately compression only applies to data
   written from now on, and I'd also like to compress already stored data.
   Here comes 'zfs rewrite'!

2. I was a bad boy and turned off checksumming. Now I suspect something
   corrupts my data and I'd really like to checksum everything. Ok, here
   comes 'zfs rewrite'!

3. I created a file system with a huge amount of data, where most of the
   data is read-only. I change my server from an intel to a sparc64 machine.
   Adaptive endianness only changes the byte order to native on write, and
   because the file system is mostly read-only, it'll need to byteswap all
   the time. And here comes 'zfs rewrite'!

4. I'm not sure how ZFS traverses the block tree; if it is done per file,
   rewrite may be used to move the data of one file closer together, which
   will reduce seek times. Because of the way ZFS works, the data may
   become fragmented and 'zfs rewrite' could be used for defragmentation.

5. Once file system encryption is implemented, this mechanism can be
   used to encrypt an existing file system, and it can also be used to
   change the encryption key.

What do you think?

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
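The selection rule in the proposal (rewrite only what is older than the moment the command started) can be sketched with a toy copy-on-write store. This is an illustration only: `CowPool`, `Block`, and `rewrite` are hypothetical names, not ZFS code, and a transaction-group counter stands in for the "date" mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    data: bytes
    birth_txg: int  # transaction group in which this copy was written


@dataclass
class CowPool:
    """Hypothetical toy store; not the real ZFS data structures."""
    txg: int = 0
    blocks: dict = field(default_factory=dict)  # block id -> Block

    def write(self, block_id, data):
        # Each write lands in a new transaction group and replaces the old copy.
        self.txg += 1
        self.blocks[block_id] = Block(data, self.txg)

    def rewrite(self, cutoff_txg):
        """Rewrite every block born at or before cutoff_txg; return the count."""
        rewritten = 0
        for block_id, block in list(self.blocks.items()):
            if block.birth_txg <= cutoff_txg:
                # Same data, new copy: current settings would apply here.
                self.write(block_id, block.data)
                rewritten += 1
        return rewritten
```

Blocks written after the cutoff are skipped, so data that already went through the current write path is not copied a second time.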
> What do you guys think about implementing 'zfs/zpool rewrite' command?
> It'll read every block older than the date when the command was executed
> and write it again (using standard ZFS COW mechanism, similar to how
> resilvering works, but the data is read from the same disk it is
> written to).

#1 How do you control I/O overhead?

#2 Snapshot blocks are never rewritten at the moment. Most of your
   suggestions seem to imply working on the "live" data, but doing that
   for snapshots as well might be tricky.

> 3. I created file system with huge amount of data, where most of the
> data is read-only. I change my server from intel to sparc64 machine.
> Adaptive endianness only change byte order to native on write and because
> file system is mostly read-only, it'll need to byteswap all the time.
> And here comes 'zfs rewrite'!

It's only the metadata that is modified anyway, not the file data. I
would hope that this could be done more easily than a full tree rewrite
(and again the issue with snapshots). Also, the overhead there probably
isn't going to be very high (since the metadata will be cached in most
cases).

Other than that, I'm guessing something like this will be necessary to
implement disk evacuation/removal. If you have to rewrite data from one
disk to elsewhere in the pool, then rewriting the entire tree shouldn't
be much harder.

--
Darren Dunham                                      ddunham at taos.com
Senior Technical Consultant        TAOS        http://www.taos.com/
Got some Dr Pepper?                      San Francisco, CA bay area
       < This line left intentionally blank to confuse you. >
On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote:
> Hi.
>
> What do you guys think about implementing 'zfs/zpool rewrite' command?
> It'll read every block older than the date when the command was executed
> and write it again (using standard ZFS COW mechanism, similar to how
> resilvering works, but the data is read from the same disk it is
> written to).
>
> I see few situations where it might be useful:
>
> 1. My file system is almost full (or not) and I'd like to enable
> compression on it. Unfortunately compression will work from now on and
> I'd also like to compress already stored data. Here comes 'zfs rewrite'!
>
> 2. I was bad boy and turned off checksumming. Now I suspect something
> corrupts my data and I'd really like to checksum everything. Ok, here
> comes 'zfs rewrite'!

In this case you deserve what you get.

> 3. I created file system with huge amount of data, where most of the
> data is read-only. I change my server from intel to sparc64 machine.
> Adaptive endianness only change byte order to native on write and because
> file system is mostly read-only, it'll need to byteswap all the time.
> And here comes 'zfs rewrite'!

Why would this help? (Obviously file data is never 'swapped').

--T

> 4. Not sure how ZFS traverse blocks tree, if it is done based on files,
> it may be used to move data from one file closer to each other, which
> will reduce seek times. Because of the way how ZFS works, the data may
> become fragmented and 'zfs rewrite' could be used for defragmentation.
>
> 5. Once file system encryption is implemented, this mechanism can be
> used to encrypt existing file system and also it can be used to change
> encryption key.
>
> What do you think?
>
> --
> Pawel Jakub Dawidek                       http://www.wheel.pl
> pjd at FreeBSD.org                         http://www.FreeBSD.org
> FreeBSD committer                         Am I Evil? Yes, I Am!
On January 27, 2007 12:27:17 AM -0200 Toby Thain <toby at smartgames.ca> wrote:
> On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote:
>> 3. I created file system with huge amount of data, where most of the
>> data is read-only. I change my server from intel to sparc64 machine.
>> Adaptive endianness only change byte order to native on write and because
>> file system is mostly read-only, it'll need to byteswap all the time.
>> And here comes 'zfs rewrite'!
>
> Why would this help? (Obviously file data is never 'swapped').

Metadata (incl checksums?) still has to be byte-swapped. Or would
atime updates also force a metadata update? Or am I totally mistaken.

-frank
On Fri, Jan 26, 2007 at 10:57:19PM -0800, Frank Cusack wrote:
> On January 27, 2007 12:27:17 AM -0200 Toby Thain <toby at smartgames.ca> wrote:
> >On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote:
> >>3. I created file system with huge amount of data, where most of the
> >>data is read-only. I change my server from intel to sparc64 machine.
> >>Adaptive endianness only change byte order to native on write and because
> >>file system is mostly read-only, it'll need to byteswap all the time.
> >>And here comes 'zfs rewrite'!
> >
> >Why would this help? (Obviously file data is never 'swapped').
>
> Metadata (incl checksums?) still has to be byte-swapped. Or would
> atime updates also force a metadata update? Or am I totally mistaken.

You're all correct. File data is never byte-swapped. Most metadata
needs to be byte-swapped, but it's generally only 1-2% of your space.
So the overhead shouldn't be significant, even if you never rewrite.

An atime update will indeed cause a znode rewrite (unless you run
with "zfs set atime=off"), so znodes will get rewritten by reads.

The only other non-trivial metadata is the indirect blocks.
All files up to 128k are stored in a single block: ZFS has
variable blocksize from 512 bytes to 128k, so a 35k file consumes
exactly 35k (not, say, 40k as it would with a fixed 8k blocksize).
Single-block files have no indirect blocks, and hence no metadata
other than the znode. So all that remains is the indirect blocks
for files larger than 128k -- which is to say, not very much.

Jeff
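The 35k versus 40k figures above are plain rounding arithmetic, which a short sketch makes concrete. The function names are made up for illustration, and the model assumes a single-block file is rounded up to a multiple of 512 bytes; it ignores compression and the znode itself.

```python
SECTOR = 512            # smallest block size mentioned above
MAX_BLOCK = 128 * 1024  # largest block size mentioned above


def fixed_block_usage(file_size, block_size=8 * 1024):
    """Space used when a file is stored in whole fixed-size blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def single_block_usage(file_size):
    """Space used by a file that fits in one variable-size block.

    Assumption: the block is rounded up to a multiple of 512 bytes.
    """
    if file_size > MAX_BLOCK:
        raise ValueError("file needs more than one block (and indirect blocks)")
    return -(-file_size // SECTOR) * SECTOR


if __name__ == "__main__":
    size = 35 * 1024
    print(fixed_block_usage(size) // 1024, "k with fixed 8k blocks")
    print(single_block_usage(size) // 1024, "k with one variable-size block")
```

A 35k file needs five 8k blocks (40k) in the fixed scheme, while one variable-size block holds it with no slack because 35k is already a multiple of 512.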
On 27-Jan-07, at 4:57 AM, Frank Cusack wrote:
> On January 27, 2007 12:27:17 AM -0200 Toby Thain
> <toby at smartgames.ca> wrote:
>> On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote:
>>> 3. I created file system with huge amount of data, where most of the
>>> data is read-only. I change my server from intel to sparc64 machine.
>>> Adaptive endianness only change byte order to native on write and because
>>> file system is mostly read-only, it'll need to byteswap all the time.
>>> And here comes 'zfs rewrite'!
>>
>> Why would this help? (Obviously file data is never 'swapped').
>
> Metadata (incl checksums?) still has to be byte-swapped.

I'm aware, but is this really ever going to be an issue?

--T

> Or would
> atime updates also force a metadata update? Or am I totally mistaken.
>
> -frank
On January 27, 2007 6:15:29 AM -0200 Toby Thain <toby at smartgames.ca> wrote:
>
> On 27-Jan-07, at 4:57 AM, Frank Cusack wrote:
>
>> On January 27, 2007 12:27:17 AM -0200 Toby Thain
>> <toby at smartgames.ca> wrote:
>>> On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote:
>>>> 3. I created file system with huge amount of data, where most of the
>>>> data is read-only. I change my server from intel to sparc64 machine.
>>>> Adaptive endianness only change byte order to native on write and because
>>>> file system is mostly read-only, it'll need to byteswap all the time.
>>>> And here comes 'zfs rewrite'!
>>>
>>> Why would this help? (Obviously file data is never 'swapped').
>>
>> Metadata (incl checksums?) still has to be byte-swapped.
>
> I'm aware, but is this really ever going to be an issue?

Well, it IS extra work. But yeah, it seems pretty insignificant to me.

-frank
On Fri, Jan 26, 2007 at 06:08:50PM -0800, Darren Dunham wrote:
> > What do you guys think about implementing 'zfs/zpool rewrite' command?
> > It'll read every block older than the date when the command was executed
> > and write it again (using standard ZFS COW mechanism, similar to how
> > resilvering works, but the data is read from the same disk it is
> > written to).
>
> #1 How do you control I/O overhead?

The same way it is handled for scrub and resilver.

> #2 Snapshot blocks are never rewritten at the moment. Most of your
>    suggestions seem to imply working on the "live" data, but doing that
>    for snapshots as well might be tricky.

Good point, see below.

> > 3. I created file system with huge amount of data, where most of the
> > data is read-only. I change my server from intel to sparc64 machine.
> > Adaptive endianness only change byte order to native on write and because
> > file system is mostly read-only, it'll need to byteswap all the time.
> > And here comes 'zfs rewrite'!
>
> It's only the metadata that is modified anyway, not the file data. I
> would hope that this could be done more easily than a full tree rewrite
> (and again the issue with snapshots). Also, the overhead there probably
> isn't going to be very high (since the metadata will be cached in most
> cases).

Agreed. In this case there should probably be a rewrite-only-metadata
mode. I agree the overhead is probably not high, but on the other hand,
I'm quite sure there are workloads that will see the difference, e.g.
'find / -name something'.

> Other than that, I'm guessing something like this will be necessary to
> implement disk evacuation/removal. If you have to rewrite data from one
> disk to elsewhere in the pool, then rewriting the entire tree shouldn't
> be much harder.

How did I forget about this one? :) That's right. I believe ZFS will
gain such an ability at some point, and rewrite functionality fits very
nicely here: mark the disk/mirror/raid-z as no-more-writes and start the
rewrite process (probably limited to this entity only). To implement
such functionality there also has to be a way to migrate snapshot data,
so sooner or later there will be a need for moving snapshot blocks.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
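The evacuation idea (stop allocating on one device, then rewrite only the blocks that live there) might look roughly like this in a toy model. Everything here is hypothetical: `evacuate` is not a ZFS function, a "block map" is just a dict from block id to device name, and round-robin placement is an arbitrary choice for the sketch.

```python
def evacuate(block_map, vdevs, leaving):
    """Move every block off `leaving` onto the remaining vdevs.

    block_map: dict of block id -> vdev name (modified in place).
    vdevs:     all vdev names in the toy pool.
    leaving:   the vdev marked as no-more-writes.
    Returns the number of blocks that were rewritten elsewhere.
    """
    targets = [v for v in vdevs if v != leaving]
    if not targets:
        raise ValueError("no vdev left to receive the data")
    moved = 0
    for block_id, vdev in list(block_map.items()):
        if vdev == leaving:
            # Rewrite limited to this entity: other blocks stay where they are.
            block_map[block_id] = targets[moved % len(targets)]
            moved += 1
    return moved
```

The sketch leaves out exactly the hard part raised in the thread: blocks referenced by snapshots would need their pointers updated too.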
On January 28, 2007 4:59:48 PM +0100 Pawel Jakub Dawidek <pjd at FreeBSD.org> wrote:
> On Fri, Jan 26, 2007 at 06:08:50PM -0800, Darren Dunham wrote:
>> > 3. I created file system with huge amount of data, where most of the
>> > data is read-only. I change my server from intel to sparc64 machine.
>> > Adaptive endianness only change byte order to native on write and
>> > because file system is mostly read-only, it'll need to byteswap all
>> > the time. And here comes 'zfs rewrite'!
>>
>> It's only the metadata that is modified anyway, not the file data. I
>> would hope that this could be done more easily than a full tree rewrite
>> (and again the issue with snapshots). Also, the overhead there probably
>> isn't going to be very high (since the metadata will be cached in most
>> cases).
>
> Agreed. Probably in this case there should be rewrite-only-metadata
> mode. I agree the overhead is probably not high, but on the other hand,
> I'm quite sure there are workload, which will see the difference, eg.
> 'find / -name something'.

I'd imagine even for that it wouldn't matter. The I/O time will dwarf
any time spent byte-swapping. Easily tested though. Make sure you set
atime=off so that your find isn't causing write I/O.

-frank
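For a feel of what the byte-swapping work itself involves, here is a minimal sketch that reverses the byte order of a buffer of 64-bit words using Python's standard `array` module. Treating metadata as a flat run of 64-bit words is a simplifying assumption for illustration, and `byteswap_words` is a made-up helper, not anything from ZFS.

```python
import array


def byteswap_words(raw):
    """Return `raw` with the byte order of every 64-bit word reversed.

    Assumption: the buffer length is a multiple of 8 bytes.
    """
    words = array.array("Q")  # unsigned 64-bit items on common platforms
    if words.itemsize != 8:
        raise RuntimeError("this sketch assumes 8-byte 'Q' items")
    words.frombytes(raw)
    words.byteswap()          # in-place swap of every item
    return words.tobytes()
```

Wrapping a call like this in `timeit` over a metadata-sized buffer and comparing it against the time of a disk read would be one way to try the claim above on a given machine.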
Hello Jeff,

Saturday, January 27, 2007, 8:27:09 AM, you wrote:

JB> You're all correct. File data is never byte-swapped. Most metadata
JB> needs to be byte-swapped, but it's generally only 1-2% of your space.
JB> So the overhead shouldn't be significant, even if you never rewrite.

I remember that some time ago Sun touted ZFS as having some interesting
new technology for dealing with endianness, with a patent pending for it.
Can you share what it was about?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Pawel Jakub Dawidek wrote:
> What do you guys think about implementing 'zfs/zpool rewrite' command?
> It'll read every block older than the date when the command was executed
> and write it again (using standard ZFS COW mechanism, similar to how
> resilvering works, but the data is read from the same disk it is
> written to).

Yeah, that would be great, and in fact we are implementing such a thing
right now (to support pool shrinkage, among other features). The tricky
part is dealing with block pointers that appear in multiple places (e.g.,
snapshots and clones). Having "rewrite everything" result in more space
being used would not be acceptable.

--matt
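The space problem with shared block pointers shows up even in a very small model. In the sketch below a "dataset" is a dict from file name to block id, and a snapshot simply holds the same block ids as the live dataset. `RefPool`, `naive_rewrite`, and `shared_rewrite` are hypothetical names for illustration and say nothing about how the real implementation handles this.

```python
class RefPool:
    """Toy block store where several datasets may point at the same block."""

    def __init__(self):
        self.blocks = {}   # block id -> data
        self.next_id = 0

    def alloc(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def free_unreferenced(self, datasets):
        # Drop every block that no dataset points at any more.
        live = {b for refs in datasets.values() for b in refs.values()}
        for block_id in list(self.blocks):
            if block_id not in live:
                del self.blocks[block_id]


def naive_rewrite(pool, datasets, name):
    """Rewrite only the blocks reachable from one dataset."""
    refs = datasets[name]
    for key, old_id in list(refs.items()):
        refs[key] = pool.alloc(pool.blocks[old_id])
    # Old copies survive if a snapshot still references them.
    pool.free_unreferenced(datasets)


def shared_rewrite(pool, datasets):
    """Rewrite each block once and repoint every dataset that references it."""
    moved = {}  # old block id -> new block id
    for refs in datasets.values():
        for key, old_id in list(refs.items()):
            if old_id not in moved:
                moved[old_id] = pool.alloc(pool.blocks[old_id])
            refs[key] = moved[old_id]
    pool.free_unreferenced(datasets)
```

With two blocks shared between a live dataset and its snapshot, `naive_rewrite` leaves four blocks allocated, while `shared_rewrite` ends with two because every pointer to a block is updated together.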