Bill Sommerfeld
2006-Nov-08 03:49 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
On a v40z running snv_51, I'm doing a "zpool replace z c1t4d0 c1t5d0".

(so, why am I doing the replace?  The outgoing disk has been reporting
read errors sporadically but with increasing frequency over time..)

zpool iostat -v shows writes going to the old (outgoing) disk as well as
to the replacement disk.  Is this intentional?

Seems counterintuitive as I'd think you'd want to touch a suspect disk
as little as possible and as nondestructively as possible...

A representative snapshot from "zpool iostat -v":

                   capacity     operations    bandwidth
pool             used  avail   read  write   read  write
-------------   -----  -----  -----  -----  -----  -----
z                306G   714G  1.43K    658  23.5M  1.11M
  raidz1         109G   231G  1.08K    392  22.3M   497K
    replacing       -      -      0   1012      0  5.72M
      c1t4d0        -      -      0    753      0  5.73M
      c1t5d0        -      -      0    790      0  5.72M
    c2t12d0         -      -    339    177  9.46M   149K
    c2t13d0         -      -    317    177  9.08M   149K
    c3t12d0         -      -    330    181  9.27M   147K
    c3t13d0         -      -    352    180  9.45M   146K
  raidz1         100G   240G    117    101   373K   225K
    c1t3d0          -      -     65     33  3.99M  64.1K
    c2t10d0         -      -     60     44  3.77M  63.2K
    c2t11d0         -      -     62     42  3.87M  63.4K
    c3t10d0         -      -     63     42  3.88M  62.3K
    c3t11d0         -      -     65     35  4.06M  61.8K
  raidz1        96.2G   244G    234    164   768K   415K
    c1t2d0          -      -    129     49  7.85M   112K
    c2t8d0          -      -    133     54  8.05M   112K
    c2t9d0          -      -    132     56  8.08M   113K
    c3t8d0          -      -    132     52  8.01M   113K
    c3t9d0          -      -    132     49  8.16M   112K

					- Bill
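For reference, the operation under discussion boils down to a handful of
commands; the pool and device names are the ones from the message above,
and the 5-second sampling interval is only an example:

    # Start replacing the suspect disk with the new one.
    zpool replace z c1t4d0 c1t5d0

    # Watch resilver progress and the temporary "replacing" group.
    zpool status -v z

    # Sample per-vdev I/O every 5 seconds; this is the view quoted above.
    zpool iostat -v z 5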
Erblichs
2006-Nov-08 09:54 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
Bill Sommerfield,

Are there any existing snaps?

Can you have any scripts that may be
removing aged files?

	Mitchell Erblich
	------------------
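Both questions can be checked directly from the shell; the pool name below
is the one from the original post:

    # Any snapshots in the pool?
    zfs list -t snapshot -r z

    # Any cron jobs that might be pruning aged files?
    crontab -l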
Bill Sommerfeld
2006-Nov-08 13:56 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
On Wed, 2006-11-08 at 01:54 -0800, Erblichs wrote:
> Bill Sommerfield,

that's not how my name is spelled

> Are there any existing snaps?

no.  why do you think this would matter?

> Can you have any scripts that may be
> removing aged files?

no; there was essentially no other activity on the pool other than the
"replace".

why do you think this would matter?

					- Bill
Erblichs
2006-Nov-10 03:18 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
Bill Sommerfield,

Because, first, I have seen a lot of I/O occur while a snapshot is being
aged out of a system.

I don't think that during the resilvering process accesses (reads, writes)
are completely stopped to the orig_dev.  I expect at least some meta reads
are going on.

With some normal sporadic read failure, accessing
the whole pool may force repeated reads for
the replace.

So, I was thinking that a read access
could ALSO be updating the znode.  This newer
time/date stamp is causing a lot of writes.

Depending on how the FS metadata and blocks are being accessed, the
orig_dev may also have some normal writes until it is offlined.

	Mitchell Erblich
	-----------------
Bill Sommerfeld
2006-Nov-10 03:42 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
On Thu, 2006-11-09 at 19:18 -0800, Erblichs wrote:
> Bill Sommerfield,

Again, that's not how my name is spelled.

> With some normal sporadic read failure, accessing
> the whole pool may force repeated reads for
> the replace.

please look again at the iostat I posted:

                   capacity     operations    bandwidth
pool             used  avail   read  write   read  write
-------------   -----  -----  -----  -----  -----  -----
z                306G   714G  1.43K    658  23.5M  1.11M
  raidz1         109G   231G  1.08K    392  22.3M   497K
    replacing       -      -      0   1012      0  5.72M
      c1t4d0        -      -      0    753      0  5.73M
      c1t5d0        -      -      0    790      0  5.72M
    c2t12d0         -      -    339    177  9.46M   149K
    c2t13d0         -      -    317    177  9.08M   149K
    c3t12d0         -      -    330    181  9.27M   147K
    c3t13d0         -      -    352    180  9.45M   146K
  raidz1         100G   240G    117    101   373K   225K
    c1t3d0          -      -     65     33  3.99M  64.1K
    c2t10d0         -      -     60     44  3.77M  63.2K
    c2t11d0         -      -     62     42  3.87M  63.4K
    c3t10d0         -      -     63     42  3.88M  62.3K
    c3t11d0         -      -     65     35  4.06M  61.8K
  raidz1        96.2G   244G    234    164   768K   415K
    c1t2d0          -      -    129     49  7.85M   112K
    c2t8d0          -      -    133     54  8.05M   112K
    c2t9d0          -      -    132     56  8.08M   113K
    c3t8d0          -      -    132     52  8.01M   113K
    c3t9d0          -      -    132     49  8.16M   112K

there were no (zero, none, nada, zilch) reads directed to the failing
device.  there were a lot of WRITES to the failing device; in fact, the
same volume of data was being written to BOTH the failing device and the
new device.

> So, I was thinking that a read access
> could ALSO be updating the znode.  This newer
> time/date stamp is causing a lot of writes.

that's not going to be significant as a source of traffic; again, look
at the above iostat, which was representative of the load throughout the
resilver.
Erblichs
2006-Nov-10 05:19 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
Bill, Sommerfeld, Sorry,

However, I am trying to explain what I think is
happening on your system and why I consider this
normal.

Most of the reads/FS "replace" are normally
at the block level.

To copy a FS, some level of reading MUST be done
at the orig_dev.
At what level and whether it is recorded as a
normal vnode read / mmap op for the direct and
indirect blocks is another story.

But it is being done. It is just not being
recorded in FS stats. Read stats are normally used
for normal FS object access requests.

Secondly, maybe starting with the "uberblock", the
rest of the meta data is probably being read. And
because of the normal async access of FSs, it would
not surprise me that each znode's access time
field is then updated. Remember, that unless you are just
touching a FS low-level (file) object, all writes are
preceded by at least 1 read.

	Mitchell Erblich
	----------------
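If access-time updates really were the source of the writes, that theory
is straightforward to test, since atime handling is a per-dataset
property.  The commands below are only a sketch using the pool name from
this thread; they were not actually run on the system in question:

    # See whether atime updates are enabled on the pool's datasets.
    zfs get -r atime z

    # Turn them off temporarily and watch whether the write stream to the
    # outgoing disk changes (reversible with 'zfs set atime=on z').
    zfs set atime=off z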
Al Hopper
2006-Nov-10 13:32 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
On Thu, 9 Nov 2006, Erblichs wrote:

> Bill, Sommerfeld, Sorry,
>
> However, I am trying to explain what I think is
> happening on your system and why I consider this
> normal.
>
> Most of the reads/FS "replace" are normally
                                      ^^^^^
> at the block level.
>
> To copy a FS, some level of reading MUST be done
                               ^^^^^^^^
> at the orig_dev.
> At what level and whether it is recorded as a
> normal vnode read / mmap op for the direct and
                ^^^^^
> indirect blocks is another story.
>
> But it is being done. It is just not being
> recorded in FS stats. Read stats are normally used
                        ^^^^^
> for normal FS object access requests.
>
> Secondly, maybe starting with the "uberblock", the
> rest of the meta data is probably being read. And
                                     ^^^^^^^^^^^
> because of the normal async access of FSs, it would
> not surprise me that each znode's access time
> field is then updated. Remember, that unless you are just
> touching a FS low-level (file) object, all writes are
> preceded by at least 1 read.
                       ^^^^^^^^
> 	Mitchell Erblich
> 	----------------

Mitchell - Bill is asking about WRITES and you're talking READS!  Your
posts are making absolutely no sense to me....

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
             OpenSolaris Governing Board (OGB) Member - Feb 2006
Bill Sommerfeld
2006-Nov-10 14:47 UTC
[zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?
On Thu, 2006-11-09 at 21:19 -0800, Erblichs wrote:
> Bill, Sommerfeld, Sorry,
>
> However, I am trying to explain what I think is
> happening on your system and why I consider this
> normal.

I'm not interested in speculation.  Please do not respond to this message.

> To copy a FS, some level of reading MUST be done
> at the orig_dev.

you appear to suffer from poor reading comprehension.

according to zpool iostat, the original device was not being read.  AT
ALL.  it was being WRITTEN.

I found this behavior unusual and wanted to know from someone actually
RESPONSIBLE for ZFS whether this was expected behavior or not.

I'd appreciate it if only people who have made changes to the ZFS
codebase found in opensolaris respond further to this thread.

					- Bill
Anton B. Rang
2006-Nov-10 15:47 UTC
[zfs-discuss] Re: I/O patterns during a "zpool replace": why write to the disk being replaced?
> I'd appreciate it if only people who have made changes to the ZFS
> codebase found in opensolaris respond further to this thread.

Well.  I haven't made changes, but I can read code.

When replacing a device, ZFS internally takes the device being replaced
and creates a mirror between the old and new device for the duration of
the replacement.  This is presumably done to leverage the existing
resilvering code to copy data from one device to the other.

There's nothing special done to prevent writes to either side of the
resulting mirror, which is why you see roughly equal amounts of data
being written to each side.  Every new block written to the disk being
replaced will be written to both the old and new device.
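The temporary mirror is also visible from the command line while the
operation runs: zpool status groups the old and new disks under a
"replacing" vdev, much as the zpool iostat -v output earlier in the thread
does.  The excerpt below is only an illustrative sketch, not output
captured from this system, and the exact layout varies by release:

    # zpool status z
      ...
            raidz1       ONLINE       0     0     0
              replacing  ONLINE       0     0     0
                c1t4d0   ONLINE       0     0     0
                c1t5d0   ONLINE       0     0     0
              c2t12d0    ONLINE       0     0     0
      ...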