Hello everyone,

As one of the steps of improving my ZFS home fileserver (snv_134) I wanted
to replace a 1TB disk with a newer one of the same vendor/model/size,
because the new one has 64MB of cache vs. 16MB in the previous one. The
removed disk will be used for backups, so I figured it's better to have the
64MB-cache disk in the on-line pool than in the backup set sitting off-line
all day.

To replace it I did:

$ zpool offline datos c12t0d0

then shut down the server, replaced the physical disk and booted back up.
Then:

$ zpool replace datos c12t0d0

Now the resilvering is taking way too much time to complete and the disk
throughput is horrible. Take a look:

$ zpool status
  pool: datos
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 1h7m, 0.04% done, 2744h48m to go
config:

        NAME              STATE     READ WRITE CKSUM
        datos             DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            c12t1d0       ONLINE       0     0     0
            replacing-1   DEGRADED     0     0     0
              c12t0d0s0/o FAULTED      0     0     0  corrupted data
              c12t0d0     ONLINE       0     0     0  313M resilvered
            c12t2d0       ONLINE       0     0     0

errors: No known data errors

$ zpool iostat -v
                     capacity     operations    bandwidth
pool               alloc   free   read  write   read  write
-----------------  -----  -----  -----  -----  -----  -----
datos              2.27T   460G     14     29  32.3K  64.7K
  raidz1           2.27T   460G     14     29  32.3K  64.7K
    c12t1d0            -      -     14      0  16.3K      0
    replacing          -      -      0     44      0  50.1K
      c12t0d0s0/o      -      -      0      0      0      0
      c12t0d0          -      -      0     44      0  50.1K
    c12t2d0            -      -     14      0  16.9K      0
-----------------  -----  -----  -----  -----  -----  -----

$ iostat -xnc
     cpu
 us sy wt id
  0  0  0 100
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   47.6    0.0   58.8  0.0 10.0    0.0  210.1   0 100 c12t0d0
   16.8    0.0   33.1    0.0  0.0  0.2    0.0   13.8   0  22 c12t1d0
   16.6    0.0   20.1    0.0  0.0  0.4    0.0   22.0   0  29 c12t2d0

Needless to say, this was working perfectly fine an hour ago with the
previous disk, and the new one was already tested during the last couple of
days, copying data and doing scrubs without any issue. I already checked the
cables on that disk but the throughput remains the same.

Do you have any idea what's going on with the resilver? Thanks in advance.

Regards,
Leandro.
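[Editor's note: for anyone following along, a minimal sketch of how to keep
an eye on a resilver like this with the same stock tools used above; the
5-second interval is an arbitrary choice, not anything required.]

# Pool-level resilver throughput per vdev, refreshed every 5 seconds
$ zpool iostat -v datos 5

# Per-device service times; a healthy disk should not sit at 100% busy
# with ~210ms average service time the way c12t0d0 does above
$ iostat -xnc 5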
ZFS first does a scan of indices and other metadata, which requires lots of
seeks. Only after that does the bulk resilvering start. I guess if you give
it an hour, it'll be done.

roy

----- "Leandro Vanden Bosch" <l.vbosch@gmail.com> skrev:

> As one of the steps of improving my ZFS home fileserver (snv_134) I
> wanted to replace a 1TB disk with a newer one of the same
> vendor/model/size [...]
>  scrub: resilver in progress for 1h7m, 0.04% done, 2744h48m to go
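[Editor's note: one way to test Roy's theory without staring at the screen
is to sample the progress line periodically; a throwaway sketch, with the
pool name from this thread and an arbitrary 10-minute interval.]

# If the seek-bound metadata walk is the bottleneck, the "% done" figure
# should start climbing noticeably faster once that phase finishes
$ while true; do date; zpool status datos | grep "resilver in progress"; sleep 600; done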
Thanks, Roy, for your reply.

I actually waited a little more than an hour, but I'm still going to wait a
little longer, following your suggestion and a little hunch of mine. I just
found out that this new WD10EARS is one of the new 4K-sector disks. I
believed that only the 2TB models were 4K. See:

BEFORE

Apr 23 02:51:37 alexia sata: [ID 663010 kern.info] /pci@0,0/pci1043,82d4@1f,2 :
Apr 23 02:51:37 alexia sata: [ID 761595 kern.info] SATA disk device at port 0
Apr 23 02:51:37 alexia sata: [ID 846691 kern.info] model WDC WD10EACS-00ZJB0
Apr 23 02:51:37 alexia sata: [ID 693010 kern.info] firmware 01.01B01
Apr 23 02:51:37 alexia sata: [ID 163988 kern.info] serial number WD-WCASJ1862768
Apr 23 02:51:37 alexia sata: [ID 594940 kern.info] supported features:
Apr 23 02:51:37 alexia sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Apr 23 02:51:37 alexia sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Apr 23 02:51:37 alexia sata: [ID 349649 kern.info] Supported queue depth 32
Apr 23 02:51:37 alexia sata: [ID 349649 kern.info] capacity 1953525168 sectors
Apr 23 02:51:37 alexia scsi: [ID 583861 kern.info] sd7 at ahci0: target 0 lun 0
Apr 23 02:51:37 alexia genunix: [ID 936769 kern.info] sd7 is /pci@0,0/pci1043,82d4@1f,2/disk@0,0
Apr 23 02:51:37 alexia genunix: [ID 408114 kern.info] /pci@0,0/pci1043,82d4@1f,2/disk@0,0 (sd7) online
Apr 23 02:51:37 alexia sata: [ID 663010 kern.info] /pci@0,0/pci1043,82d4@1f,2 :
Apr 23 02:51:37 alexia sata: [ID 761595 kern.info] SATA disk device at port 1
Apr 23 02:51:37 alexia sata: [ID 846691 kern.info] model WDC WD10EACS-00ZJB0
Apr 23 02:51:37 alexia sata: [ID 693010 kern.info] firmware 01.01B01
Apr 23 02:51:37 alexia sata: [ID 163988 kern.info] serial number WD-WCASJ1859398
Apr 23 02:51:37 alexia sata: [ID 594940 kern.info] supported features:
Apr 23 02:51:37 alexia sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Apr 23 02:51:37 alexia sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Apr 23 02:51:37 alexia sata: [ID 349649 kern.info] Supported queue depth 32
Apr 23 02:51:37 alexia sata: [ID 349649 kern.info] capacity 1953525168 sectors
Apr 23 02:51:37 alexia scsi: [ID 583861 kern.info] sd8 at ahci0: target 1 lun 0
Apr 23 02:51:37 alexia genunix: [ID 936769 kern.info] sd8 is /pci@0,0/pci1043,82d4@1f,2/disk@1,0
Apr 23 02:51:37 alexia genunix: [ID 408114 kern.info] /pci@0,0/pci1043,82d4@1f,2/disk@1,0 (sd8) online
Apr 23 02:51:37 alexia sata: [ID 663010 kern.info] /pci@0,0/pci1043,82d4@1f,2 :
Apr 23 02:51:37 alexia sata: [ID 761595 kern.info] SATA disk device at port 2
Apr 23 02:51:37 alexia sata: [ID 846691 kern.info] model WDC WD10EADS-65L5B1
Apr 23 02:51:37 alexia sata: [ID 693010 kern.info] firmware 01.01A01
Apr 23 02:51:37 alexia sata: [ID 163988 kern.info] serial number WD-WCAU4A901698
Apr 23 02:51:37 alexia sata: [ID 594940 kern.info] supported features:
Apr 23 02:51:37 alexia sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Apr 23 02:51:37 alexia sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Apr 23 02:51:37 alexia sata: [ID 349649 kern.info] Supported queue depth 32
Apr 23 02:51:37 alexia sata: [ID 349649 kern.info] capacity 1953525168 sectors
Apr 23 02:51:37 alexia scsi: [ID 583861 kern.info] sd9 at ahci0: target 2 lun 0

AFTER

Apr 24 16:19:15 alexia sata: [ID 663010 kern.info] /pci@0,0/pci1043,82d4@1f,2 :
Apr 24 16:19:15 alexia sata: [ID 761595 kern.info] SATA disk device at port 0
Apr 24 16:19:15 alexia sata: [ID 846691 kern.info] model WDC WD10EARS-00Y5B1
Apr 24 16:19:15 alexia sata: [ID 693010 kern.info] firmware 80.00A80
Apr 24 16:19:15 alexia sata: [ID 163988 kern.info] serial number WD-WCAV57869374
Apr 24 16:19:15 alexia sata: [ID 594940 kern.info] supported features:
Apr 24 16:19:15 alexia sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Apr 24 16:19:15 alexia sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Apr 24 16:19:15 alexia sata: [ID 349649 kern.info] Supported queue depth 32
Apr 24 16:19:15 alexia sata: [ID 349649 kern.info] capacity 1953525168 sectors
Apr 24 16:19:15 alexia scsi: [ID 583861 kern.info] sd7 at ahci0: target 0 lun 0
Apr 24 16:19:15 alexia genunix: [ID 936769 kern.info] sd7 is /pci@0,0/pci1043,82d4@1f,2/disk@0,0
Apr 24 16:19:15 alexia genunix: [ID 408114 kern.info] /pci@0,0/pci1043,82d4@1f,2/disk@0,0 (sd7) online
Apr 24 16:19:15 alexia sata: [ID 663010 kern.info] /pci@0,0/pci1043,82d4@1f,2 :
Apr 24 16:19:15 alexia sata: [ID 761595 kern.info] SATA disk device at port 1
Apr 24 16:19:15 alexia sata: [ID 846691 kern.info] model WDC WD10EACS-00ZJB0
Apr 24 16:19:15 alexia sata: [ID 693010 kern.info] firmware 01.01B01
Apr 24 16:19:15 alexia sata: [ID 163988 kern.info] serial number WD-WCASJ1859398
Apr 24 16:19:15 alexia sata: [ID 594940 kern.info] supported features:
Apr 24 16:19:15 alexia sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Apr 24 16:19:15 alexia sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Apr 24 16:19:15 alexia sata: [ID 349649 kern.info] Supported queue depth 32
Apr 24 16:19:15 alexia sata: [ID 349649 kern.info] capacity 1953525168 sectors
Apr 24 16:19:15 alexia scsi: [ID 583861 kern.info] sd8 at ahci0: target 1 lun 0
Apr 24 16:19:15 alexia genunix: [ID 936769 kern.info] sd8 is /pci@0,0/pci1043,82d4@1f,2/disk@1,0
Apr 24 16:19:15 alexia genunix: [ID 408114 kern.info] /pci@0,0/pci1043,82d4@1f,2/disk@1,0 (sd8) online
Apr 24 16:19:15 alexia sata: [ID 663010 kern.info] /pci@0,0/pci1043,82d4@1f,2 :
Apr 24 16:19:15 alexia sata: [ID 761595 kern.info] SATA disk device at port 2
Apr 24 16:19:15 alexia sata: [ID 846691 kern.info] model WDC WD10EADS-65L5B1
Apr 24 16:19:15 alexia sata: [ID 693010 kern.info] firmware 01.01A01
Apr 24 16:19:15 alexia sata: [ID 163988 kern.info] serial number WD-WCAU4A901698
Apr 24 16:19:15 alexia sata: [ID 594940 kern.info] supported features:
Apr 24 16:19:15 alexia sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Apr 24 16:19:15 alexia sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Apr 24 16:19:15 alexia sata: [ID 349649 kern.info] Supported queue depth 32
Apr 24 16:19:15 alexia sata: [ID 349649 kern.info] capacity 1953525168 sectors
Apr 24 16:19:15 alexia scsi: [ID 583861 kern.info] sd9 at ahci0: target 2 lun 0

I had two EACS and one EADS; now one of each: EARS, EACS and EADS. I'm
reading up a little on these 4K sectors and OpenSolaris. I believe I'm going
to swap the old disk back in and see what happens. Even querying something
from that EARS disk (format, prtvtoc) takes a few seconds longer than on the
other two.

Leandro.
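[Editor's note: the BEFORE/AFTER comparison above can be pulled straight out
of the system log with one command; a sketch assuming the default
/var/adm/messages location on snv_134 and grepping for the exact strings
shown in the excerpts.]

# List the SATA port, drive model and firmware reported at each boot
$ egrep "SATA disk device at port|model WDC|firmware " /var/adm/messages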
Confirmed, then, that the issue was with the WD10EARS. I swapped it out with
the old one and things look a lot better:

  pool: datos
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h3m, 0.02% done, 256h30m to go
config:

        NAME              STATE     READ WRITE CKSUM
        datos             DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            c12t1d0       ONLINE       0     0     0
            replacing-1   DEGRADED     0     0     0
              c12t0d0s0/o OFFLINE      0     0     0
              c12t0d0     ONLINE       0     0     0  184M resilvered
            c12t2d0       ONLINE       0     0     0

Note that the estimated time to completion is now about 256h, whereas with
the EARS it was more than 2500h. When doing scrubs on this pool I usually
see that the performance sucks initially, but after some time it increases
significantly, reaching 30-50 MB/s.

Something else I want to add: when I went to replace the disk this last time
I had some resistance from ZFS because of this:

leandro@alexia:~$ zpool status
  pool: datos
 state: DEGRADED
 scrub: resilver completed after 0h0m with 0 errors on Sat Apr 24 17:34:03 2010
config:

        NAME              STATE     READ WRITE CKSUM
        datos             DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            c12t1d0       ONLINE       0     0     0
            replacing-1   DEGRADED     0     0     4
              c12t0d0s0/o ONLINE       0     0     0  16.0M resilvered
              c12t0d0     OFFLINE      0     0     0
            c12t2d0       ONLINE       0     0     0

The pool had c12t0d0s0/o in it, and a simple replace wouldn't work:

$ pfexec zpool replace datos c12t0d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c12t0d0s0 is part of active ZFS pool datos. Please see zpool(1M).

The -f option didn't work either:

$ pfexec zpool replace -f datos c12t0d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c12t0d0s0 is part of active ZFS pool datos. Please see zpool(1M).

No way to detach the intruder:

$ pfexec zpool detach datos c12t0d0s0
cannot detach c12t0d0s0: no such device in pool

Well, maybe not with that name:

$ pfexec zpool detach datos c12t0d0s0/o
$ zpool status datos
  pool: datos
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver completed after 0h0m with 0 errors on Sat Apr 24 17:34:03 2010
config:

        NAME         STATE     READ WRITE CKSUM
        datos        DEGRADED     0     0     0
          raidz1-0   DEGRADED     0     0     0
            c12t1d0  ONLINE       0     0     0
            c12t0d0  OFFLINE      0     0     0
            c12t2d0  ONLINE       0     0     0

errors: No known data errors

Now the replace works:

$ pfexec zpool replace datos c12t0d0
$ zpool status datos
  pool: datos
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 177h22m to go
config:

        NAME              STATE     READ WRITE CKSUM
        datos             DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            c12t1d0       ONLINE       0     0     0
            replacing-1   DEGRADED     0     0     0
              c12t0d0s0/o OFFLINE      0     0     0
              c12t0d0     ONLINE       0     0     0  19.5M resilvered
            c12t2d0       ONLINE       0     0     0

errors: No known data errors

By the time I'm writing these last lines the performance has already
improved:

$ zpool status datos
  pool: datos
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h23m, 0.30% done, 126h14m to go
config:

        NAME              STATE     READ WRITE CKSUM
        datos             DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            c12t1d0       ONLINE       0     0     0
            replacing-1   DEGRADED     0     0     0
              c12t0d0s0/o OFFLINE      0     0     0
              c12t0d0     ONLINE       0     0     0  2.34G resilvered
            c12t2d0       ONLINE       0     0     0

errors: No known data errors

$ zpool iostat -v datos 5
                     capacity     operations    bandwidth
pool               alloc   free   read  write   read  write
-----------------  -----  -----  -----  -----  -----  -----
datos              2.27T   460G    134      1  9.62M  7.75K
  raidz1           2.27T   460G    134      1  9.62M  7.75K
    c12t1d0            -      -    118      0  4.80M      0
    replacing          -      -      0    135      0  4.82M
      c12t0d0s0/o      -      -      0      0      0      0
      c12t0d0          -      -      0    132      0  4.82M
    c12t2d0            -      -    106      0  4.84M      0
-----------------  -----  -----  -----  -----  -----  -----

Thanks for reading. :)

Leandro.
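[Editor's note: to save the next person the trial and error, here is the
whole dance above condensed into the commands that actually worked. Names
are exactly as in this thread; "c12t0d0s0/o" is ZFS's label for the stale
old half of the replacing vdev.]

$ pfexec zpool detach datos c12t0d0s0/o   # detach the stale old-device entry first
$ pfexec zpool replace datos c12t0d0      # now the replace is accepted
$ zpool status datos                      # confirm the resilver has started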
Just in case any of you want to jump in: I opened case #100426-001820 with
WD to ask for a firmware update for the WD??EARS drives without any 512-byte
emulation, just the 4K sectors exposed directly. The WD forum thread:

http://community.wdc.com/t5/Desktop/Poor-performace-in-OpenSolaris-with-4K-sector-drive-WD10EARS-in/m-p/20947#M1263

is watched by the dev department and would carry a lot more weight if more
comments/signatures were added by you guys.

Have a good night.

Regards,
Leandro.
On Sat, Apr 24, 2010 at 5:02 PM, Leandro Vanden Bosch
<l.vbosch@gmail.com> wrote:
> Confirmed then that the issue was with the WD10EARS.
> I swapped it out with the old one and things look a lot better:

The problem with the EARS drive is that it was not 4K-aligned. The Solaris
partition table was, but that does not take into account the fdisk MBR. As
a result, everything was off by one cylinder.

-B

--
Brandon High : bhigh@freaks.com
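[Editor's note: a back-of-the-envelope illustration of Brandon's point in
general terms, not necessarily what was on this particular disk: a classic
fdisk layout starts the first partition at LBA 63, and with 512-byte logical
sectors a start LBA is 4K-aligned only if it is a multiple of 8 (4096/512).]

$ echo $(( 63 % 8 ))
7
# Non-zero remainder: the partition sits 3.5 KB past a 4K boundary, so every
# "aligned" 4K write inside it straddles two physical sectors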
Brandon,

Thanks for replying to the message.

I believe this is more related to the variable stripe size of RAIDZ than to
the fdisk MBR. I say this because the disk works without any issues in a
mirror configuration, or standalone, reaching 80 MB/s burst transfer rates.
In RAIDZ, however, the transfer rates are on the order of KB/s. Of course,
the variable stripe size cannot respect any I/O alignment when the disk's
firmware does not expose the real 4K sector size, and thus the performance
is horrible.

Besides, the disk has an EFI label:

Total disk size is 60800 cylinders
Cylinder size is 32130 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End    Length    %
      =========   ======    ============  =====   ===    ======   ===
          1                 EFI               0   60800    60801   100

...that uses the whole disk (prtvtoc command output):

* /dev/rdsk/c12t0d0 partition map
*
* Dimensions:
*        512 bytes/sector
* 1953525168 sectors
* 1953525101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           34       222       255
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00         256 1953508495 1953508750
       8     11    00  1953508751      16384 1953525134

Though there's an MBR in there (you can check it out with dd), I know it
doesn't affect the alignment, because the usable slice starts at sector 256
and, being a multiple of 8, it maintains the 4K physical-sector alignment.
The situation would be different if the usable slice had started at sector
1, right after the MBR in sector 0. That's because logical sectors 0 through
7 belong to the same 4K physical sector, and shifting everything by an
offset of one would definitely alter the I/O.

To make this clearer for those reading about this for the first time: if
the logical and physical layouts are not aligned, an operation on one
logical stripe/cluster can partially touch a second physical sector,
deteriorating performance because of the read-modify-write overhead.

What would be awesome is to trace all of the I/O that ZFS does on the pool
and try to match it to the physical layout. I once saw someone run a DTrace
script that recorded all the accesses and then built an animation (black and
green sectors) showing the activity on the disk. An incredibly awesome piece
of work. I can't find that link right now, but that script would throw some
light on this.

Regards,
Leandro.
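[Editor's note: for anyone who wants to run the alignment check described
above on their own disks, a small sketch. It assumes 512-byte logical
sectors as reported by prtvtoc; the awk filter skips the "*" comment lines
and tests whether each slice's first sector is a multiple of 8.]

$ prtvtoc /dev/rdsk/c12t0d0 | awk '$1 != "*" && NF >= 6 {
      printf "slice %s starts at sector %s: %s\n", $1, $4,
             ($4 % 8 == 0) ? "4K-aligned" : "MISALIGNED"
  }'
slice 0 starts at sector 256: 4K-aligned
slice 8 starts at sector 1953508751: MISALIGNED

# Slice 8 is the tiny EFI reserved area, so its misalignment is harmless;
# what matters is that slice 0, where the pool data lives, is aligned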