Michael Kucharski
2007-Oct-09 10:29 UTC
[zfs-discuss] ZFS 60 second pause times to read 1K
Every day we see pause times of sometimes 60 seconds to read 1K of a file, for local reads as well as over NFS, in a test setup. We have an x4500 set up as a single 4*(raidz2 9+2) + 2 spare pool, with the file systems mounted over krb5 (Kerberos v5) NFS and also accessed directly. The pool is a 20TB pool and is using . There are three filesystems: backup, test and home. Test has about 20 million files and uses 4TB; the files range from 100B to 200MB. A cron job takes snapshots of test every 15 minutes, starting at 1 minute past the hour. Every 15 minutes, at 2 minutes past the hour, a cron batch job runs zfs send/recv to the backup filesystem. Home has only 100GB.

The test dir has 3 directories: 1 has 130,000 files, the other 2 have 10,000,000 each. We run 4 processes, 2 over NFS and 2 local: 2 read the dir with 130,000 files, the other 2 read the dir with 10,000,000. Every 35 seconds each process reads 1K at a 64K offset of 10 files and records the latency, then reads for 1 second and records the throughput.

At times of no other activity (outside the snapshot and send/recv windows) we see read latencies of up to 60 seconds, maybe once a day, at random times. We are running an unpatched Solaris 10 08/07 build. Pauses this long can lead to timeouts and cause jobs to fail, which is problematic for us. Is this expected behaviour? Can anything be done to mitigate or diagnose the issue? This message posted from opensolaris.org
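The latency probe described above could be sketched roughly as follows. This is a hypothetical reconstruction, not the poster's actual script; it assumes GNU-style `date` with nanosecond support (`%N`), and the file path is illustrative.

```shell
# Read 1K at a 64K offset of a file and report the read latency in ms.
# Hypothetical sketch of the probe described in the post.
probe_read() {
    f="$1"
    start=$(date +%s%N)                                  # nanoseconds before the read
    dd if="$f" of=/dev/null bs=1k count=1 skip=64 2>/dev/null
    end=$(date +%s%N)                                    # nanoseconds after the read
    echo "$f $(( (end - start) / 1000000 )) ms"
}
```

Run against each of the 10 sample files every 35 seconds from cron or a loop; a 60-second result from this probe corresponds to the stalls reported above.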
Are there any clues in the logs? I have had a similar problem when a disk bad block was uncovered by ZFS. I've also seen this when using the Silicon Image driver without the recommended patch. The former became evident when I ran a scrub: I saw SCSI timeout errors pop up in the "kern" syslogs. I solved it by replacing the disk. Gary
Excellent.

Oct 9 13:36:01 zeta1 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@2,0 (sd13):
Oct 9 13:36:01 zeta1    Error for Command: read    Error Level: Retryable

Scrubbing now. Big thanks ggendel.
Michael <m.kucharski <at> bigfoot.com> writes:
> Excellent.
>
> Oct 9 13:36:01 zeta1 scsi: [ID 107833 kern.warning] WARNING:
> /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@2,0 (sd13):
> Oct 9 13:36:01 zeta1    Error for Command: read    Error Level: Retryable
>
> Scrubbing now.

This is only a part of the complete error message. Look a few lines above this one. If you see something like:

sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci8086,25ae@1c/pci11ab,11ab@2: port 1: device reset
sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci8086,25ae@1c/pci11ab,11ab@2: port 1: link lost
sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci8086,25ae@1c/pci11ab,11ab@2: port 1: link established
marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 1:
marvell88sx: [ID 517869 kern.info] device disconnected
marvell88sx: [ID 517869 kern.info] device connected

then it means you are probably affected by bug http://bugs.opensolaris.org/view_bug.do?bug_id=6587133

This bug is fixed in Solaris Express build 73 and above, and will likely be fixed in Solaris 10 Update 5. The workaround is to disable SATA NCQ and queuing by adding "set sata:sata_func_enable = 0x4" to /etc/system.

-marc
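The workaround Marc describes is a one-line addition to /etc/system followed by a reboot. A minimal sketch of applying it idempotently, parameterized on the config path so it can be dry-run against a scratch file first (the function name is illustrative; run as root against the real /etc/system):

```shell
# Append the NCQ-disable tunable from bug 6587133 to the given config
# file (default /etc/system) if it is not already present.
# A reboot is required for the setting to take effect.
add_workaround() {
    cfg="${1:-/etc/system}"
    grep -q 'sata_func_enable' "$cfg" 2>/dev/null || \
        echo 'set sata:sata_func_enable = 0x4' >> "$cfg"
}
```

Guarding with `grep -q` avoids stacking duplicate entries if the script is run more than once.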
Thanks. Looks like I have this bug. Is it a hardware problem combined with a software problem?

Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 9 09:35:43 zeta1    port 3: device reset
Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1/disk@3,0 (sd18):
Oct 9 09:35:43 zeta1    Error for Command: read(10)    Error Level: Retryable
Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice]    Requested Block: 6063576    Error Block: 6063576
Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 9 09:35:43 zeta1    port 3: device reset
Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 9 09:35:43 zeta1    port 3: link lost
Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 9 09:35:43 zeta1    port 3: link established
Oct 9 09:35:43 zeta1 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx4: error on port 3:
Oct 9 09:35:43 zeta1 marvell88sx: [ID 517869 kern.info]    device disconnected
Oct 9 09:35:43 zeta1 marvell88sx: [ID 517869 kern.info]    device connected
I'm not sure. But when I re-ran the scrub, I got the errors at the same block numbers, which indicated that the disk was really bad. It wouldn't hurt to make the entry in the /etc/system file, reboot, and then try the scrub again. If the problem disappears, then it is a driver bug. Gary
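Gary's distinction (driver bug vs. genuinely failing disk) can be checked by counting the reset/disconnect pattern in the syslog before and after the workaround. A minimal sketch, assuming the standard Solaris log location (the function name is illustrative):

```shell
# Count syslog lines matching the SATA reset/disconnect pattern from
# the bug report. Compare the count across scrubs: counts that keep
# growing at the same block numbers point at the disk, not the driver.
scan_sata_errors() {
    log="${1:-/var/adm/messages}"
    grep -E -c 'device reset|link lost|device disconnected' "$log"
}
```

Pair this with `zpool status -v` output after each scrub to see whether the error blocks repeat.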
Michael Kucharski wrote:
> We have a x4500 setup as a single 4*(raidz2 9+2) + 2 spare pool and have
> the file systems mounted over v5 krb5 NFS and accessed directly. The pool
> is a 20TB pool and is using . There are three filesystems, backup, test
> and home. Test has about 20 million files and uses 4TB. These files range
> from 100B to 200MB. Test has a cron job to take snapshots every 15 minutes
> from 1m on the hour. Every 15min at 2min on the hour a cron batch job runs
> to zfs send/recv to the backup filesystem. Home has only 100GB.

Are you doing this send|recv within the same pool? What's the motivation for that? Can't you just use "zfs clone", which would be much faster and use less disk space? Or if you want another copy (which seems unlikely, since you can already tolerate any 2 disks failing in your pool), then use "zfs set copies=2 <fs>".

--matt
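Matt's two alternatives can be sketched side by side. The pool, filesystem, and snapshot names below are illustrative, and the commands are printed in dry-run form rather than executed, since they would otherwise act on a real pool:

```shell
# Print (dry-run) the two same-pool alternatives to zfs send/recv:
# a writable clone of a snapshot, or per-block duplication via copies=2.
show_alternatives() {
    pool="$1"; fs="$2"; snap="$3"
    # zfs clone is near-instant and shares unchanged blocks with the snapshot
    echo "zfs clone ${pool}/${fs}@${snap} ${pool}/backup"
    # copies=2 stores two copies of every newly written block in the same pool
    echo "zfs set copies=2 ${pool}/${fs}"
}
```

Note the trade-off Matt implies: a clone consumes space only as it diverges from the snapshot, while copies=2 doubles the space of all data written after the property is set.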
I've got the box on eval and am just putting it through its paces. Ideally I would be replicating to another x4500, but I don't have another one and didn't want to use 22 disks for another pool.