I installed b87 today and then I made a copy of a directory.

To my surprise, a few seconds later the drive access light went out. Upon inspection, only a couple of the files had been copied, and the cp command appeared to have hung.

I did: cp -r dir1 dir2

ps -el showed that cp was still running but not taking much, if any, processor time. Nothing changed after a minute or so, so I tried to kill the cp process with -15, then a -9. Even that failed.

Using the file manager (Nautilus?), all I saw was a spinning logo and no files when I tried to access the directory I copied the data to.

Thinking this was a bug in b87, I rolled back to b85. Back in b85, I tried the same thing, and again, it failed in exactly the same way. So it was nothing to do with b87, but in fact something funny to do with my files. I did a scrub and saw no errors.

What I now need to do is diagnose the problem. What do people here suggest I do next to find the root cause? I suspect there's something screwy with the files in the source directory -- some funny attribute somewhere (I remember some bug in b82 with file dates around 1970 or something). Luckily this directory didn't contain many files and I have a copy anyway, and the data is not critical.

---------------------------------
Blog: http://breden.org.uk
---------------------------------

This message posted from opensolaris.org
Simon Breden wrote:
> I installed b87 today and then I made a copy of a directory.
>
> To my surprise, a few seconds later the drive access light went out. Upon inspection, only a couple of the files had been copied, and the cp command appeared to have hung.
>
> I did: cp -r dir1 dir2
>
> ps -el showed that cp was still running but not taking much, if any, processor time. Nothing changed after a minute or so, so I tried to kill the cp process with -15, then a -9. Even that failed.

This is usually a symptom of broken hardware. The reason kill -9 didn't work is that the process is waiting for the I/O to complete, and it hasn't completed yet. When the I/O completes or times out, then the process will start, get the kill signal and die. The timeout may be on the order of 3-5 minutes for simple disks, though it may be longer for some configurations.

Take a look at "fmdump -e" to see if there are errors being logged.

[note to self: file RFE for the system log viewer to grab fmdump output]
 -- richard

> Using the file manager (Nautilus?), all I saw was a spinning logo and no files when I tried to access the directory I copied the data to.
>
> Thinking this was a bug in b87, I rolled back to b85. Back in b85, I tried the same thing, and again, it failed in exactly the same way. So it was nothing to do with b87, but in fact something funny to do with my files. I did a scrub and saw no errors.
>
> What I now need to do is diagnose the problem. What do people here suggest I do next to find the root cause? I suspect there's something screwy with the files in the source directory -- some funny attribute somewhere (I remember some bug in b82 with file dates around 1970 or something). Luckily this directory didn't contain many files and I have a copy anyway, and the data is not critical.
I don't like the sound of broken hardware :(

I did the cp -r dir1 dir2 again and when it hung I issued 'fmdump -e' like you said -- here is the output:

# fmdump -e
TIME                 CLASS
fmdump: /var/fm/fmd/errlog is empty
#

I also checked /var/adm/messages and I didn't see anything in there either. So I don't know what else I can do to understand what's happening. Any ideas?
> I did the cp -r dir1 dir2 again and when it hung

When it's hung, can you type:

iostat -xce 1

in another window? Is there a 100 in the %b column?

When you reset and try the cp again, and look at iostat -xce 1 on the second hang, is the same disk at 100 in %b?

If all your windows are hung, does your keyboard's Num Lock LED track the Num Lock key?

Rob
Sorry for the delay. Here is the output for a couple of seconds:

# iostat -xce 1
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   1.5  0.7  20.8   4.2  0.0  0.0    9.0   0   1   0   0   0   0   2  1  0 98
sd0     0.3  0.2  29.4  25.1  0.1  0.0  174.5   0   0   0   0   0   0
sd1     0.3  0.2  33.2  25.0  0.1  0.0  166.8   0   0   0   0   0   0
sd2     0.2  0.2  26.8  24.8  0.1  0.0  180.3   0   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   0  0  0 100
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   1  1  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   2  1  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0  26  3  0 70
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   2  0  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  1.0   0.0   2.0  0.0  0.0    0.1   0   0   0   0   0   0   1  1  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0  18  1  0 81
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   6  1  0 94
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  2.0   0.0   1.5  0.0  0.0    0.2   0   0   0   0   0   0   1  1  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   1  0  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   1  0  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   1  0  0 98
sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5
^C#

It seems like ZFS filesystem accesses have hung up forever now... time to reboot... will investigate more, but I have no clue what to do next...
Simon Breden wrote:
> Sorry for the delay. Here is the output for a couple of seconds:

This is the smoking gun...

> # iostat -xce 1
>                  extended device statistics       ---- errors ---   cpu
> device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
> cmdk0   1.5  0.7  20.8   4.2  0.0  0.0    9.0   0   1   0   0   0   0   2  1  0 98
> sd0     0.3  0.2  29.4  25.1  0.1  0.0  174.5   0   0   0   0   0   0
> sd1     0.3  0.2  33.2  25.0  0.1  0.0  166.8   0   0   0   0   0   0
> sd2     0.2  0.2  26.8  24.8  0.1  0.0  180.3   0   0   0   0   0   0
> sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5

I/O is moving, but not much activity...

>                  extended device statistics       ---- errors ---   cpu
> device  r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
> cmdk0   0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   0   0   0   0  0  0 100
> sd0     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
> sd1     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
> sd2     0.0  0.0   0.0   0.0 35.0  0.0    0.0 100   0   0   0   0   0
> sd3     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0   0   5   0   5

I/O is stuck at the device. %w[ait] of 100 means that something is stuck on the device. This is probably not a ZFS issue, per se, but somehow the ZFS workload triggers it.
 -- richard
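[For anyone following along: the wedged pattern above (requests queued, %w pegged at 100, zero throughput) can be picked out of iostat output mechanically rather than by eye. A minimal sketch, assuming the `iostat -xce` column layout shown in this thread ($1=device, $6=wait, $9=%w); field positions are not guaranteed across iostat versions, so check your own output first:]

```shell
# Read 'iostat -xce 1' output on stdin and flag devices that look wedged:
# requests queued (wait > 0), queue reported 100% occupied (%w == 100).
# Field positions assume the -xce layout quoted above.
flag_wedged() {
    awk '$1 ~ /^(sd|cmdk)/ && $6 + 0 > 0 && $9 + 0 == 100 {
        print "wedged:", $1, "wait=" $6, "%w=" $9
    }'
}

# usage (on the live system): iostat -xce 1 | flag_wedged
```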
Hmm, three drives with 35 I/O requests in the queue and none active? Remind me not to buy a drive with that firmware...

1) upgrade the firmware in the drives, or
2) turn off NCQ with:

echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system

Rob
Thanks a lot Richard. To give a bit more info, I've copied my /var/adm/messages from booting up the machine -- see below.

And @picker: I guess the 35 requests are stacked up waiting for the hanging request to be serviced?

The question I have is where do I go from here to get some more info on what is causing cp to have problems.

I will now try another tack: use rsync to copy the directory to a disk outside the pool (i.e. my home directory on the boot drive), to see if it is happy doing that.

May 1 17:04:15 zfsbox su: [ID 810491 auth.crit] 'su root' failed for simon on /dev/pts/5
May 1 17:48:15 zfsbox genunix: [ID 540533 kern.notice] SunOS Release 5.11 Version snv_85 64-bit
May 1 17:48:15 zfsbox genunix: [ID 172908 kern.notice] Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
May 1 17:48:15 zfsbox Use is subject to license terms.
May 1 17:48:15 zfsbox unix: [ID 126719 kern.info] features: 13f6fff<cpuid,tscp,cmp,cx16,sse3,nx,asysc,sse2,sse,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
May 1 17:48:15 zfsbox unix: [ID 168242 kern.info] mem = 4192764K (0xffe7f000)
May 1 17:48:15 zfsbox rootnex: [ID 466748 kern.info] root nexus = i86pc
May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] pseudo0 at root
May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] pseudo0 is /pseudo
May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] scsi_vhci0 at root
May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] scsi_vhci0 is /scsi_vhci
May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] isa0 at root
May 1 17:48:15 zfsbox pci_autoconfig: [ID 139057 kern.info] NOTICE: reprogram io-range on ppb[0/f/0]: 0x2000 ~ 0x2fff
May 1 17:48:15 zfsbox pci_autoconfig: [ID 595143 kern.info] NOTICE: add io-range on subtractive ppb[0/6/0]: 0x3000 ~ 0x3fff
May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 0 1 1 1
May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 1 1 1 1
May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 2 1 1 1
May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 3 1 1 1
May 1 17:48:15 zfsbox pcplusmp: [ID 177709 kern.info] pcplusmp: vector 0x9 ioapic 0x4 intin 0x9 is bound to cpu 1
May 1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: acpippm0
May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] acpippm0 is /pseudo/acpippm@0
May 1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: ppm0
May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ppm0 is /pseudo/ppm@0
May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] npe0 at root: space 0 offset 0
May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] npe0 is /pci@0,0
May 1 17:48:15 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: ide (ata) instance 0 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 0
May 1 17:48:15 zfsbox genunix: [ID 640982 kern.info] IDE device at targ 0, lun 0 lastlun 0x0
May 1 17:48:15 zfsbox genunix: [ID 846691 kern.info] model HDS722516VLAT80
May 1 17:48:15 zfsbox genunix: [ID 479077 kern.info] ATA/ATAPI-6 supported, majver 0x7c minver 0x19
May 1 17:48:15 zfsbox genunix: [ID 640982 kern.info] ATAPI device at targ 1, lun 0 lastlun 0x0
May 1 17:48:15 zfsbox genunix: [ID 846691 kern.info] model TSSTcorpDVD-ROM SH-D162C
May 1 17:48:15 zfsbox npe: [ID 236367 kern.info] PCI Express-device: ide@0, ata0
May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ata0 is /pci@0,0/pci-ide@4/ide@0
May 1 17:48:15 zfsbox genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
May 1 17:48:15 zfsbox genunix: [ID 773945 kern.info] UltraDMA mode 5 selected
May 1 17:48:15 zfsbox gda: [ID 243001 kern.info] Disk0: <Vendor 'Gen-ATA ' Product 'HDS722516VLAT80 '>
May 1 17:48:17 zfsbox ata: [ID 496167 kern.info] cmdk0 at ata0 target 0 lun 0
May 1 17:48:17 zfsbox genunix: [ID 936769 kern.info] cmdk0 is /pci@0,0/pci-ide@4/ide@0/cmdk@0,0
May 1 17:48:20 zfsbox unix: [ID 190185 kern.info] SMBIOS v2.4 loaded (2169 bytes)
May 1 17:48:20 zfsbox genunix: [ID 408114 kern.info] /cpus (cpunex0) online
May 1 17:48:20 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: dld0
May 1 17:48:20 zfsbox genunix: [ID 936769 kern.info] dld0 is /pseudo/dld@0
May 1 17:48:20 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciexclass,060400 (pcie_pci) instance 1 vector 0x18 ioapic 0xff intin 0xff is bound to cpu 1
May 1 17:48:20 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci10de,377@f, pcie_pci1
May 1 17:48:20 zfsbox genunix: [ID 936769 kern.info] pcie_pci1 is /pci@0,0/pci10de,377@f
May 1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pci10de,163 (nvidia) instance 0 vector 0x10 ioapic 0x4 intin 0x10 is bound to cpu 0
May 1 17:48:21 zfsbox pcie_pci: [ID 586369 kern.info] PCIE-device: display@0, nvidia0
May 1 17:48:21 zfsbox genunix: [ID 936769 kern.info] nvidia0 is /pci@0,0/pci10de,377@f/display@0
May 1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: i8042 (i8042) instance 0 vector 0x1 ioapic 0x4 intin 0x1 is bound to cpu 1
May 1 17:48:21 zfsbox pcplusmp: [ID 444295 kern.info] pcplusmp: i8042 (i8042) instance #0 vector 0xc ioapic 0x4 intin 0xc is bound to cpu 1
May 1 17:48:21 zfsbox i8042: [ID 526150 kern.info] 8042 device: keyboard@0, kb8042 # 0
May 1 17:48:21 zfsbox genunix: [ID 936769 kern.info] kb80420 is /isa/i8042@1,60/keyboard@0
May 1 17:48:21 zfsbox i8042: [ID 526150 kern.info] 8042 device: mouse@1, mouse8042 # 0
May 1 17:48:21 zfsbox genunix: [ID 936769 kern.info] mouse80420 is /isa/i8042@1,60/mouse@1
May 1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0320 (ehci) instance 0 vector 0x14 ioapic 0x4 intin 0x14 is bound to cpu 0
May 1 17:48:23 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci1043,8239@2,1, ehci0
May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] ehci0 is /pci@0,0/pci1043,8239@2,1
May 1 17:48:23 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0310 (ohci) instance 0 vector 0x15 ioapic 0x4 intin 0x15 is bound to cpu 1
May 1 17:48:23 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci1043,8239@2, ohci0
May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] ohci0 is /pci@0,0/pci1043,8239@2
May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu0: x86 (chipid 0x0 AuthenticAMD 60FB2 family 15 model 107 step 2 clock 2110 MHz)
May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu0: AMD Athlon(tm) X2 Dual Core Processor BE-2350
May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu1: x86 (chipid 0x0 AuthenticAMD 60FB2 family 15 model 107 step 2 clock 2110 MHz)
May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu1: AMD Athlon(tm) X2 Dual Core Processor BE-2350
May 1 17:48:23 zfsbox unix: [ID 557827 kern.info] cpu1 initialization complete - online
May 1 17:48:23 zfsbox unix: [ID 212709 kern.info] workaround applied for cpu erratum #122
May 1 17:48:23 zfsbox unix: [ID 877799 kern.info] workaround applied for cpu issue #6336786
May 1 17:48:23 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci10de,370@6, pci_pci0
May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] pci_pci0 is /pci@0,0/pci10de,370@6
May 1 17:48:23 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0010 (hci1394) instance 0 vector 0x13 ioapic 0x4 intin 0x13 is bound to cpu 0
May 1 17:48:23 zfsbox pci_pci: [ID 370704 kern.info] PCI-device: pci1043,808b@b, hci13940
May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] hci13940 is /pci@0,0/pci10de,370@6/pci1043,808b@b
May 1 17:48:24 zfsbox unix: [ID 469452 kern.info] NOTICE: cpqhpc: 64-bit driver module not found
May 1 17:48:24 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: tzmon0
May 1 17:48:24 zfsbox genunix: [ID 936769 kern.info] tzmon0 is /pseudo/tzmon@0
May 1 17:48:26 zfsbox unix: [ID 469452 kern.info] NOTICE: cpqhpc: 64-bit driver module not found
May 1 17:48:27 zfsbox nge: [ID 801725 kern.info] NOTICE: nge0: Using FIXED interrupt type
May 1 17:48:27 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pci10de,373 (nge) instance 0 vector 0x16 ioapic 0x4 intin 0x16 is bound to cpu 1
May 1 17:48:27 zfsbox mac: [ID 469746 kern.info] NOTICE: nge0 registered
May 1 17:48:27 zfsbox nge: [ID 801725 kern.info] NOTICE: nge1: Using FIXED interrupt type
May 1 17:48:27 zfsbox pcplusmp: [ID 444295 kern.info] pcplusmp: pci10de,373 (nge) instance #1 vector 0x17 ioapic 0x4 intin 0x17 is bound to cpu 1
May 1 17:48:27 zfsbox mac: [ID 469746 kern.info] NOTICE: nge1 registered
May 1 17:48:27 zfsbox mac: [ID 469746 kern.info] NOTICE: aggr1 registered
May 1 17:48:28 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
May 1 17:48:28 zfsbox genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
May 1 17:48:29 zfsbox genunix: [ID 454863 kern.info] dump on /dev/dsk/c0d0s1 size 2055 MB
May 1 17:48:30 zfsbox mac: [ID 435574 kern.info] NOTICE: nge0 link up, 1000 Mbps, full duplex
May 1 17:48:30 zfsbox mac: [ID 435574 kern.info] NOTICE: aggr1 link up, 1000 Mbps, full duplex
May 1 17:48:30 zfsbox mac: [ID 435574 kern.info] NOTICE: nge1 link up, 1000 Mbps, full duplex
May 1 17:48:35 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: zfs0
May 1 17:48:35 zfsbox genunix: [ID 936769 kern.info] zfs0 is /pseudo/zfs@0
May 1 17:48:35 zfsbox unix: [ID 954099 kern.info] NOTICE: IRQ20 is being shared by drivers with different interrupt levels.
May 1 17:48:35 zfsbox This may result in reduced system performance.
May 1 17:48:36 zfsbox sata: [ID 663010 kern.info] /pci@0,0/pci1043,8239@5 :
May 1 17:48:36 zfsbox sata: [ID 761595 kern.info] SATA disk device at port 0
May 1 17:48:36 zfsbox sata: [ID 846691 kern.info] model WDC WD7500AAKS-00RBA0
May 1 17:48:36 zfsbox sata: [ID 693010 kern.info] firmware 30.04G30
May 1 17:48:36 zfsbox sata: [ID 163988 kern.info] serial number WD-WCAPT0506736
May 1 17:48:36 zfsbox sata: [ID 594940 kern.info] supported features:
May 1 17:48:36 zfsbox sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
May 1 17:48:36 zfsbox sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] Supported queue depth 32, limited to 1
May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] capacity = 1465149168 sectors
May 1 17:48:36 zfsbox scsi: [ID 193665 kern.info] sd0 at nv_sata0: target 0 lun 0
May 1 17:48:36 zfsbox genunix: [ID 936769 kern.info] sd0 is /pci@0,0/pci1043,8239@5/disk@0,0
May 1 17:48:36 zfsbox genunix: [ID 408114 kern.info] /pci@0,0/pci1043,8239@5/disk@0,0 (sd0) online
May 1 17:48:36 zfsbox sata: [ID 663010 kern.info] /pci@0,0/pci1043,8239@5 :
May 1 17:48:36 zfsbox sata: [ID 761595 kern.info] SATA disk device at port 1
May 1 17:48:36 zfsbox sata: [ID 846691 kern.info] model WDC WD7500AAKS-00RBA0
May 1 17:48:36 zfsbox sata: [ID 693010 kern.info] firmware 30.04G30
May 1 17:48:36 zfsbox sata: [ID 163988 kern.info] serial number WD-WCAPT0506630
May 1 17:48:36 zfsbox sata: [ID 594940 kern.info] supported features:
May 1 17:48:36 zfsbox sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
May 1 17:48:36 zfsbox sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] Supported queue depth 32, limited to 1
May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] capacity = 1465149168 sectors
May 1 17:48:36 zfsbox scsi: [ID 193665 kern.info] sd1 at nv_sata0: target 1 lun 0
May 1 17:48:36 zfsbox genunix: [ID 936769 kern.info] sd1 is /pci@0,0/pci1043,8239@5/disk@1,0
May 1 17:48:36 zfsbox genunix: [ID 408114 kern.info] /pci@0,0/pci1043,8239@5/disk@1,0 (sd1) online
May 1 17:48:36 zfsbox unix: [ID 469452 kern.info] NOTICE: ncrs: 64-bit driver module not found
May 1 17:48:36 zfsbox unix: [ID 469452 kern.info] NOTICE: hpfc: 64-bit driver module not found
May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: adp: 64-bit driver module not found
May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: cadp: 64-bit driver module not found
May 1 17:48:37 zfsbox pcplusmp: [ID 444295 kern.info] pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 0
May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: mscsi: 64-bit driver module not found
May 1 17:48:37 zfsbox scsi: [ID 193665 kern.info] sd3 at ata0: target 1 lun 0
May 1 17:48:37 zfsbox genunix: [ID 936769 kern.info] sd3 is /pci@0,0/pci-ide@4/ide@0/sd@1,0
May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: symhisl: 64-bit driver module not found
May 1 17:48:37 zfsbox unix: [ID 954099 kern.info] NOTICE: IRQ21 is being shared by drivers with different interrupt levels.
May 1 17:48:37 zfsbox This may result in reduced system performance.
May 1 17:48:37 zfsbox sata: [ID 663010 kern.info] /pci@0,0/pci1043,8239@5,1 :
May 1 17:48:37 zfsbox sata: [ID 761595 kern.info] SATA disk device at port 0
May 1 17:48:37 zfsbox sata: [ID 846691 kern.info] model WDC WD7500AAKS-00RBA0
May 1 17:48:37 zfsbox sata: [ID 693010 kern.info] firmware 30.04G30
May 1 17:48:37 zfsbox sata: [ID 163988 kern.info] serial number WD-WCAPT0592269
May 1 17:48:37 zfsbox sata: [ID 594940 kern.info] supported features:
May 1 17:48:37 zfsbox sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
May 1 17:48:37 zfsbox sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
May 1 17:48:37 zfsbox sata: [ID 349649 kern.info] Supported queue depth 32, limited to 1
May 1 17:48:37 zfsbox sata: [ID 349649 kern.info] capacity = 1465149168 sectors
May 1 17:48:37 zfsbox scsi: [ID 193665 kern.info] sd2 at nv_sata1: target 0 lun 0
May 1 17:48:37 zfsbox genunix: [ID 936769 kern.info] sd2 is /pci@0,0/pci1043,8239@5,1/disk@0,0
May 1 17:48:37 zfsbox genunix: [ID 408114 kern.info] /pci@0,0/pci1043,8239@5,1/disk@0,0 (sd2) online
May 1 17:48:37 zfsbox unix: [ID 954099 kern.info] NOTICE: IRQ22 is being shared by drivers with different interrupt levels.
May 1 17:48:37 zfsbox This may result in reduced system performance.
May 1 17:48:39 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: pm0
May 1 17:48:39 zfsbox genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
May 1 17:48:40 zfsbox rootnex: [ID 349649 kern.info] xsvc0 at root: space 0 offset 0
May 1 17:48:40 zfsbox genunix: [ID 936769 kern.info] xsvc0 is /xsvc@0,0
May 1 17:48:40 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: power0
May 1 17:48:40 zfsbox genunix: [ID 936769 kern.info] power0 is /pseudo/power@0
May 1 17:48:40 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: srn0
May 1 17:48:40 zfsbox genunix: [ID 936769 kern.info] srn0 is /pseudo/srn@0
May 1 17:48:40 zfsbox /usr/lib/power/powerd: [ID 387247 daemon.error] Able to open /dev/srn
May 1 17:49:20 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: pool0
May 1 17:49:20 zfsbox genunix: [ID 936769 kern.info] pool0 is /pseudo/pool@0
May 1 17:49:35 zfsbox smbsrv: [ID 852980 kern.notice] NOTICE: SmbOplocks: disabled
May 1 17:49:35 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: dtrace0
May 1 17:49:35 zfsbox genunix: [ID 936769 kern.info] dtrace0 is /pseudo/dtrace@0
May 1 17:49:38 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
May 1 17:49:38 zfsbox genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
May 1 17:49:45 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: nvidia255
May 1 17:49:45 zfsbox genunix: [ID 936769 kern.info] nvidia255 is /pseudo/nvidia@255
max at bruningsystems.com
2008-May-01 18:23 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

Simon Breden wrote:
> Thanks a lot Richard. To give a bit more info, I've copied my /var/adm/messages from booting up the machine.
>
> And @picker: I guess the 35 requests are stacked up waiting for the hanging request to be serviced?
>
> The question I have is where do I go from here to get some more info on what is causing cp to have problems.
>
> I will now try another tack: use rsync to copy the directory to a disk outside the pool (i.e. my home directory on the boot drive), to see if it is happy doing that.

What does truss show the cp doing?

max
This list seems out of sync (delayed) with the email messages I receive. Why is that? Which are the best tools to use when reading/replying to these posts?

Anyway, from my email I can see that Max has sent me a question about truss -- here is my reply:

Hi Max,

I haven't used truss before, but give me the command line + switches and I'll be happy to run it.

Simon
max at bruningsystems.com
2008-May-01 19:20 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

Simon Breden wrote:
> Hi Max,
>
> I haven't used truss before, but give me the command line + switches
> and I'll be happy to run it.
>
> Simon

# truss -p pid_from_cp

where pid_from_cp is... the pid of the cp process that is "hung". The pid you can get from ps. I am curious if the cp is stuck on a specific file, or is just very slow, or is hung in the kernel. Also, can you kill the cp when it hangs?

thanks,
max
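[Aside, for readers of the archive: before even attaching truss, the process table already hints at whether cp is stuck in the kernel. A hedged sketch; `cp` as the target name is just this thread's example, and `-o s=`/`-o wchan=` are standard ps output options (on Solaris a process blocked in kernel I/O generally shows as S with a non-empty wait channel; Linux would show D):]

```shell
# Report the scheduler state and kernel wait channel of a process.
# A process wedged in uninterruptible I/O keeps its state/wchan fixed
# across repeated calls, consistent with kill -9 having no effect.
check_proc_state() {
    pid=$1
    state=$(ps -o s= -p "$pid" | tr -d ' ')
    wchan=$(ps -o wchan= -p "$pid" | tr -d ' ')
    echo "pid $pid state: $state wchan: $wchan"
}

# usage (on the live system): check_proc_state "$(pgrep -ox cp)"
```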
This mailing list seems broken and out of sync -- your post shows as 'Guest' and appears as a new post in the main zfs-discuss list -- and the main thread is out of sync with the replies, and I just got a Java exception trying to post to the main thread -- what's going on here?
Hi Max,

I re-ran the cp command and when it hung I ran 'ps -el', looked up the cp command, got its PID and then ran:

# truss -p PID_of_cp

and it output nothing at all -- i.e. it hung too -- just showing a flashing cursor.

The system is still operational as I am typing into the browser.

Before I ran the cp command I did a 'tail -f /var/adm/messages' and there is no output. I also did a 'tail -f /var/log/syslog' and there is also no output.

If I try 'kill -15 PID_of_cp' and then 'ps -el', cp is still running.
And if I try 'kill -9 PID_of_cp' and then 'ps -el', cp is still running.

What next?

For what it's worth, here is further output from iostat. The first interval looks interesting, but it appears to be buffered, as it's the same each time I run it -- possibly the last info from cp before it died??? (see below)

So, any ideas what I can do next? Just going to reboot to kill the hanging cp -- wow, this is like Windows now :( (attempt to keep humor)

bash-3.2$ iostat -xce 1
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0  21.4  2.7  191.4   32.4  0.4  0.1   20.8   3   5   0   0   0   0   4  2  0 94
sd0     9.9 12.6 1193.8 1478.2 11.0  0.0  489.0  32   4   0   0   0   0
sd1     7.5 12.6  836.2 1479.1 10.9  0.0  543.7  32   4   0   0   0   0
sd2     7.7 12.6  877.1 1479.4 10.9  0.0  537.1  32   4   0   0   0   0
sd3     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0   1  1  0 98
sd0     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0   2  0  0 98
sd0     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0   1  1  0 98
sd0     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0   2  0  0 98
sd0     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---   cpu
device  r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0   4  1  0 96
sd0     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd1     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd2     0.0  0.0    0.0    0.0 35.0  0.0    0.0 100   0   0   0   0   0
sd3     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   5   0   5
^C
bash-3.2$

and I just saw this (seems to be from yesterday -- I don't know if it's relevant to this though??):
# fmdump -e
TIME                 CLASS
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum
Apr 30 01:12:14.9896 ereport.fs.zfs.data
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum
Apr 30 01:12:14.9802 ereport.fs.zfs.data
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum
Apr 30 01:12:21.1130 ereport.fs.zfs.data
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum
Apr 30 01:12:21.5217 ereport.fs.zfs.data
#
This message posted from opensolaris.org
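Since fmdump is logging zfs checksum and data ereports, a natural follow-up is to ask the pool itself what it knows. A hedged sketch (the guard keeps it a no-op on hosts without the ZFS tools): `zpool status -v` reports the per-vdev READ/WRITE/CKSUM counters and, when there are persistent errors, lists the affected files.

```shell
# Show pool health, per-device checksum counters, and -- with -v -- any
# files the pool has recorded permanent errors for.
if command -v zpool >/dev/null 2>&1; then
    zpool status -v
else
    echo "zpool not found on this host"
fi
echo "pool status check done"
```

If the CKSUM column is non-zero on exactly one disk, that disk (or its cable/controller port) becomes the prime suspect.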
max at bruningsystems.com
2008-May-01 20:22 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

Simon Breden wrote:
> Hi Max,
>
> I re-ran the cp command and when it hung I ran 'ps -el', looked up the cp command, got its PID, and then ran:
>
> # truss -p PID_of_cp
>
> and it output nothing at all -- i.e. it hung too -- just showing a flashing cursor.
>
> The system is still operational as I am typing into the browser.
>
> Before I ran the cp command I did a 'tail -f /var/adm/messages' and there is no output. I also did a 'tail -f /var/log/syslog' and there is also no output.
>
> If I try 'kill -15 PID_of_cp' and then 'ps -el', cp is still running.
> And if I try 'kill -9 PID_of_cp' and then 'ps -el', cp is still running.
>
> What next?

You can try the following:

# mdb -k
::pgrep cp          <-- this should give you a line with the cp you are running.
                        Next to "cp" is an address; use this address in the next line:
address_from_pgrep::walk thread | ::threadlist -v

This will give you a stack trace. Please post it.

$q                  <-- this gets you out of mdb

max
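Max's interactive mdb session can also be scripted by feeding the dcmds on stdin -- handy if the desktop is wedged and you only have one working shell. A sketch under the assumptions that the hung command is still named cp and that you have the privileges mdb -k needs (the guard makes it harmless elsewhere):

```shell
# Non-interactive version of the mdb session: find the cp proc address
# from ::pgrep output, then dump its kernel thread stack in one shot.
if command -v mdb >/dev/null 2>&1; then
    addr=$(echo '::pgrep cp' | mdb -k 2>/dev/null | awk '/ cp$/ {print $(NF-1)}')
    if [ -n "$addr" ]; then
        echo "${addr}::walk thread | ::threadlist -v" | mdb -k
    else
        echo "no cp process found (or mdb -k needs root)"
    fi
else
    echo "mdb not available (Solaris-specific)"
fi
echo "mdb stack capture done"
```

The awk picks the second-to-last field (ADDR) from the line whose NAME column is cp, mirroring the manual copy/paste step.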
Keep getting Java exceptions posting to the proper thread for this -- just lost an hour -- WTF??? Had to reply to my own post, as Max's reply (which I saw in my email inbox) has not appeared here. Again, what is wrong with this forum software? It seems so buggy, or am I missing something here? I'm quite prepared to be wrong, but I'm getting really p*ssed off with this forum engine, as I'm wasting time and getting Java errors about a missing parent, which I will post if I get more errors trying to post -- if the broken forum software allows me to.

Thanks for your advice Max, and here is my reply to your suggestion:

# mdb -k
Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba s1394 nca lofs zfs random md sppp smbsrv nfs ptm ipc crypto ]
> ::pgrep cp
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R    889    868    889    868    501 0x4a004000 ffffff01deca9048 cp
> ffffff01deca9048::walk thread | ::threadlist -v
ADDR             PROC             LWP              CLS PRI            WCHAN
ffffff01e0045840 ffffff01deca9048 ffffff01de9d9210   2  60 ffffff01d861ca80
  PC: _resume_from_idle+0xf1    CMD: cp -pr testdir z1
  stack pointer for thread ffffff01e0045840: ffffff0007fcdf00
  [ ffffff0007fcdf00 _resume_from_idle+0xf1() ]
    swtch+0x17f()
    cv_wait+0x61()
    zio_wait+0x5f()
    dbuf_read+0x1b5()
    dbuf_findbp+0xe8()
    dbuf_prefetch+0x9b()
    dmu_zfetch_fetch+0x43()
    dmu_zfetch_dofetch+0xc2()
    dmu_zfetch_find+0x3a1()
    dmu_zfetch+0xa5()
    dbuf_read+0xe3()
    dmu_buf_hold_array_by_dnode+0x1c4()
    dmu_read+0xd4()
    zfs_fillpage+0x15e()
    zfs_getpage+0x187()
    fop_getpage+0x9f()
    segvn_fault+0x9ef()
    as_fault+0x5ae()
    pagefault+0x95()
    trap+0x1286()
    0xfffffffffb8001d9()
    fuword8+0x21()
    zfs_write+0x147()
    fop_write+0x69()
    write+0x2af()
    write32+0x1e()
    sys_syscall32+0x101()
>
Just to reduce my stress levels, and to give the webmaster some useful info to help fix this broken forum: I tried posting a reply to the main thread for 'cp -r hanged copying a directory' and got the following error -- at a guess, it seems like it can't find the parent thread/message's id in the database. I got this multiple times when trying to post, and since the message gives no information on how to contact the system administrator, here are the message and exception stack trace:

An error in the system has occurred. Please contact the system administrator if the problem persists.
type: java.lang.IllegalArgumentException

java.lang.IllegalArgumentException: Parent key 230968 not found when adding child 231029.
at com.jivesoftware.util.LongTree.addChild(Unknown Source)
at com.jivesoftware.forum.database.DbTreeWalker.addChild(Unknown Source)
at com.jivesoftware.forum.database.DbForumThread.addMessage(Unknown Source)
at com.jivesoftware.forum.proxy.ForumThreadProxy.addMessage(Unknown Source)
at org.opensolaris.jive.action.ForumPostAction.createMessage(Unknown Source)
at com.jivesoftware.forum.action.PostAction.execute(Unknown Source)
at com.jivesoftware.forum.action.PostAction.doPost(Unknown Source)
at sun.reflect.GeneratedMethodAccessor881.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at com.opensymphony.xwork.DefaultActionInvocation.invokeAction(DefaultActionInvocation.java:300)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:166)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.jivesoftware.forum.action.JiveExceptionInterceptor.intercept(Unknown Source)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.jivesoftware.base.action.JiveObjectLoaderInterceptor.intercept(Unknown Source)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.webwork.interceptor.FileUploadInterceptor.intercept(FileUploadInterceptor.java:71)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at com.opensymphony.xwork.DefaultActionProxy.execute(DefaultActionProxy.java:116)
at com.opensymphony.webwork.dispatcher.ServletDispatcher.serviceAction(ServletDispatcher.java:272)
at com.jivesoftware.base.util.JiveWebWorkServlet.service(Unknown Source)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at sun.reflect.GeneratedMethodAccessor156.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:275)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:245)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:50)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:156)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:152)
at com.jivesoftware.util.SetResponseCharacterEncodingFilter.doFilter(Unknown Source)
at sun.reflect.GeneratedMethodAccessor322.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:275)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:217)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:197)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:50)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:156)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:152)
at com.opensymphony.module.sitemesh.filter.PageFilter.parsePage(PageFilter.java:118)
at com.opensymphony.module.sitemesh.filter.PageFilter.doFilter(PageFilter.java:52)
at sun.reflect.GeneratedMethodAccessor321.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:275)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:217)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:197)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:50)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:156)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:152)
at com.jivesoftware.util.SetRequestCharacterEncodingFilter.doFilter(Unknown Source)
at sun.reflect.GeneratedMethodAccessor320.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:275)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:217)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:197)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:50)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:156)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:152)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:199)
at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:282)
at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:754)
at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:684)
at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:876)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)
max at bruningsystems.com
2008-May-01 22:03 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

Simon Breden wrote:
> Thanks for your advice Max, and here is my reply to your suggestion:
>
> # mdb -k
> Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba s1394 nca lofs zfs random md sppp smbsrv nfs ptm ipc crypto ]
>
>> ::pgrep cp
>
> S PID PPID PGID SID UID FLAGS ADDR NAME
> R 889 868 889 868 501 0x4a004000 ffffff01deca9048 cp
>
>> ffffff01deca9048::walk thread | ::threadlist -v
>
> ADDR PROC LWP CLS PRI WCHAN
> ffffff01e0045840 ffffff01deca9048 ffffff01de9d9210 2 60 ffffff01d861ca80
> PC: _resume_from_idle+0xf1 CMD: cp -pr testdir z1
> stack pointer for thread ffffff01e0045840: ffffff0007fcdf00
> [ ffffff0007fcdf00 _resume_from_idle+0xf1() ]
> swtch+0x17f()
> cv_wait+0x61()
> zio_wait+0x5f()
> dbuf_read+0x1b5()
> dbuf_findbp+0xe8()
> dbuf_prefetch+0x9b()
> dmu_zfetch_fetch+0x43()
> dmu_zfetch_dofetch+0xc2()
> dmu_zfetch_find+0x3a1()
> dmu_zfetch+0xa5()
> dbuf_read+0xe3()
> dmu_buf_hold_array_by_dnode+0x1c4()
> dmu_read+0xd4()
> zfs_fillpage+0x15e()
> zfs_getpage+0x187()
> fop_getpage+0x9f()
> segvn_fault+0x9ef()
> as_fault+0x5ae()
> pagefault+0x95()
> trap+0x1286()
> 0xfffffffffb8001d9()
> fuword8+0x21()
> zfs_write+0x147()
> fop_write+0x69()
> write+0x2af()
> write32+0x1e()
> sys_syscall32+0x101()

So, a write has been issued, and zfs is retrieving a page and waiting for the page-in to complete. I'll take a further look tomorrow, but maybe someone else reading this has an idea. (It is midnight here.)

max

> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Simon Breden wrote:
> Thanks a lot Richard. To give a bit more info, I've copied my /var/adm/messages from booting up the machine:
>
> And @picker: I guess the 35 requests are stacked up waiting for the hanging request to be serviced?
>
> The question I have is where do I go from now, to get some more info on what is causing cp to have problems.
>
> I will now try another tack: use rsync to copy the directory to a disk outside the pool (i.e. my home directory on the boot drive), to see if it is happy doing that.
>
> May 1 17:04:15 zfsbox su: [ID 810491 auth.crit] 'su root' failed for simon on /dev/pts/5
> May 1 17:48:15 zfsbox genunix: [ID 540533 kern.notice] SunOS Release 5.11 Version snv_85 64-bit
> May 1 17:48:15 zfsbox genunix: [ID 172908 kern.notice] Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
> May 1 17:48:15 zfsbox Use is subject to license terms.
> May 1 17:48:15 zfsbox unix: [ID 126719 kern.info] features: 13f6fff<cpuid,tscp,cmp,cx16,sse3,nx,asysc,sse2,sse,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
> May 1 17:48:15 zfsbox unix: [ID 168242 kern.info] mem = 4192764K (0xffe7f000)
> May 1 17:48:15 zfsbox rootnex: [ID 466748 kern.info] root nexus = i86pc
> May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] pseudo0 at root
> May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] pseudo0 is /pseudo
> May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] scsi_vhci0 at root
> May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] scsi_vhci0 is /scsi_vhci
> May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] isa0 at root
> May 1 17:48:15 zfsbox pci_autoconfig: [ID 139057 kern.info] NOTICE: reprogram io-range on ppb[0/f/0]: 0x2000 ~ 0x2fff
> May 1 17:48:15 zfsbox pci_autoconfig: [ID 595143 kern.info] NOTICE: add io-range on subtractive ppb[0/6/0]: 0x3000 ~ 0x3fff
> May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 0 1 1 1
> May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 1 1 1 1
> May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 2 1 1 1
> May 1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 3 1 1 1
> May 1 17:48:15 zfsbox pcplusmp: [ID 177709 kern.info] pcplusmp: vector 0x9 ioapic 0x4 intin 0x9 is bound to cpu 1
> May 1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: acpippm0
> May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] acpippm0 is /pseudo/acpippm@0
> May 1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: ppm0
> May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ppm0 is /pseudo/ppm@0
> May 1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] npe0 at root: space 0 offset 0
> May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] npe0 is /pci@0,0
> May 1 17:48:15 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: ide (ata) instance 0 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 0
> May 1 17:48:15 zfsbox genunix: [ID 640982 kern.info] IDE device at targ 0, lun 0 lastlun 0x0
> May 1 17:48:15 zfsbox genunix: [ID 846691 kern.info] model HDS722516VLAT80
> May 1 17:48:15 zfsbox genunix: [ID 479077 kern.info] ATA/ATAPI-6 supported, majver 0x7c minver 0x19
> May 1 17:48:15 zfsbox genunix: [ID 640982 kern.info] ATAPI device at targ 1, lun 0 lastlun 0x0
> May 1 17:48:15 zfsbox genunix: [ID 846691 kern.info] model TSSTcorpDVD-ROM SH-D162C
> May 1 17:48:15 zfsbox npe: [ID 236367 kern.info] PCI Express-device: ide@0, ata0
> May 1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ata0 is /pci@0,0/pci-ide@4/ide@0
> May 1 17:48:15 zfsbox genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
> May 1 17:48:15 zfsbox genunix: [ID 773945 kern.info] UltraDMA mode 5 selected
> May 1 17:48:15 zfsbox gda: [ID 243001 kern.info] Disk0: <Vendor 'Gen-ATA ' Product 'HDS722516VLAT80 '>
> May 1 17:48:17 zfsbox ata: [ID 496167 kern.info] cmdk0 at ata0 target 0 lun 0
> May 1 17:48:17 zfsbox genunix: [ID 936769 kern.info] cmdk0 is /pci@0,0/pci-ide@4/ide@0/cmdk@0,0
> May 1 17:48:20 zfsbox unix: [ID 190185 kern.info] SMBIOS v2.4 loaded (2169 bytes)
> May 1 17:48:20 zfsbox genunix: [ID 408114 kern.info] /cpus (cpunex0) online
> May 1 17:48:20 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: dld0
> May 1 17:48:20 zfsbox genunix: [ID 936769 kern.info] dld0 is /pseudo/dld@0
> May 1 17:48:20 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciexclass,060400 (pcie_pci) instance 1 vector 0x18 ioapic 0xff intin 0xff is bound to cpu 1
> May 1 17:48:20 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci10de,377@f, pcie_pci1
> May 1 17:48:20 zfsbox genunix: [ID 936769 kern.info] pcie_pci1 is /pci@0,0/pci10de,377@f
> May 1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pci10de,163 (nvidia) instance 0 vector 0x10 ioapic 0x4 intin 0x10 is bound to cpu 0
> May 1 17:48:21 zfsbox pcie_pci: [ID 586369 kern.info] PCIE-device: display@0, nvidia0
> May 1 17:48:21 zfsbox genunix: [ID 936769 kern.info] nvidia0 is /pci@0,0/pci10de,377@f/display@0
> May 1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: i8042 (i8042) instance 0 vector 0x1 ioapic 0x4 intin 0x1 is bound to cpu 1
> May 1 17:48:21 zfsbox pcplusmp: [ID 444295 kern.info] pcplusmp: i8042 (i8042) instance #0 vector 0xc ioapic 0x4 intin 0xc is bound to cpu 1
> May 1 17:48:21 zfsbox i8042: [ID 526150 kern.info] 8042 device: keyboard@0, kb8042 # 0
> May 1 17:48:21 zfsbox genunix: [ID 936769 kern.info] kb80420 is /isa/i8042@1,60/keyboard@0
> May 1 17:48:21 zfsbox i8042: [ID 526150 kern.info] 8042 device: mouse@1, mouse8042 # 0
> May 1 17:48:21 zfsbox genunix: [ID 936769 kern.info] mouse80420 is /isa/i8042@1,60/mouse@1
> May 1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0320 (ehci) instance 0 vector 0x14 ioapic 0x4 intin 0x14 is bound to cpu 0
> May 1 17:48:23 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci1043,8239@2,1, ehci0
> May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] ehci0 is /pci@0,0/pci1043,8239@2,1
> May 1 17:48:23 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0310 (ohci) instance 0 vector 0x15 ioapic 0x4 intin 0x15 is bound to cpu 1
> May 1 17:48:23 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci1043,8239@2, ohci0
> May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] ohci0 is /pci@0,0/pci1043,8239@2
> May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu0: x86 (chipid 0x0 AuthenticAMD 60FB2 family 15 model 107 step 2 clock 2110 MHz)
> May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu0: AMD Athlon(tm) X2 Dual Core Processor BE-2350
> May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu1: x86 (chipid 0x0 AuthenticAMD 60FB2 family 15 model 107 step 2 clock 2110 MHz)
> May 1 17:48:23 zfsbox unix: [ID 950921 kern.info] cpu1: AMD Athlon(tm) X2 Dual Core Processor BE-2350
> May 1 17:48:23 zfsbox unix: [ID 557827 kern.info] cpu1 initialization complete - online
> May 1 17:48:23 zfsbox unix: [ID 212709 kern.info] workaround applied for cpu erratum #122
> May 1 17:48:23 zfsbox unix: [ID 877799 kern.info] workaround applied for cpu issue #6336786
> May 1 17:48:23 zfsbox npe: [ID 236367 kern.info] PCI Express-device: pci10de,370@6, pci_pci0
> May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] pci_pci0 is /pci@0,0/pci10de,370@6
> May 1 17:48:23 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0010 (hci1394) instance 0 vector 0x13 ioapic 0x4 intin 0x13 is bound to cpu 0
> May 1 17:48:23 zfsbox pci_pci: [ID 370704 kern.info] PCI-device: pci1043,808b@b, hci13940
> May 1 17:48:23 zfsbox genunix: [ID 936769 kern.info] hci13940 is /pci@0,0/pci10de,370@6/pci1043,808b@b
> May 1 17:48:24 zfsbox unix: [ID 469452 kern.info] NOTICE: cpqhpc: 64-bit driver module not found
> May 1 17:48:24 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: tzmon0
> May 1 17:48:24 zfsbox genunix: [ID 936769 kern.info] tzmon0 is /pseudo/tzmon@0
> May 1 17:48:26 zfsbox unix: [ID 469452 kern.info] NOTICE: cpqhpc: 64-bit driver module not found
> May 1 17:48:27 zfsbox nge: [ID 801725 kern.info] NOTICE: nge0: Using FIXED interrupt type
> May 1 17:48:27 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pci10de,373 (nge) instance 0 vector 0x16 ioapic 0x4 intin 0x16 is bound to cpu 1
> May 1 17:48:27 zfsbox mac: [ID 469746 kern.info] NOTICE: nge0 registered
> May 1 17:48:27 zfsbox nge: [ID 801725 kern.info] NOTICE: nge1: Using FIXED interrupt type
> May 1 17:48:27 zfsbox pcplusmp: [ID 444295 kern.info] pcplusmp: pci10de,373 (nge) instance #1 vector 0x17 ioapic 0x4 intin 0x17 is bound to cpu 1
> May 1 17:48:27 zfsbox mac: [ID 469746 kern.info] NOTICE: nge1 registered
> May 1 17:48:27 zfsbox mac: [ID 469746 kern.info] NOTICE: aggr1 registered
> May 1 17:48:28 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
> May 1 17:48:28 zfsbox genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
> May 1 17:48:29 zfsbox genunix: [ID 454863 kern.info] dump on /dev/dsk/c0d0s1 size 2055 MB
> May 1 17:48:30 zfsbox mac: [ID 435574 kern.info] NOTICE: nge0 link up, 1000 Mbps, full duplex
> May 1 17:48:30 zfsbox mac: [ID 435574 kern.info] NOTICE: aggr1 link up, 1000 Mbps, full duplex
> May 1 17:48:30 zfsbox mac: [ID 435574 kern.info] NOTICE: nge1 link up, 1000 Mbps, full duplex
> May 1 17:48:35 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: zfs0
> May 1 17:48:35 zfsbox genunix: [ID 936769 kern.info] zfs0 is /pseudo/zfs@0
> May 1 17:48:35 zfsbox unix: [ID 954099 kern.info] NOTICE: IRQ20 is being shared by drivers with different interrupt levels.
> May 1 17:48:35 zfsbox This may result in reduced system performance.
> May 1 17:48:36 zfsbox sata: [ID 663010 kern.info] /pci@0,0/pci1043,8239@5 :
> May 1 17:48:36 zfsbox sata: [ID 761595 kern.info] SATA disk device at port 0
> May 1 17:48:36 zfsbox sata: [ID 846691 kern.info] model WDC WD7500AAKS-00RBA0
> May 1 17:48:36 zfsbox sata: [ID 693010 kern.info] firmware 30.04G30
> May 1 17:48:36 zfsbox sata: [ID 163988 kern.info] serial number WD-WCAPT0506736
> May 1 17:48:36 zfsbox sata: [ID 594940 kern.info] supported features:
> May 1 17:48:36 zfsbox sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
> May 1 17:48:36 zfsbox sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
> May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] Supported queue depth 32, limited to 1
> May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] capacity = 1465149168 sectors
> May 1 17:48:36 zfsbox scsi: [ID 193665 kern.info] sd0 at nv_sata0: target 0 lun 0
> May 1 17:48:36 zfsbox genunix: [ID 936769 kern.info] sd0 is /pci@0,0/pci1043,8239@5/disk@0,0
> May 1 17:48:36 zfsbox genunix: [ID 408114 kern.info] /pci@0,0/pci1043,8239@5/disk@0,0 (sd0) online
> May 1 17:48:36 zfsbox sata: [ID 663010 kern.info] /pci@0,0/pci1043,8239@5 :
> May 1 17:48:36 zfsbox sata: [ID 761595 kern.info] SATA disk device at port 1
> May 1 17:48:36 zfsbox sata: [ID 846691 kern.info] model WDC WD7500AAKS-00RBA0
> May 1 17:48:36 zfsbox sata: [ID 693010 kern.info] firmware 30.04G30
> May 1 17:48:36 zfsbox sata: [ID 163988 kern.info] serial number WD-WCAPT0506630
> May 1 17:48:36 zfsbox sata: [ID 594940 kern.info] supported features:
> May 1 17:48:36 zfsbox sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
> May 1 17:48:36 zfsbox sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
> May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] Supported queue depth 32, limited to 1
> May 1 17:48:36 zfsbox sata: [ID 349649 kern.info] capacity = 1465149168 sectors
> May 1 17:48:36 zfsbox scsi: [ID 193665 kern.info] sd1 at nv_sata0: target 1 lun 0
> May 1 17:48:36 zfsbox genunix: [ID 936769 kern.info] sd1 is /pci@0,0/pci1043,8239@5/disk@1,0
> May 1 17:48:36 zfsbox genunix: [ID 408114 kern.info] /pci@0,0/pci1043,8239@5/disk@1,0 (sd1) online
> May 1 17:48:36 zfsbox unix: [ID 469452 kern.info] NOTICE: ncrs: 64-bit driver module not found
> May 1 17:48:36 zfsbox unix: [ID 469452 kern.info] NOTICE: hpfc: 64-bit driver module not found
> May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: adp: 64-bit driver module not found
> May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: cadp: 64-bit driver module not found
> May 1 17:48:37 zfsbox pcplusmp: [ID 444295 kern.info] pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 0
> May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: mscsi: 64-bit driver module not found
> May 1 17:48:37 zfsbox scsi: [ID 193665 kern.info] sd3 at ata0: target 1 lun 0
> May 1 17:48:37 zfsbox genunix: [ID 936769 kern.info] sd3 is /pci@0,0/pci-ide@4/ide@0/sd@1,0
> May 1 17:48:37 zfsbox unix: [ID 469452 kern.info] NOTICE: symhisl: 64-bit driver module not found
> May 1 17:48:37 zfsbox unix: [ID 954099 kern.info] NOTICE: IRQ21 is being shared by drivers with different interrupt levels.
> May 1 17:48:37 zfsbox This may result in reduced system performance.
> May 1 17:48:37 zfsbox sata: [ID 663010 kern.info] /pci@0,0/pci1043,8239@5,1 :
> May 1 17:48:37 zfsbox sata: [ID 761595 kern.info] SATA disk device at port 0
> May 1 17:48:37 zfsbox sata: [ID 846691 kern.info] model WDC WD7500AAKS-00RBA0
> May 1 17:48:37 zfsbox sata: [ID 693010 kern.info] firmware 30.04G30
> May 1 17:48:37 zfsbox sata: [ID 163988 kern.info] serial number WD-WCAPT0592269
> May 1 17:48:37 zfsbox sata: [ID 594940 kern.info] supported features:
> May 1 17:48:37 zfsbox sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
> May 1 17:48:37 zfsbox sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
> May 1 17:48:37 zfsbox sata: [ID 349649 kern.info] Supported queue depth 32, limited to 1
> May 1 17:48:37 zfsbox sata: [ID 349649 kern.info] capacity = 1465149168 sectors
> May 1 17:48:37 zfsbox scsi: [ID 193665 kern.info] sd2 at nv_sata1: target 0 lun 0
> May 1 17:48:37 zfsbox genunix: [ID 936769 kern.info] sd2 is /pci@0,0/pci1043,8239@5,1/disk@0,0
> May 1 17:48:37 zfsbox genunix: [ID 408114 kern.info] /pci@0,0/pci1043,8239@5,1/disk@0,0 (sd2) online
> May 1 17:48:37 zfsbox unix: [ID 954099 kern.info] NOTICE: IRQ22 is being shared by drivers with different interrupt levels.
> May 1 17:48:37 zfsbox This may result in reduced system performance.
> May 1 17:48:39 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: pm0
> May 1 17:48:39 zfsbox genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
> May 1 17:48:40 zfsbox rootnex: [ID 349649 kern.info] xsvc0 at root: space 0 offset 0
> May 1 17:48:40 zfsbox genunix: [ID 936769 kern.info] xsvc0 is /xsvc@0,0
> May 1 17:48:40 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: power0
> May 1 17:48:40 zfsbox genunix: [ID 936769 kern.info] power0 is /pseudo/power@0
> May 1 17:48:40 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: srn0
> May 1 17:48:40 zfsbox genunix: [ID 936769 kern.info] srn0 is /pseudo/srn@0
> May 1 17:48:40 zfsbox /usr/lib/power/powerd: [ID 387247 daemon.error] Able to open /dev/srn
> May 1 17:49:20 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: pool0
> May 1 17:49:20 zfsbox genunix: [ID 936769 kern.info] pool0 is /pseudo/pool@0
> May 1 17:49:35 zfsbox smbsrv: [ID 852980 kern.notice] NOTICE: SmbOplocks: disabled
> May 1 17:49:35 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: dtrace0
> May 1 17:49:35 zfsbox genunix: [ID 936769 kern.info] dtrace0 is /pseudo/dtrace@0
> May 1 17:49:38 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
> May 1 17:49:38 zfsbox genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
> May 1 17:49:45 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: nvidia255
> May 1 17:49:45 zfsbox genunix: [ID 936769 kern.info] nvidia255 is /pseudo/nvidia@255
[forget the BUI forum, e-mail works better, IMHO]

Simon Breden wrote:
> Thanks a lot Richard. To give a bit more info, I've copied my /var/adm/messages from booting up the machine:

I don't see any major issues related to this problem in the messages.

> And @picker: I guess the 35 requests are stacked up waiting for the hanging request to be serviced?

The 35 iops are queued in the device driver, waiting to be sent to the device (the wait column). The iops queued to the device itself show up in the active queue (actv).

> The question I have is where do I go from now, to get some more info on what is causing cp to have problems.

As your later mdb test confirmed, ZFS is patiently waiting on an I/O to complete. Solaris has sent that I/O to a device and hasn't heard a reply yet. By default, the sd driver will retry every 60 seconds, and will eventually fail the I/O, which bounces back up to ZFS as an I/O error. Please double-check your drive firmware.
-- richard
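The 60-second interval Richard mentions is the sd driver's per-command timeout, held in the kernel variable sd_io_time; multiplied by the retry count, it explains why a dead drive can stall a thread for several minutes before an error surfaces. If you want to confirm the value on your own box, a hedged sketch (assumes you can run mdb -k, typically as root; guarded so it is harmless elsewhere):

```shell
# Print the sd driver's per-command timeout in decimal (default is 60s).
if command -v mdb >/dev/null 2>&1; then
    echo 'sd_io_time/D' | mdb -k 2>/dev/null || echo "could not read sd_io_time (need root?)"
else
    echo "mdb not available (Solaris-specific)"
fi
echo "sd_io_time check done"
```

This is read-only; tuning the value is a separate decision and not something to do casually on a pool you care about.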
Thanks Max,

I have not been able to find any new firmware for these drives (Western Digital WD7500AAKS), so I have sent an email to Western Digital to enquire about firmware updates. I'll see what they reply with, but I'm not too hopeful.

In the meantime I decided to copy the files one at a time until the problem re-occurred. I have found the file that failed, but of course it probably won't be the same file each time it fails, so this is probably a pointless task -- I was just curious. I will embark on iteration 2 now...

Simon
OK, I tried replying by email, and got a message that a moderator will approve the message sometime... but that was a few hours ago, so I'm reverting to this forum software again :)

Here's the reply I emailed:

Hi Richard,

I ran the format command, selected the number of one of the disks in the RAIDZ vdev and then selected the 'inquiry' subcommand, and here's the output:

format> inquiry
Vendor:   ATA
Product:  WDC WD7500AAKS-0
Revision: 4G30
format>

So it looks like the drive firmware is '4G30'.

Now I will hunt around to see if there are any known problems with that firmware version. However, I already looked yesterday on the Western Digital site for these drives, model # WD7500AAKS, and I didn't see any firmware available to download...

I emailed Western Digital now to enquire about drive firmware for this drive model, so I'll see what they say...

Simon
max at bruningsystems.com
2008-May-02 17:43 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

One quick note. You don't have to cp each file one at a time to see which one it hangs on. Just run truss. It should be the last file that it opened. To see this with truss, do:

truss cp -r ...

Don't worry about all the truss output. You are probably only concerned with the end. If the file is large and does not fail for many mmap/write calls, you can just truss opens and closes:

truss -topen,close cp -r ...

max
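[Editorial aside: Max's tip can be scripted once the truss output has been saved to a file -- the last open64() in the log is the file cp was working on when it hung. A minimal sketch; the log file path and the sample log lines below are invented for illustration.]

```shell
# Extract the last file cp managed to open before hanging,
# from a saved "truss -topen,close cp -r ..." log.
# /tmp/truss.log and its contents are a made-up sample.
cat > /tmp/truss.log <<'EOF'
open64("testdir/f06", O_RDONLY)                 = 4
open64("testdir/f15", O_RDONLY)                 = 4
open64("testdir/f12", O_RDONLY)                 = 4
EOF

# The last open64() line names the suspect file.
last=$(awk -F'"' '/^open64\(/ {f=$2} END {print f}' /tmp/truss.log)
echo "cp was working on: $last"
```

On a real run you would redirect truss's stderr to the log file (truss writes its trace to stderr) and point the awk at that.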
Or work around the NCQ bug in the drive's FW by typing:

su
echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system
reboot

Rob
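[Editorial aside: a slightly more careful version of Rob's edit, written so the append is idempotent. The sketch below deliberately targets a scratch copy of the file so it is safe to try anywhere; on a real system you would target /etc/system as root and then reboot.]

```shell
# Append the NCQ queue-depth tunable only if it is not already present.
# NOTE: /tmp/etc_system_copy stands in for the real /etc/system here.
sysfile=/tmp/etc_system_copy
: > "$sysfile"

line='set sata:sata_max_queue_depth = 0x1'
grep -qxF "$line" "$sysfile" || printf '%s\n' "$line" >> "$sysfile"
grep -qxF "$line" "$sysfile" && echo "tunable present; reboot for it to take effect"
```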
Cindy.Swearingen at Sun.COM
2008-May-02 18:36 UTC
[zfs-discuss] cp -r hanged copying a directory
Simon,

I think you should review the checksum error reports from the fmdump output (dated 4/30) that you supplied previously. You can get more details by using fmdump -ev. Use "zpool status -v" to identify checksum errors as well.

Cindy
Thanks Cindy. Here's the zpool status -v output:

# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0

errors: No known data errors
#

And here's the fmdump output:

# fmdump -ev
TIME                 CLASS                     ENA
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum  0x00bcf295d8f00001
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum  0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum  0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum  0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum  0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.data      0x00bcfb9991f00401
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum  0x00bcf295d8f00001
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum  0x00bcf295d8f00001
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum  0x00bcf295d8f00001
Apr 30 01:12:14.9802 ereport.fs.zfs.data      0x00bcf295d8f00001
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum  0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum  0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum  0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum  0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.data      0x00d3cb44fe900401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum  0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum  0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum  0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum  0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.data      0x00d5510ce6700401
#

Simon
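[Editorial aside: the burst structure of those ereports is easier to see if you group them by ENA -- ereports sharing an ENA belong to the same underlying event. A small awk sketch over saved `fmdump -ev` output; the sample lines below are a subset of the output above.]

```shell
# Group fmdump ereports by ENA to show that the 4/30 checksum errors
# arrived as a few bursts, each burst sharing one ENA.
cat > /tmp/fmdump.txt <<'EOF'
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum 0x00bcf295d8f00001
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum 0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.data     0x00bcfb9991f00401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum 0x00d3cb44fe900401
EOF

# Last field on each line is the ENA; count ereports per ENA.
awk '{n[$NF]++} END {for (e in n) print e, n[e]}' /tmp/fmdump.txt | sort
```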
Cindy.Swearingen at Sun.COM
2008-May-02 20:40 UTC
[zfs-discuss] cp -r hanged copying a directory
Okay, thanks. I wanted to rule out that the checksum errors reported on 4/30 were persistent enough to be picked up by zpool status. ZFS is generally quick to identify device problems.

Since fmdump doesn't show any additional recent errors either, I think you can rule out hardware problems other than the slowness problem. You can generate more detailed fmdump output, but I don't think that is necessary unless I'm missing some detail in the story so far...

Cindy
OK then, thanks Cindy.

I have 2 current lines of investigation left (at least):

1. the assumption that the problem could relate to a drive firmware bug
2. there's a new BIOS for the motherboard available which might possibly have some effect

For the idea that it's a drive firmware bug, I'm currently trying to see how to disable NCQ...

Simon
Assuming that this problem could be related to a drive firmware bug, I have:

1. turned off NCQ -- or in fact limited the queue depth to 1
2. used truss with the cp command

I found this for NCQ: http://blogs.sun.com/erickustarz/entry/ncq_tunable

============================================
NCQ sata_max_queue_depth tunable

Previously, I did some analysis on NCQ in OpenSolaris. It turned out that to get good multi-stream sequential read performance, you had to disable NCQ via the tunable 'sata_func_enable'. Disabling NCQ actually does two things: 1) sets the number of concurrent I/Os to 1, and 2) changes what you send down protocol-wise. Turns out, the first is all we really need to get good performance for the multi-stream sequential read case, and doing the second actually exposes a bug on certain firmware of certain disks. So I highly recommend the newly added 'sata_max_queue_depth' tunable instead of 'sata_func_enable'. As a reminder, put the following in /etc/system and reboot:

set sata:sata_max_queue_depth = 0x1
============================================

Anyway, after adding the line above into /etc/system, I rebooted and then re-tried the copy with truss:

truss cp -r testdir z4

It seems to hang on random files -- so it's not always the same file that it hangs on.
On this particular run here are the last few lines of truss output, although they're probably not useful:

write(5, " r8A pF2D9B69DCF96ADD2 O".., 8388608)  = 8388608
mmap64(0xFE400000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x23000000) = 0xFE400000
write(5, "C0A6 IF5 IB9A6 7 0F794 }".., 8388608)  = 8388608
mmap64(0xFE400000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x23800000) = 0xFE400000
write(5, "8014D819B407 f ?F8F3E5FF".., 8388608)  = 8388608
mmap64(0xFE400000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x24000000) = 0xFE400000
write(5, " q01DFD0 + % y7FB8 ^ >AD".., 8388608)  = 8388608
mmap64(0xFE400000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x24800000) = 0xFE400000

Any ideas what I can do next?
max at bruningsystems.com
2008-May-03 07:17 UTC
[zfs-discuss] cp -r hanged copying a directory
Simon Breden wrote:
> Anyway, after adding the line above into /etc/system, I rebooted and then re-tried the copy with truss:
>
> truss cp -r testdir z4
>
> It seems to hang on random files -- so it's not always the same file that it hangs on.

Hi Simon,

Try with:

truss -topen cp -r testdir z4

This will only show you the files being opened. The last file opened in testdir is the one it is hanging on. (Unless it is hanging in getdents(2), but I don't think so based on the kernel stack trace.) But if it is hanging on random files, this is not going to help either.

How long do you wait before deciding it's hung? I think usually you should get console output saying I/O has been retried if the device does not respond to a previously sent I/O.

max
Thanks Max,

I have done a few tests with what you suggest and I have listed the output below. I wait a few minutes before deciding it's failed, and there is never any console output about anything failing, and nothing in any log files I've looked in: /var/adm/messages or /var/log/syslog. Maybe if I left it 2 hours I might see a message somewhere, but who knows?

This is a nasty problem, as:

1. It appears to be failing on different files, although I think I'm seeing a common pattern where it often fails on the third file.
2. It copied all the files successfully once (see the log below for run #2), but run #3 immediately afterwards failed again, so I list debug output for run #3.
3. I cannot kill the hanging cp command, and then my whole ZFS filesystem is locked up, meaning I have to reboot. Even an 'ls' command often hangs because of the hanging 'cp'.
4. I cannot use 'shutdown -y -g 0 -i 5' to shut down the machine, as it seems to be blocked by the hanging cp command.
5. The only way to shut down the machine is to hit the reset button, and I don't like doing this when there are, theoretically, write operations occurring, or at least pending.

Anyway, here is the long output.
Perhaps if people reply they can avoid keeping this text as part of their reply, or we'll be lost in a sea of dump output :)

=======================================
output: run #1 (after reboot)

bash-3.2$ truss -topen cp -r testdir z
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
open("/lib/libc.so.1", O_RDONLY)                = 3
open("/usr/lib/locale/en_GB.UTF-8/en_GB.UTF-8.so.3", O_RDONLY) = 3
open("/usr/lib/locale/common/methods_unicode.so.3", O_RDONLY)  = 3
open("/lib/libsec.so.1", O_RDONLY)              = 3
open("/lib/libcmdutils.so.1", O_RDONLY)         = 3
open("/lib/libavl.so.1", O_RDONLY)              = 3
open64("testdir/f06", O_RDONLY)                 = 4
open64("testdir/f15", O_RDONLY)                 = 4
open64("testdir/f12", O_RDONLY)                 = 4

# mdb -k
Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba s1394 nca lofs zfs random md sppp smbsrv nfs ptm crypto ipc ]
> ::pgrep cp
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R    910    909    909    869    501 0x4a004000 ffffff01d96e6e30 cp
> ffffff01d96e6e30::walk thread | ::threadlist -v
            ADDR             PROC              LWP CLS PRI            WCHAN
ffffff01d4371b60 ffffff01d96e6e30 ffffff01d9b28930   2  60 ffffff01f4aa52c0
  PC: _resume_from_idle+0xf1    CMD: cp -r testdir z
  stack pointer for thread ffffff01d4371b60: ffffff000949f260
  [ ffffff000949f260 _resume_from_idle+0xf1() ]
    swtch+0x17f()
    cv_wait+0x61()
    zio_wait+0x5f()
    dmu_buf_hold_array_by_dnode+0x214()
    dmu_read+0xd4()
    zfs_fillpage+0x15e()
    zfs_getpage+0x187()
    fop_getpage+0x9f()
    segvn_fault+0x9ef()
    as_fault+0x5ae()
    pagefault+0x95()
    trap+0x1286()
    0xfffffffffb8001d9()
    fuword8+0x21()
    zfs_write+0x147()
    fop_write+0x69()
    write+0x2af()
    write32+0x1e()
    sys_syscall32+0x101()
>

bash-3.2$ iostat -xce 1
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   33.5   5.7   431.0    32.6  0.4  0.2  14.5   6  12   0   0   0   0   4  2  0 94
sd0      1.3   4.3   111.1    45.0  0.1  0.0  17.3   0   1   0   0   0   0
sd1      2.0   4.4   210.0    45.1  0.1  0.0  14.5   0   1   0   0   0   0
sd2      1.3   4.3   111.1    45.2  0.1  0.0  17.4   0   1   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 48  0 51
sd0    325.6  56.3 36340.5   146.7 26.2  0.9  71.0  92  92   0   0   0   0
sd1    518.6  49.2 65734.5   117.1 25.9  0.9  47.0  86  85   0   0   0   0
sd2    327.6  57.3 36983.7   144.7 27.3  0.9  73.4  94  93   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   0 43  0 57
sd0    301.1   1.0 33550.0     0.0 23.6  0.8  80.8  84  84   0   0   0   0
sd1    556.2   1.0 69661.1     0.0 26.1  0.8  48.3  92  83   0   0   0   0
sd2    300.1   5.0 33229.9     4.0 23.8  0.8  80.7  84  84   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   0 54  0 46
sd0    343.1   0.0 39052.6     0.0 32.2  1.0  96.9 100  99   0   0   0   0
sd1    633.2   0.0 80986.1     0.0 33.1  1.0  53.8 100  99   0   0   0   0
sd2    366.1   0.0 41677.4     0.0 31.7  1.0  89.4 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 48  0 50
sd0    275.0   0.0 30590.9     0.0 14.7  0.7  56.2  73  73   0   0   0   0
sd1    459.0   0.0 57149.9     0.0 18.4  0.7  41.6  67  66   0   0   0   0
sd2    262.0   0.0 28031.0     0.0 10.1  0.6  40.9  63  63   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 87  0 12
sd0    151.0 388.0 18557.6 49593.7 24.4  1.0  47.1  96  96   0   0   0   0
sd1    105.0 373.0 11390.5 47737.9 24.1  1.0  52.5  95  96   0   0   0   0
sd2    101.0 391.0 11262.6 49913.6 25.1  0.9  52.9  96  94   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 18  0 81
sd0      0.0 556.1     0.0 71183.9 33.9  1.0  62.8 100  99   0   0   0   0
sd1      0.0 565.1     0.0 72336.2 33.9  1.0  61.8 100  99   0   0   0   0
sd2      0.0 534.1     0.0 68367.3 33.9  1.0  65.3 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   0 15  0 84
sd0      0.0 547.0     0.0 70016.1 33.9  1.0  63.8 100  99   0   0   0   0
sd1      0.0 563.0     0.0 72064.1 33.9  1.0  62.0 100  99   0   0   0   0
sd2      0.0 543.0     0.0 69504.0 33.9  1.0  64.3 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 15  0 85
sd0      0.0 564.0     0.0 72191.7 33.9  1.0  61.9 100  99   0   0   0   0
sd1      0.0 547.0     0.0 70015.7 33.9  1.0  63.8 100  99   0   0   0   0
sd2      0.0 560.0     0.0 71679.7 33.9  1.0  62.3 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 15  0 84
sd0      0.0 554.0     0.0 70912.1 33.9  1.0  63.0 100  99   0   0   0   0
sd1      0.0 591.0     0.0 75648.1 33.9  1.0  59.1 100  99   0   0   0   0
sd2      0.0 574.0     0.0 73472.1 33.9  1.0  60.8 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 40  0 59
sd0    415.0 153.0 53116.4 18320.2 33.9  1.0  61.5 100  99   0   0   0   0
sd1    258.0 123.0 29054.0 14480.0 33.9  1.0  91.6 100  99   0   0   0   0
sd2    247.0 161.0 28094.1 19343.2 34.0  1.0  85.7 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 46  0 52
sd0    529.1  44.0 67654.7   177.5 30.6  0.9  54.9  93  92   0   0   0   0
sd1    258.0  42.0 29186.9   110.5 25.7  0.8  88.3  86  82   0   0   0   0
sd2    266.0  44.0 30275.0   113.5 28.5  0.9  94.9  95  89   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 48  0 50
sd0    253.8  77.9 28645.9  8216.5 26.1  0.9  81.4  96  90   0   0   0   0
sd1    478.6  77.9 60041.3  8210.5 27.7  0.9  51.4  93  90   0   0   0   0
sd2    274.7  78.9 30692.1  8207.5 26.2  0.9  76.5  93  90   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1 52  0 48
sd0    310.2  29.0 35036.3   100.6 25.4  0.9  77.6  96  90   0   0   0   0
sd1    533.4  38.0 67507.7   104.1 23.9  0.9  43.3  91  87   0   0   0   0
sd2    295.2  34.0 33175.9   110.1 22.8  0.9  72.0  90  86   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   0 40  0 60
sd0    336.0   4.0 38020.1     4.0 32.1  1.0  97.3 100  99   0   0   0   0
sd1    574.1   0.0 72391.8     0.0 32.9  1.0  59.0 100  99   0   0   0   0
sd2    423.0   4.0 49925.4     4.0 29.8  1.0  72.1 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   1.0     0.0     1.5  0.0  0.0   0.2   0   0   0   0   0   0   1 40  0 60
sd0    318.0   0.0 36225.4     0.0 33.9  1.0 109.9 100  99   0   0   0   0
sd1    330.0   0.0 37569.5     0.0 33.9  1.0 105.8 100  99   0   0   0   0
sd2    584.0   0.0 74755.0     0.0 33.9  1.0  59.7 100  99   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   1.0     0.0     8.0  0.0  0.0   2.0   0   0   0   0   0   0   1 40  0 59
sd0     92.0  76.0 10115.5  9600.5 22.1  0.4 134.2  81  41   0   0   0   0
sd1     87.0  92.0  9664.5 11140.5 20.0  0.4 114.1  76  42   0   0   0   0
sd2    142.0  88.0 17731.9 11136.5 20.6  0.4  91.3  74  37   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    1.0   3.0     4.0   124.0  0.0  0.0   7.1   0   1   0   0   0   0   1 55  0 44
sd0      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd1      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd2      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   1  1  0 99
sd0      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd1      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd2      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5

=================================================================
run #2 & #3: (after reboot, leaving machine idle for 1 hour, then repeating this test)

bash-3.2$ mkdir z2
bash-3.2$ truss -topen cp -r testdir z2
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
open("/lib/libc.so.1", O_RDONLY)                = 3
open("/usr/lib/locale/en_GB.UTF-8/en_GB.UTF-8.so.3", O_RDONLY) = 3
open("/usr/lib/locale/common/methods_unicode.so.3", O_RDONLY)  = 3
open("/lib/libsec.so.1", O_RDONLY)              = 3
open("/lib/libcmdutils.so.1", O_RDONLY)         = 3
open("/lib/libavl.so.1", O_RDONLY)              = 3
open64("testdir/f06", O_RDONLY)                 = 4
open64("testdir/f15", O_RDONLY)                 = 4
open64("testdir/f12", O_RDONLY)                 = 4
open64("testdir/f01", O_RDONLY)                 = 4
open64("testdir/f05", O_RDONLY)                 = 4
open64("testdir/f16", O_RDONLY)                 = 4
open64("testdir/f11", O_RDONLY)                 = 4
open64("testdir/f02", O_RDONLY)                 = 4
open64("testdir/f08", O_RDONLY)                 = 4
open64("testdir/f13", O_RDONLY)                 = 4
open64("testdir/f14", O_RDONLY)                 = 4
open64("testdir/f07", O_RDONLY)                 = 4
open64("testdir/f09", O_RDONLY)                 = 4
open64("testdir/f03", O_RDONLY)                 = 4
open64("testdir/f10", O_RDONLY)                 = 4
open64("testdir/f04", O_RDONLY)                 = 4
bash-3.2$ mkdir z3
bash-3.2$ truss -topen cp -r testdir z3
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
open("/lib/libc.so.1", O_RDONLY)                = 3
open("/usr/lib/locale/en_GB.UTF-8/en_GB.UTF-8.so.3", O_RDONLY) = 3
open("/usr/lib/locale/common/methods_unicode.so.3", O_RDONLY)  = 3
open("/lib/libsec.so.1", O_RDONLY)              = 3
open("/lib/libcmdutils.so.1", O_RDONLY)         = 3
open("/lib/libavl.so.1", O_RDONLY)              = 3
open64("testdir/f06", O_RDONLY)                 = 4
open64("testdir/f15", O_RDONLY)                 = 4
open64("testdir/f12", O_RDONLY)                 = 4
open64("testdir/f01", O_RDONLY)                 = 4

# mdb -k
Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba s1394 fctl nca lofs zfs random md sppp smbsrv nfs ptm ipc crypto cpc fcp fcip logindmux nsctl sdbc sv ii nsmb rdc ]
> ::pgrep cp
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R    927    926    926    918    501 0x4a004000 ffffff01dea2ecb0 cp
> ffffff01dea2ecb0::walk thread | ::threadlist -v
            ADDR             PROC              LWP CLS PRI            WCHAN
ffffff01dee52ac0 ffffff01dea2ecb0 ffffff01df063300   2  59 ffffff01d413ed16
  PC: _resume_from_idle+0xf1    CMD: cp -r testdir z3
  stack pointer for thread ffffff01dee52ac0: ffffff000874bb60
  [ ffffff000874bb60 _resume_from_idle+0xf1() ]
    swtch+0x17f()
    cv_wait+0x61()
    txg_wait_open+0x7f()
    dmu_tx_wait+0xb6()
    zfs_write+0x360()
    fop_write+0x69()
    write+0x2af()
    write32+0x1e()
    sys_syscall32+0x101()
>

# iostat -xce 1
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0   14.5   1.2   132.7    10.0  0.3  0.1  19.8   2   4   0   0   0   0   1  2  0 97
sd0      8.1  12.5   938.4  1514.3  3.5  0.0 169.5  10   4   0   0   0   0
sd1     11.6  12.5  1444.1  1514.7  3.4  0.0 144.6  10   4   0   0   0   0
sd2      7.9  12.5   895.4  1514.4  3.4  0.0 170.0  10   4   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    1.0   0.0     4.2     0.0  0.0  0.0   6.2   0   1   0   0   0   0   1  1  0 98
sd0      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd1      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd2      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
                 extended device statistics       ---- errors ---        cpu
device   r/s   w/s    kr/s    kw/s wait actv svc_t  %w  %b s/w h/w trn tot  us sy wt id
cmdk0    0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   0   0   0   0  1  0 99
sd0      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd1      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd2      0.0   0.0     0.0     0.0 35.0  0.0   0.0 100   0   0   0   0   0
sd3      0.0   0.0     0.0     0.0  0.0  0.0   0.0   0   0   0   5   0   5
^C#
============================================
Well, I had some more ideas and ran some more tests:

1. cp -r testdir ~/z1
This copied the testdir directory from the ZFS pool into my home directory on the IDE boot drive, which is not part of the ZFS pool, and it worked.

2. cp -r ~/z1 .
This copied the files back from my home directory on the IDE boot disk into the ZFS pool. This worked.

3. cp -r z1 z2
This copied the files from the ZFS pool to another directory in the ZFS pool, and it did not work -- it hanged again, but differently this time. It copied a couple of files, then hanged. The mouse wouldn't move and the keyboard was inactive; I hit loads of keys, including ALT-TAB, and finally the mouse was back, the copy continued for 2 or 3 more files and then hanged again, this time on a different file from the other runs.

So from these tests, it appears that copying the test directory out of the ZFS pool succeeds, copying it in from outside the pool succeeds, but reading and writing the files entirely within the pool fails. My gut instinct is that reading and writing purely within the pool puts more I/O demand on the disks in the ZFS pool, but this may be completely wrong.
The plot thickens. I replaced 'cp' with 'rsync' and it worked -- I ran it a few times and it hasn't hung so far.

So on the face of it, it appears that 'cp' is doing something that causes my system to hang if the files are read from and written to the same pool, but simply replacing 'cp' with 'rsync' works. Hmmm... anyone have a clue about what I can do next to home in on the problem with 'cp'?

Here is the output using 'rsync':

bash-3.2$ truss -topen rsync -a z1 z2
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
open("/lib/libsocket.so.1", O_RDONLY)           = 3
open("/lib/libnsl.so.1", O_RDONLY)              = 3
open("/lib/libc.so.1", O_RDONLY)                = 3
open("/usr/lib/locale/en_GB.UTF-8/en_GB.UTF-8.so.3", O_RDONLY) = 3
open("/usr/lib/locale/common/methods_unicode.so.3", O_RDONLY)  = 3
open64("/etc/popt", O_RDONLY)                   Err#2 ENOENT
open64("/export/home/simon/.popt", O_RDONLY)    Err#2 ENOENT
open("/usr/lib/iconv/UTF-8%UTF-8.so", O_RDONLY) = 3
open64("/var/run/name_service_door", O_RDONLY)  = 3
open64("z1/testdir/f01", O_RDONLY)              = 5
open64("z1/testdir/f02", O_RDONLY)              = 5
open64("z1/testdir/f03", O_RDONLY)              = 5
open64("z1/testdir/f04", O_RDONLY)              = 5
open64("z1/testdir/f05", O_RDONLY)              = 5
open64("z1/testdir/f06", O_RDONLY)              = 5
open64("z1/testdir/f07", O_RDONLY)              = 5
open64("z1/testdir/f08", O_RDONLY)              = 5
open64("z1/testdir/f09", O_RDONLY)              = 5
open64("z1/testdir/f10", O_RDONLY)              = 5
open64("z1/testdir/f11", O_RDONLY)              = 5
open64("z1/testdir/f12", O_RDONLY)              = 5
open64("z1/testdir/f13", O_RDONLY)              = 5
open64("z1/testdir/f14", O_RDONLY)              = 5
open64("z1/testdir/f15", O_RDONLY)              = 5
open64("z1/testdir/f16", O_RDONLY)              = 5
    Received signal #18, SIGCLD, in pollsys() [caught]
      siginfo: SIGCLD CLD_EXITED pid=910 status=0x0000
bash-3.2$
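[Editorial aside: another way to test whether cp's I/O pattern is the trigger, without pulling in rsync, is a plain read()/write() copy, which is what dd does. A sketch with scratch paths -- the real test would point src at a file inside the pool and dst at another directory in the same pool.]

```shell
# Copy a file with plain read()/write() instead of cp's mmap path.
# If this never hangs while cp does, the mmap code path is implicated.
src=/tmp/cptest_src     # stand-in for a file in the pool
dst=/tmp/cptest_dst     # stand-in for the destination in the same pool

# Make a small scratch source file for the sketch.
dd if=/dev/urandom of="$src" bs=1024 count=64 2>/dev/null

# The actual read/write copy.
dd if="$src" of="$dst" bs=128k 2>/dev/null

# Verify the copy is byte-identical.
cmp -s "$src" "$dst" && echo "copy OK"
```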
max at bruningsystems.com
2008-May-03 15:07 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

Simon Breden wrote:
> The plot thickens. I replaced 'cp' with 'rsync' and it worked -- I ran it a few times and it didn't hang so far.

The rsync command and cp command work very differently. cp mmaps up to 8MB of the input file and writes from the returned address of mmap, faulting in the pages as it writes (unless you are a normal user on Indiana, in which case cp is GNU cp, which reads/writes -- so, why are there 2 versions?). Rsync forks and sets up a socketpair between parent and child processes, then reads/writes. It should be much slower than cp, and put much less stress on the disk.

It would be great to have a way to reproduce this. I have not had any problems. How large is the directory you are copying?

Either the disk has not sent a response to an I/O operation, or the response was somehow lost. If I could reproduce the problem, I might try to dtrace the commands being sent to the HBA and responses coming back... Hopefully someone here who has experience with the disks you are using will be able to help.

max
No amount of playing with cp will fix a drive FW issue, but as pointed out, the slower rsync will tax the FW less. Looking at:

device   r/s   w/s   kr/s   kw/s wait actv svc_t  %w  %b s/w h/w trn tot
sd0      0.0   0.0    0.0    0.0 35.0  0.0   0.0 100   0   0   0   0   0

it seems the disk still has requests queued but not active, so the

echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system

didn't work.. (perhaps running bits older than snv_74?)
http://bugs.opensolaris.org/view_bug.do?bug_id=6589306

So let's try the older:

echo "set sata:sata_func_enable = 0x7" >> /etc/system

but of course fixing the drive FW is the answer.

ref: http://mail.opensolaris.org/pipermail/storage-discuss/2008-January/004428.html

Rob
Oops, I lied... according to myself
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-January/045141.html

"wait" requests are queued in Solaris, and active > 1 are in the drive's NCQ. So the question is: where are the drive's commands getting dropped across 3 disks at the same time? And in all cases it's not a zfs issue, but a disk, controller, or storage-discuss at opensolaris.org issue.

Rob
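[Editorial aside: the signature Rob is reading -- requests sitting in the wait queue while nothing is active and the disk is 0% busy -- can be picked out of `iostat -x` output mechanically. A sketch over two sample device rows taken from the runs earlier in the thread; the field positions assume plain whitespace-separated `iostat -x`-style rows, so treat the column indices as an assumption to check against your own output.]

```shell
# Flag devices whose wait queue is non-empty while nothing is active
# and the device is 0% busy: the stuck-I/O signature from the runs above.
cat > /tmp/iostat.txt <<'EOF'
sd0 0.0 0.0 0.0 0.0 35.0 0.0 0.0 100 0 0 0 0 0
sd1 2.0 4.4 210.0 45.1 0.1 0.0 14.5 0 1 0 0 0 0
EOF

# Fields assumed: 1=device 2=r/s 3=w/s 4=kr/s 5=kw/s 6=wait 7=actv
#                 8=svc_t 9=%w 10=%b ...
awk '$6 > 0 && $7 == 0 && $10 == 0 {print $1, "stuck:", $6, "queued"}' /tmp/iostat.txt
```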
Thanks Max, and the fact that rsync stresses the system less would help explain why rsync works and cp hangs. The directory was around 11GB in size.

If Sun engineers are interested in this problem then I'm happy to run whatever commands they give me -- after all, I have a pure goldmine here for them to debug ;-) And it *is* running on a ZFS filesystem. Opportunities like this don't come along every day :) Tempted? :)

Well, if I can't tempt Sun, then for anyone who has the same disks, I would be interested to see what happens on your machine:

Model Number: WD7500AAKS-00RBA0
Firmware revision: 4G30

I use three of these disks in a RAIDZ1 vdev within the pool.
max at bruningsystems.com
2008-May-03 17:39 UTC
[zfs-discuss] cp -r hanged copying a directory
Hi Simon,

Simon Breden wrote:
> Thanks Max, and the fact that rsync stresses the system less would help explain why rsync works, and cp hangs. The directory was around 11GB in size.
>
> If Sun engineers are interested in this problem then I'm happy to run whatever commands they give me -- after all, I have a pure goldmine here for them to debug ;-) And it *is* running on a ZFS filesystem. Opportunities like this don't come along every day :) Tempted? :)
>
> Well, if I can't tempt Sun, then for anyone who has the same disks, I would be interested to see what happens on your machine:
> Model Number: WD7500AAKS-00RBA0
> Firmware revision: 4G30
>
> I use three of these disks in a RAIDZ1 vdev within the pool.

I think Rob Logan is probably correct, and there is a problem with the disks, not ZFS. Have you tried this with a different file system (ufs), or multiple dd commands running at the same time against the raw disks?

max
@Max: I've not tried this with other file systems, nor with multiple dd commands at the same time on raw disks. I suppose this is not possible to do with my disks, which are currently part of this RAIDZ1 vdev in the pool, without corrupting data? I'll assume not.

@Rob: OK, let's assume that, like you say, it's not a ZFS issue, but in fact a drive, firmware etc. issue. That said, where should I create a new thread -- in storage-discuss? I will refer to these 2 threads here for all the gory details ;-)

If this can be proven to be a disk problem then I want to return them under warranty and get some different ones. Normally these disks have absolutely excellent user feedback on newegg.com, so I'm quite surprised if the disks are ALL bad.

I wonder if, in fact, there could be some issue between the motherboard's BIOS and the drives, or is this not possible? The motherboard is an "Asus M2N-SLI Deluxe" and it uses an NVidia controller (MCP55?) which is part of the NVidia 570 SLI chipset -- again, not exactly an exotic, unused chipset.

If it's possible for the BIOS to affect the disk interface then I have seen that a new BIOS is available, which I could try.

Also I could update to snv_87, which is the latest build, although I first saw this problem with that build (87), so the OS upgrade might not help.
I have similar, but not exactly the same drives:

  format> inq
  Vendor:   ATA
  Product:  WDC WD7500AYYS-0
  Revision: 4G30

Same firmware revision. I have no problems with drive performance, although I use them under UFS and as backing stores for iscsi disks.

FYI, I had random lockups and crashes on my Tyan MB with the MCP55 chipset. I bought Supermicro AOC-SAT2-MV8s and moved all my disks to them. Haven't had a problem since.

http://de.opensolaris.org/jive/thread.jspa?messageID=204736

-- Dave

On 05/03/2008 01:44 PM, Simon Breden wrote:
> @Max: I've not tried this with other file systems, and not with multiple dd commands at the same time with raw disks. I suppose this is not possible to do with my disks which are currently part of this RAIDZ1 vdev in the pool without corrupting data? I'll assume not.
>
> @Rob: OK, let's assume that, like you say, it's not a ZFS issue, but in fact a drive, firmware etc issue. That said, where should I create a new thread -- in storage-discuss? I will refer to these 2 threads here for all the gory details ;-)
>
> If this can be proven to be a disk problem then I want to return them under warranty and get some different ones. Normally these disks have absolutely excellent user feedback on newegg.com, so I'm quite surprised if the disks are ALL bad.
>
> I wonder if, in fact, there could be some issue between the motherboard's BIOS and the drives, or is this not possible? The motherboard is an "Asus M2N-SLI Deluxe" and it uses an NVidia controller (MCP55?) which is part of the NVidia 570 SLI chipset, again not exactly an exotic, unused chipset.
>
> If it's possible for the BIOS to affect the disk interface then I have seen that a new BIOS is available, which I could try.
>
> Also I could update to snv_87, which is the latest build, although I first saw this problem with that build (87), so the OS upgrade might not help.
Wow, thanks Dave. Looks like you've had this hell too :)

So that makes me happy that the disks and pool are probably OK, but it does seem an issue with the NVidia MCP55 chipset, or at least perhaps the nv_sata driver. From reading the bug list below, it seems the problem might be a more general disk driver problem, perhaps not just limited to the nv_sata driver.

I looked at the post you listed, and followed a chain of bug reports:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6658565 (Accepted)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6642154 (dup of 6662881)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6662881 (fixed in snv_87)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6669134 (fixed in snv_90)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6662400 (dup of 6669134)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6671523 (Accepted)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6672202 (Need more info)

Maybe I'll try snv_87 again, or just wait until snv_90 is released. My workaround until this issue is fixed is to use 'rsync' for intra-pool copies, as it seems to stress the system less and thus prevents the ZFS file system lockup.

Thanks to everyone who helped. I might post a link to this thread in the storage-discuss group to see if I can get any further help, if anyone there knows any more details on this.

That Supermicro AOC-SAT2-MV8 card looks good, but I would prefer not to have to buy new hardware to fix what should hopefully turn out to be a driver problem.
I have moved this saga to storage-discuss now, as this doesn't appear to be a ZFS issue, and it can be found here:

http://www.opensolaris.org/jive/thread.jspa?threadID=59201