Jason Pyeron
2015-Feb-18 02:34 UTC
[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!
> -----Original Message-----
> From: Chris Murphy
> Sent: Tuesday, February 17, 2015 20:48
>
> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
> >> I'd post the entire dmesg somewhere
> >
> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
>
> At least part of the problem happens before this log starts.

Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.

> >> What do you get for
> >> smartctl -x <dev>
> >
> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
>
> OK, no SMART extended test has been done, but also no pending bad or
> relocated sectors, and no phy event errors either. So the write (10)
> error seems isolated but it's still really suspicious, so I'd start
> replacing hardware.

Dell tech is en route. New system board and disk controller.

> > I have replaced the drive (and reinstalled) already, the
> > panics still happen once every 30-40 hours.
>
> The only thing that suggests it might not be hardware are all the kvm
> related messages in the kp.

How so? Each of the results I find says these are to be ignored.

> So if you've changed kernels, or VM
> configuration recently, then I'd revert. That's the limit of the most

No changes from install out of the box.

> likely software explanation. If there's no recent software changes,
> then it must be hardware.
>

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
Chris Murphy
2015-Feb-18 04:38 UTC
[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!
On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron <jpyeron at pdinc.us> wrote:
>> -----Original Message-----
>> From: Chris Murphy
>> Sent: Tuesday, February 17, 2015 20:48
>>
>> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
>> >> I'd post the entire dmesg somewhere
>> >
>> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
>>
>> At least part of the problem happens before this log starts.
>
> Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
> Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
> Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
> Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
> Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
> Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
> Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.

Doesn't seem related.

>> >> What do you get for
>> >> smartctl -x <dev>
>> >
>> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
>>
>> OK, no SMART extended test has been done, but also no pending bad or
>> relocated sectors, and no phy event errors either. So the write (10)
>> error seems isolated but it's still really suspicious, so I'd start
>> replacing hardware.
>
> Dell tech is en route. New system board and disk controller.

I'm curious what they replace.

>> > I have replaced the drive (and reinstalled) already, the
>> > panics still happen once every 30-40 hours.
>>
>> The only thing that suggests it might not be hardware are all the kvm
>> related messages in the kp.
>
> How so? Each of the results I find says these are to be ignored.

Well, I found two older kernel bugs similar to this that suggested the
problem stopped happening when running kvm with 1 vCPU, and in another
case when the VM was rebuilt 32-bit instead of 64-bit. But my ability
to read kernel call traces is very limited; I really don't know what's
going on.

If it's a kernel bug though, you could maybe clobber it with a
substantially newer kernel. You might check out the elrepo kernels.
2.6.32 is really old. Granted, the CentOS one you're running has a huge
pile of backports that makes it less "ancient" from a stability
perspective, but anything really new that's hard to backport likely
isn't in that kernel. While you're waiting for Dell you could try
either:

kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

What's running in the VM?

--
Chris Murphy
Jason Pyeron
2015-Feb-18 05:02 UTC
[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!
> -----Original Message-----
> From: Chris Murphy
> Sent: Tuesday, February 17, 2015 23:38
>
> On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron wrote:
> >> -----Original Message-----
> >> From: Chris Murphy
> >> Sent: Tuesday, February 17, 2015 20:48
> >>
> >> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
> >> >> I'd post the entire dmesg somewhere
> >> >
> >> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
> >>
> >> At least part of the problem happens before this log starts.
> >

<snip/>

> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> > Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
>
> Doesn't seem related.
>
> >> >> What do you get for
> >> >> smartctl -x <dev>
> >> >
> >> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
> >>
> >> OK, no SMART extended test has been done, but also no pending bad or
> >> relocated sectors, and no phy event errors either. So the write (10)
> >> error seems isolated but it's still really suspicious, so I'd start
> >> replacing hardware.
> >
> > Dell tech is en route. New system board and disk controller.
>
> I'm curious what they replace.

Both, but the backplane is not on the replacement list.

> >> > I have replaced the drive (and reinstalled) already, the
> >> > panics still happen once every 30-40 hours.
> >>
> >> The only thing that suggests it might not be hardware are all the kvm
> >> related messages in the kp.
> >
> > How so? Each of the results I find says these are to be ignored.
>
> Well, I found two older kernel bugs similar to this that suggested the
> problem stopped happening when running kvm with 1 vCPU, and in another
> case when the VM was rebuilt 32-bit instead of 64-bit. But my ability
> to read kernel call traces is very limited; I really don't know what's
> going on.
>

I can say we have about 20 identical systems doing the same work:
PE2970s running RHEL6/CentOS6 and libvirtd.

> If it's a kernel bug though, you could maybe clobber it with a
> substantially newer kernel. You might check out the elrepo kernels.
> 2.6.32 is really old. Granted, the CentOS one you're running has a huge
> pile of backports that makes it less "ancient" from a stability

We should start looking at CentOS7/RHEL7 (ugh, systemd...). But these
machines are ancient too.

> perspective, but anything really new that's hard to backport likely
> isn't in that kernel. While you're waiting for Dell you could try
> either:
>
> kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
> kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

Unlikely, since I do not have a test plan. If I could reproduce the
error on demand then it would be a valid experiment. Some of the
systems are running RHEL6, which is under support, while the others are
CentOS6. The configs are kept as close as possible to each other.
Besides, I am doing the migration to another host right now.

>
> What's running in the VM?

Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This
system was handling most of the CipherShed.org Jenkins CI farm. I can
say the resources are oversubscribed by 15x, but the system runs at
below 0.10 at any random time.

Thanks for the thoughts on this.

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
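
Editor's aside, not part of the original messages: the thread mentions panics "once every 30-40 hours" but no way to reproduce them on demand. One rough way to pin down the interval from the logs is sketched below. It assumes classic syslog lines shaped like the excerpts quoted above ("Feb 16 02:46:09 thirteen-230 kernel: ..."), an English locale for month names, and that the text from the subject line is the marker of interest. The function names and the `year` parameter are made up for illustration; syslog lines carry no year, so one has to be supplied.

```python
import re
from datetime import datetime

# Assumption: syslog lines look like
#   "Feb 16 02:46:09 thirteen-230 kernel: <message>"
# as in the excerpts quoted in this thread (no year field).
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: (?P<msg>.*)$"
)

# Marker text taken from the thread subject; change it to match your logs.
MARKER = "mptscsih: ioc0: attempting task abort!"


def abort_times(lines, year=2015, marker=MARKER):
    """Return timestamps of kernel log lines whose message contains marker."""
    times = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and marker in match.group("msg"):
            # The year is supplied by the caller because syslog omits it.
            stamp = f"{year} {match.group('ts')}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_hours(times):
    """Return the gaps, in hours, between consecutive timestamps."""
    ordered = sorted(times)
    return [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]
```

To use it, something like `gaps_in_hours(abort_times(open("/var/log/messages")))` would list the hours between successive matches, assuming the log does not span a year boundary. Comparing those gaps across the roughly 20 identical hosts might show whether the interval is specific to this machine.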