thr3ads.net - CentOS - [CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort! [Feb 2015]

Jason Pyeron

2015-Feb-18 02:34 UTC

[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!

> -----Original Message-----
> From: Chris Murphy
> Sent: Tuesday, February 17, 2015 20:48
> 
> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
> >> I'd post the entire dmesg somewhere
> >
> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
> 
> At least part of the problem happens before this log starts.
Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
> 
> >> What do you get for
> >> smartctl -x <dev>
> >
> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
> 
> OK no smart extended test has been done, but also no pending bad or
> relocated sectors, and no phy event errors either. So the write (10)
> error seems isolated but it's still really suspicious, so I'd start
> replacing hardware.
A Dell tech is en route with a new system board and disk controller.
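
The "no pending bad or relocated sectors" check quoted above can be scripted. What follows is only a minimal sketch, assuming an ATA drive whose attribute table looks like the one `smartctl -A` prints; SAS drives report this information differently. The sample rows are illustrative and are not taken from the linked smartctl log, and the helper names `raw_counts` and `looks_clean` are hypothetical.

```python
# Illustrative sample only, NOT the poster's drive: two rows in the layout
# that `smartctl -A` prints for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

# Attributes that correspond to "relocated" and "pending" sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def raw_counts(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            counts[fields[1]] = int(fields[9])
    return counts


def looks_clean(counts):
    """True only if at least one watched attribute was found and all are zero."""
    return bool(counts) and all(value == 0 for value in counts.values())


print(raw_counts(SAMPLE), looks_clean(raw_counts(SAMPLE)))
```

A clean result here says nothing about the controller, cabling, or backplane, which is why the write (10) error still points at other hardware.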
> 
> 
> > I have replaced the drive (and reinstalled) already, the
> > panics still happen once every 30-40 hours.
> 
> The only thing that suggests it might not be hardware are all the kvm
> related messages in the kp.
How so? Every result I find says these messages can be ignored.
> So if you've changed kernels, or VM
> configuration recently, then I'd revert.
No changes from install out of the box.
> That's the limit of the most
> likely software explanation. If there's no recent software changes,
> then it must be hardware.
> 
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.

Chris Murphy

2015-Feb-18 04:38 UTC

On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron <jpyeron at pdinc.us> wrote:
>> -----Original Message-----
>> From: Chris Murphy
>> Sent: Tuesday, February 17, 2015 20:48
>>
>> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
>> >> I'd post the entire dmesg somewhere
>> >
>> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
>>
>> At least part of the problem happens before this log starts.
>
> Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
> Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
> Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
> Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
> Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
> Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
> Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
Doesn't seem related.

>
>>
>> >> What do you get for
>> >> smartctl -x <dev>
>> >
>> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
>>
>> OK no smart extended test has been done, but also no pending bad or
>> relocated sectors, and no phy event errors either. So the write (10)
>> error seems isolated but it's still really suspicious, so I'd start
>> replacing hardware.
>
> Dell tech is enroute. New system board and disk controller.
I'm curious what they replace.
>
>>
>>
>> > I have replaced the drive (and reinstalled) already, the
>> > panics still happen once every 30-40 hours.
>>
>> The only thing that suggests it might not be hardware are all the kvm
>> related messages in the kp.
>
> How so, each of the results I find say these are to be ignored.
Well, I found two older kernel bugs similar to this. In one, the
problem stopped happening when running kvm with 1 vCPU; in the other,
it stopped when the VM was rebuilt 32-bit instead of 64-bit. But my
ability to read kernel call traces is very limited, so I really don't
know what's going on.

If it's a kernel bug, though, you could maybe clobber it with a
substantially newer kernel. You might check out the elrepo kernels.
2.6.32 is really old. Granted, the CentOS one you're running has a
huge pile of backports that makes it less "ancient" from a stability
perspective, but anything really new that's hard to backport likely
isn't in that kernel. While you're waiting for Dell you could try
either:

kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

What's running in the VM?

-- 
Chris Murphy

Jason Pyeron

2015-Feb-18 05:02 UTC

> -----Original Message-----
> From: Chris Murphy
> Sent: Tuesday, February 17, 2015 23:38
> 
> On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron wrote:
> >> -----Original Message-----
> >> From: Chris Murphy
> >> Sent: Tuesday, February 17, 2015 20:48
> >>
> >> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
> >> >> I'd post the entire dmesg somewhere
> >> >
> >> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
> >>
> >> At least part of the problem happens before this log starts.
> >
<snip/>
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> > Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
> 
> Doesn't seem related.
> 
> 
> >
> >>
> >> >> What do you get for
> >> >> smartctl -x <dev>
> >> >
> >> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
> >>
> >> OK no smart extended test has been done, but also no pending bad or
> >> relocated sectors, and no phy event errors either. So the write (10)
> >> error seems isolated but it's still really suspicious, so I'd start
> >> replacing hardware.
> >
> > Dell tech is enroute. New system board and disk controller.
> 
> I'm curious what they replace.
Both, but the backplane is not on the replacement list.
> 
> >
> >>
> >>
> >> > I have replaced the drive (and reinstalled) already, the
> >> > panics still happen once every 30-40 hours.
> >>
> >> The only thing that suggests it might not be hardware are all the kvm
> >> related messages in the kp.
> >
> > How so, each of the results I find say these are to be ignored.
> 
> Well I found two older kernel bugs similar to this that suggested the
> problem stopped happening when running kvm with 1vcpu, and in another
> case when the VM was rebuilt 32-bit instead of 64-bit. But my ability
> to read kernel call traces is very limited, I really don't know what's
> going on.
> 
We have about 20 identical systems doing the same work: PE2970s running
RHEL6/CentOS6 and libvirtd.
> If it's a kernel bug though, you could maybe clobber it with a
> substantially newer kernel. You might check out elrepo kernels. 2.6.32
> is really old, granted the centos one you're running has a huge pile
> of backports that makes it less "ancient" from a stability
We should start looking at CentOS7/RHEL7 (ugh, systemd), but these machines are
ancient too.
> perspective, but anything really new that's hard to backport likely
> isn't in that kernel. While you're waiting for Dell you could try
> either:
> 
> kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
> kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm
Unlikely, since I do not have a test plan. If I could reproduce the error on
demand then it would be a valid experiment. Some of the systems run RHEL6,
which is under support, while the others run CentOS6. The configs are
kept as close as possible to each other.
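
Without on-demand reproduction, one modest thing that can still be measured is the gap between occurrences, so that a kernel or hardware change can at least be judged against the 30-40 hour baseline. This is a rough sketch only: the marker string comes from the thread subject, the sample lines are hypothetical (they are not from the linked panic log), and the function names are made up for illustration.

```python
from datetime import datetime

# Marker taken from the thread subject.
MARKER = "mptscsih: ioc0: attempting task abort!"


def event_times(lines, year):
    """Yield a datetime for each syslog line that contains MARKER."""
    for line in lines:
        if MARKER in line:
            # Classic syslog timestamps ("Feb 16 04:30:30") omit the year,
            # so the caller has to supply it.
            stamp = " ".join(line.split()[:3])
            yield datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_in_hours(times):
    """Return the hours elapsed between consecutive events."""
    ordered = sorted(times)
    return [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "Feb 14 10:00:00 host kernel: mptscsih: ioc0: attempting task abort!",
    "Feb 15 22:30:00 host kernel: mptscsih: ioc0: attempting task abort!",
    "Feb 15 23:41:19 host dhclient[1272]: DHCPACK from 192.168.5.58",
]

print(gaps_in_hours(event_times(SAMPLE, 2015)))
```

With so few events per week, a change would need to run for several multiples of the usual gap before a quiet period means anything.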

Besides, I am migrating to another host right now.
> 
> What's running in the VM?
Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This system
was handling most of the CipherShed.org Jenkins CI farm. The resources
are oversubscribed by about 15x, but the system runs below 0.10 at any random
time.
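
For readers who want to put numbers on "oversubscribed", here is a small sketch of the arithmetic. It assumes the 15x refers to vCPUs allocated versus physical cores and that 0.10 is a load average; neither is stated above. The guest and core counts are hypothetical, chosen only so the ratio comes out to 15.

```python
def oversubscription(vcpus_per_guest, host_cores):
    """Ratio of vCPUs allocated to guests over physical host cores."""
    return sum(vcpus_per_guest) / host_cores


def load_per_core(load_average, host_cores):
    """Load average normalised by the number of physical cores."""
    return load_average / host_cores


# Hypothetical figures: 30 guests with 4 vCPUs each on an 8-core host.
guests = [4] * 30
ratio = oversubscription(guests, host_cores=8)

print(ratio, load_per_core(0.10, host_cores=8))
```

A high allocation ratio with a very low load is consistent with mostly idle guests, so oversubscription alone would not obviously explain the panics.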

Thanks for the thoughts on this.

-Jason

