Jason Pyeron
2015-Feb-08 03:53 UTC
[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!
NOTE: this is happening on CentOS 6 x86_64, 2.6.32-504.3.3.el6.x86_64, not CentOS 5.

Dell PowerEdge 2970, Seagate SATA drive, non-RAID.

I have this server which has been dying randomly, with no logs. I had a tail -f running over ssh for a week when this just happened:

Feb 8 00:10:21 thirteen-230 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880057a0a080)
Feb 8 00:10:21 thirteen-230 kernel: sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 1a 17 a1 6f 00 00 01 00
Feb 8 00:10:51 thirteen-230 kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Feb 8 00:10:51 thirteen-230 kernel: mptbase: ioc0: Initiating recovery
Feb 8 00:11:13 thirteen-230 kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880057a0a080)
Write failed: Connection reset by peer

After reading https://access.redhat.com/solutions/108273, I am increasing the logging (shown below), but I am not confident about this wait-and-see approach.

sysctl -w dev.scsi.logging_level=98367

I am also going to check smartctl output once I get onsite to power cycle the system.

Other posts I have read, but cannot act on yet:

* http://unix.stackexchange.com/questions/34173/mptscsih-ioc0-task-abort-success-rv-2002-causes-30-seconds-freezing
* https://bugzilla.kernel.org/show_bug.cgi?id=18652
* https://bugzilla.redhat.com/show_bug.cgi?id=483424
* https://bugzilla.kernel.org/show_bug.cgi?id=42765
* http://sourceforge.net/p/smartmontools/mailman/message/23849184/
* http://kb.softescu.ro/category/hardware/dell/

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
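For readers following along, the Write(10) CDB in the log above can be decoded to see which block the aborted command was addressing. The following is a minimal sketch, assuming the standard 10-byte WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8); the function name `decode_write10` is made up for illustration and is not part of any tool mentioned in the thread.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("WRITE(10) CDB must be exactly 10 bytes")
    if cdb[0] != 0x2A:
        raise ValueError("opcode 0x%02x is not WRITE(10)" % cdb[0])
    return {
        # Bytes 2-5 hold the logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8 hold the transfer length in blocks, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDB copied from the "[sda] CDB: Write(10)" log line.
info = decode_write10("2a 00 1a 17 a1 6f 00 00 01 00")
print("LBA %d, %d block(s)" % (info["lba"], info["blocks"]))
```

For the logged CDB this gives LBA 437756271 and a transfer length of one block, so the command that timed out was a single-block write rather than a large transfer.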
Jason Pyeron
2015-Feb-08 04:59 UTC
[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!
> -----Original Message-----
> From: Jason Pyeron
> Sent: Saturday, February 07, 2015 22:54
>
> NOTE: this is happening on Centos 6 x86_64,
> 2.6.32-504.3.3.el6.x86_64 not Centos 5
>
> Dell PowerEdge 2970, Seagate SATA drive, non-raid.
>
> I have this server which has been dying randomly, with no logs.

Here is a console picture.
http://i.imgur.com/ZYHlB82.jpg

> I am also going to check smartctl output once I get onsite to
> power cycle the system.

# smartctl -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-504.3.3.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST1500DM003-9YN16G
Serial Number:    W24153R0
LU WWN Device Id: 5 000c50 05d03cc1d
Firmware Version: CC82
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Feb 7 23:41:00 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 600) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 194) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       181943016
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       39599363
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       821
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       17
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   062   045    Old_age   Always       -       33 (Min/Max 30/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4551
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       267112606073648
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2764453802303
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3442873711291

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

-Jason
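The very large RAW_VALUE numbers in the smartctl output above can look alarming at first glance. One common community reading for Seagate drives, which is an assumption here and not something stated in this thread or confirmed by the vendor, is that these raw values are packed 48-bit fields: an upper 16-bit part and a lower 32-bit part (for the error-rate attributes, an error count over an operation count). The sketch below only splits the numbers under that assumption; `split_raw48` is a hypothetical helper name.

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into (upper 16 bits, lower 32 bits).

    ASSUMPTION: the packed layout is a community convention for Seagate
    drives, not something documented in the smartctl output itself.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    return (raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF


# Raw values copied from the attribute table in the smartctl output.
RAW_VALUES = {
    "Raw_Read_Error_Rate": 181943016,
    "Seek_Error_Rate": 39599363,
    "Head_Flying_Hours": 267112606073648,
}

for name, raw in RAW_VALUES.items():
    high, low = split_raw48(raw)
    print("%-22s high16=%d low32=%d" % (name, high, low))
```

Under that assumption both error-rate attributes carry zero in the upper field, and the lower field of Head_Flying_Hours comes out as 816, which is close to the Power_On_Hours value of 821 reported by the same drive.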
Jason Pyeron
2015-Feb-16 17:02 UTC
[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!
> -----Original Message-----
> From: Jason Pyeron
> Sent: Sunday, February 08, 2015 0:00
>
> > -----Original Message-----
> > From: Jason Pyeron
> > Sent: Saturday, February 07, 2015 22:54
> >
> > NOTE: this is happening on Centos 6 x86_64,
> > 2.6.32-504.3.3.el6.x86_64 not Centos 5
> >
> > Dell PowerEdge 2970, Seagate SATA drive, non-raid.
> >
> > I have this server which has been dying randomly, with no logs.
>
> Here is a console picture.
> http://i.imgur.com/ZYHlB82.jpg

Thanks to netconsole, I have the panic to post:

Feb 16 06:06:56 BUG: soft lockup - CPU#0 stuck for 67s! [ksmd:88]
Feb 16 06:06:56 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
Feb 16 06:06:56 CPU 0
Feb 16 06:06:56 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
Feb 16 06:06:56 Pid: 88, comm: ksmd Not tainted 2.6.32-504.8.1.el6.centos.plus.x86_64 #1 Dell Inc. PowerEdge 2970/0JKN8W
Feb 16 06:06:56 RIP: 0010:[<ffffffff812a1411>]  [<ffffffff812a1411>] __bitmap_empty+0x41/0x90
Feb 16 06:06:56 RSP: 0018:ffff88021831dcb0  EFLAGS: 00000202
Feb 16 06:06:56 RAX: 0000000000000000 RBX: ffff88021831dcb0 RCX: 0000000000000010
Feb 16 06:06:56 RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffff81e2f198
Feb 16 06:06:56 RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
Feb 16 06:06:56 R10: ffffea0006679c20 R11: 0000000000000000 R12: 0000000000000000
Feb 16 06:06:56 R13: ffff8801c1b8f650 R14: 0000000198152467 R15: ffffffffa03af44a
Feb 16 06:06:56 FS:  00007fc4756b09a0(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Feb 16 06:06:56 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb 16 06:06:56 CR2: 000000c641faeff0 CR3: 0000000001a85000 CR4: 00000000000007f0
Feb 16 06:06:56 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 16 06:06:56 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 16 06:06:56 Process ksmd (pid: 88, threadinfo ffff88021831c000, task ffff880218310040)
Feb 16 06:06:56 Stack:
Feb 16 06:06:56  ffff88021831dd00 ffffffff81052268 00007f30249b8000 ffffffff81e2f180
Feb 16 06:06:56  8000000198152025 ffff880219ade700 00007f30249b8000 ffff880219ade9c8
Feb 16 06:06:56  ffffea0006679c20 ffff880219e57ed0 ffff88021831dd30 ffffffff810522e6
Feb 16 06:06:56 Call Trace:
Feb 16 06:06:56  [<ffffffff81052268>] ? flush_tlb_others_ipi+0x128/0x130
Feb 16 06:06:56  [<ffffffff810522e6>] ? native_flush_tlb_others+0x76/0x90
Feb 16 06:06:56  [<ffffffff8105240e>] ? flush_tlb_page+0x5e/0xb0
Feb 16 06:06:56  [<ffffffff811721c2>] ? try_to_merge_with_ksm_page+0x532/0x660
Feb 16 06:06:56  [<ffffffff811731a4>] ? ksm_scan_thread+0xeb4/0x1120
Feb 16 06:06:56  [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Feb 16 06:06:56  [<ffffffff811722f0>] ? ksm_scan_thread+0x0/0x1120
Feb 16 06:06:56  [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
Feb 16 06:06:56  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
Feb 16 06:06:56  [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Feb 16 06:06:56  [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 16 06:06:56 Code: c0 7e 24 48 83 3f 00 48 89 f8 74 13 eb 5c 0f 1f 40 00 48 8b 48 08 48 83 c0 08 48 85 c9 75 4b 83 c2 01 41 39 d0 7f eb 40 f6 c6 3f <b8> 01 00 00 00 75 08 c9 c3 66 0f 1f 44 00 00 89 f0 48 63 d2 c1
Feb 16 06:06:56 Call Trace:
Feb 16 06:06:56  [<ffffffff81052268>] ? flush_tlb_others_ipi+0x128/0x130
Feb 16 06:06:56  [<ffffffff810522e6>] ? native_flush_tlb_others+0x76/0x90
Feb 16 06:06:56  [<ffffffff8105240e>] ? flush_tlb_page+0x5e/0xb0
Feb 16 06:06:56  [<ffffffff811721c2>] ? try_to_merge_with_ksm_page+0x532/0x660
Feb 16 06:06:56  [<ffffffff811731a4>] ? ksm_scan_thread+0xeb4/0x1120
Feb 16 06:06:56  [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Feb 16 06:06:56  [<ffffffff811722f0>] ? ksm_scan_thread+0x0/0x1120
Feb 16 06:06:56  [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
Feb 16 06:06:56  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
Feb 16 06:06:56  [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Feb 16 06:06:56  [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 16 06:07:01 Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
Feb 16 06:07:01 Pid: 1950, comm: qemu-kvm Not tainted 2.6.32-504.8.1.el6.centos.plus.x86_64 #1
Feb 16 06:07:01 Call Trace:
Feb 16 06:07:01  <NMI>
Feb 16 06:07:01  [<ffffffff81530bdc>] ? panic+0xa7/0x16f
Feb 16 06:07:01  [<ffffffff81014959>] ? sched_clock+0x9/0x10
Feb 16 06:07:01  [<ffffffff810ea65d>] ? watchdog_overflow_callback+0xcd/0xd0
Feb 16 06:07:01  [<ffffffff81120e07>] ? __perf_event_overflow+0xa7/0x240
Feb 16 06:07:01  [<ffffffff81119e14>] ? perf_event_update_userpage+0x24/0x110
Feb 16 06:07:01  [<ffffffff81121454>] ? perf_event_overflow+0x14/0x20
Feb 16 06:07:01  [<ffffffff8101e3fb>] ? x86_pmu_handle_irq+0x1eb/0x250
Feb 16 06:07:01  [<ffffffff81535ed9>] ? perf_event_nmi_handler+0x39/0xb0
Feb 16 06:07:01  [<ffffffff81537995>] ? notifier_call_chain+0x55/0x80
Feb 16 06:07:01  [<ffffffff815379fa>] ? atomic_notifier_call_chain+0x1a/0x20
Feb 16 06:07:01  [<ffffffff810a4ede>] ? notify_die+0x2e/0x30
Feb 16 06:07:01  [<ffffffff8153565b>] ? do_nmi+0x1bb/0x340
Feb 16 06:07:01  [<ffffffff81534f20>] ? nmi+0x20/0x30
Feb 16 06:07:01  [<ffffffff8153478e>] ? _spin_lock+0x1e/0x30
Feb 16 06:07:01  <<EOE>>
Feb 16 06:07:01  [<ffffffff8114fdd3>] ? handle_pte_fault+0x833/0xb00
Feb 16 06:07:01  [<ffffffffa03987da>] ? kvm_ioapic_update_eoi+0x8a/0xf0 [kvm]
Feb 16 06:07:01  [<ffffffff811502ca>] ? handle_mm_fault+0x22a/0x300
Feb 16 06:07:01  [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
Feb 16 06:07:01  [<ffffffff8105d7d1>] ? update_curr+0xe1/0x1f0
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bc0e>] ? invalidate_interrupt0+0xe/0x20
Feb 16 06:07:01  [<ffffffff81060c0c>] ? finish_task_switch+0x4c/0xf0
Feb 16 06:07:01  [<ffffffff815378de>] ? do_page_fault+0x3e/0xa0
Feb 16 06:07:01  [<ffffffff81534c95>] ? page_fault+0x25/0x30
Feb 16 06:07:01  [<ffffffff8129e862>] ? copy_user_generic_string+0x32/0x40
Feb 16 06:07:01  [<ffffffffa03926ab>] ? kvm_write_guest_cached+0x7b/0xa0 [kvm]
Feb 16 06:07:01  [<ffffffffa03bf61f>] ? kvm_lapic_sync_to_vapic+0xcf/0x220 [kvm]
Feb 16 06:07:01  [<ffffffffa03bdfb8>] ? kvm_apic_has_interrupt+0x48/0xd0 [kvm]
Feb 16 06:07:01  [<ffffffffa03ac24d>] ? kvm_arch_vcpu_ioctl_run+0x93d/0x1010 [kvm]
Feb 16 06:07:01  [<ffffffff810b2b73>] ? futex_wake+0x93/0x150
Feb 16 06:07:01  [<ffffffffa0392b04>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
Feb 16 06:07:01  [<ffffffff811a3e92>] ? vfs_ioctl+0x22/0xa0
Feb 16 06:07:01  [<ffffffff811a435a>] ? do_vfs_ioctl+0x3aa/0x580
Feb 16 06:07:01  [<ffffffff811a45b1>] ? sys_ioctl+0x81/0xa0
Feb 16 06:07:01  [<ffffffff810e5afe>] ? __audit_syscall_exit+0x25e/0x290
Feb 16 06:07:01  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Feb 16 06:07:01 drm_kms_helper: panic occurred, switching back to text console
Feb 16 06:07:01 BUG: scheduling while atomic: qemu-kvm/1950/0x14010000
Feb 16 06:07:01 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
Feb 16 06:07:01 Pid: 1950, comm: qemu-kvm Not tainted 2.6.32-504.8.1.el6.centos.plus.x86_64 #1
Feb 16 06:07:01 Call Trace:
Feb 16 06:07:01  <NMI>
Feb 16 06:07:01  [<ffffffff81060bb6>] ? __schedule_bug+0x66/0x70
Feb 16 06:07:01  [<ffffffff8153193c>] ? thread_return+0x6ac/0x7d0
Feb 16 06:07:01  [<ffffffffa002e35d>] ? write_msg+0xfd/0x110 [netconsole]
Feb 16 06:07:01  [<ffffffffa00b2d0e>] ? drm_crtc_helper_set_config+0x1be/0xa60 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffff8106c85a>] ? __cond_resched+0x2a/0x40
Feb 16 06:07:01  [<ffffffff81531d30>] ? _cond_resched+0x30/0x40
Feb 16 06:07:01  [<ffffffff81174e18>] ? __kmalloc+0x138/0x230
Feb 16 06:07:01  [<ffffffff810ba332>] ? __module_text_address+0x12/0x60
Feb 16 06:07:01  [<ffffffffa00b2d0e>] ? drm_crtc_helper_set_config+0x1be/0xa60 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffffa013df27>] ? r100_mm_wreg+0x67/0x90 [radeon]
Feb 16 06:07:01  [<ffffffffa01332d2>] ? radeon_crtc_cursor_set+0x92/0x6e0 [radeon]
Feb 16 06:07:01  [<ffffffffa005e40c>] ? drm_mode_set_config_internal+0x5c/0xe0 [drm]
Feb 16 06:07:01  [<ffffffffa00b0653>] ? drm_fb_helper_restore_fbdev_mode+0xb3/0xe0 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffffa00b0788>] ? drm_fb_helper_panic+0x78/0xa0 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffff81537995>] ? notifier_call_chain+0x55/0x80
Feb 16 06:07:01  [<ffffffff815379fa>] ? atomic_notifier_call_chain+0x1a/0x20
Feb 16 06:07:01  [<ffffffff81530c07>] ? panic+0xd2/0x16f
Feb 16 06:07:01  [<ffffffff81014959>] ? sched_clock+0x9/0x10
Feb 16 06:07:01  [<ffffffff810ea65d>] ? watchdog_overflow_callback+0xcd/0xd0
Feb 16 06:07:01  [<ffffffff81120e07>] ? __perf_event_overflow+0xa7/0x240
Feb 16 06:07:01  [<ffffffff81119e14>] ? perf_event_update_userpage+0x24/0x110
Feb 16 06:07:01  [<ffffffff81121454>] ? perf_event_overflow+0x14/0x20
Feb 16 06:07:01  [<ffffffff8101e3fb>] ? x86_pmu_handle_irq+0x1eb/0x250
Feb 16 06:07:01  [<ffffffff81535ed9>] ? perf_event_nmi_handler+0x39/0xb0
Feb 16 06:07:01  [<ffffffff81537995>] ? notifier_call_chain+0x55/0x80
Feb 16 06:07:01  [<ffffffff815379fa>] ? atomic_notifier_call_chain+0x1a/0x20
Feb 16 06:07:01  [<ffffffff810a4ede>] ? notify_die+0x2e/0x30
Feb 16 06:07:01  [<ffffffff8153565b>] ? do_nmi+0x1bb/0x340
Feb 16 06:07:01  [<ffffffff81534f20>] ? nmi+0x20/0x30
Feb 16 06:07:01  [<ffffffff8153478e>] ? _spin_lock+0x1e/0x30
Feb 16 06:07:01  <<EOE>>
Feb 16 06:07:01  [<ffffffff8114fdd3>] ? handle_pte_fault+0x833/0xb00
Feb 16 06:07:01  [<ffffffffa03987da>] ? kvm_ioapic_update_eoi+0x8a/0xf0 [kvm]
Feb 16 06:07:01  [<ffffffff811502ca>] ? handle_mm_fault+0x22a/0x300
Feb 16 06:07:01  [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
Feb 16 06:07:01  [<ffffffff8105d7d1>] ? update_curr+0xe1/0x1f0
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bc0e>] ? invalidate_interrupt0+0xe/0x20
Feb 16 06:07:01  [<ffffffff81060c0c>] ? finish_task_switch+0x4c/0xf0
Feb 16 06:07:01  [<ffffffff815378de>] ? do_page_fault+0x3e/0xa0
Feb 16 06:07:01  [<ffffffff81534c95>] ? page_fault+0x25/0x30
Feb 16 06:07:01  [<ffffffff8129e862>] ? copy_user_generic_string+0x32/0x40
Feb 16 06:07:01  [<ffffffffa03926ab>] ? kvm_write_guest_cached+0x7b/0xa0 [kvm]
Feb 16 06:07:01  [<ffffffffa03bf61f>] ? kvm_lapic_sync_to_vapic+0xcf/0x220 [kvm]
Feb 16 06:07:01  [<ffffffffa03bdfb8>] ? kvm_apic_has_interrupt+0x48/0xd0 [kvm]
Feb 16 06:07:01  [<ffffffffa03ac24d>] ? kvm_arch_vcpu_ioctl_run+0x93d/0x1010 [kvm]
Feb 16 06:07:01  [<ffffffff810b2b73>] ? futex_wake+0x93/0x150
Feb 16 06:07:01  [<ffffffffa0392b04>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
Feb 16 06:07:01  [<ffffffff811a3e92>] ? vfs_ioctl+0x22/0xa0
Feb 16 06:07:01  [<ffffffff811a435a>] ? do_vfs_ioctl+0x3aa/0x580
Feb 16 06:07:01  [<ffffffff811a45b1>] ? sys_ioctl+0x81/0xa0
Feb 16 06:07:01  [<ffffffff810e5afe>] ? __audit_syscall_exit+0x25e/0x290
Feb 16 06:07:01  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Feb 16 06:07:01 Clocksource tsc unstable (delta = -77309385171 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.

-Jason
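A netconsole capture like the one in this message is mostly module lists and stack frames, so it can help to pull out just the lockup headlines. The sketch below is only an illustration, not a tool used in the thread: the two regular expressions are written against the exact message wording seen in this capture ("BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]" and "Watchdog detected hard LOCKUP on cpu N"), and the function name `summarize_lockups` is hypothetical. Other kernel versions may word these messages differently.

```python
import re

# Patterns match the wording seen in this capture; this is an assumption
# about the message format, not a stable kernel interface.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
HARD_LOCKUP = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")


def summarize_lockups(text):
    """Return the soft and hard lockup events found in a console capture."""
    soft = [
        {
            "cpu": int(m.group(1)),
            "seconds": int(m.group(2)),
            "comm": m.group(3),
            "pid": int(m.group(4)),
        }
        for m in SOFT_LOCKUP.finditer(text)
    ]
    hard_cpus = [int(m.group(1)) for m in HARD_LOCKUP.finditer(text)]
    return {"soft": soft, "hard_cpus": hard_cpus}


# Two headline lines copied from the capture.
sample = (
    "Feb 16 06:06:56 BUG: soft lockup - CPU#0 stuck for 67s! [ksmd:88]\n"
    "Feb 16 06:07:01 Kernel panic - not syncing: "
    "Watchdog detected hard LOCKUP on cpu 1\n"
)
summary = summarize_lockups(sample)
print(summary)
```

Because the patterns are searched anywhere in the text, the same function works whether the capture has one message per line or has been run together into long lines.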